CN109215647A - Voice awakening method, electronic equipment and non-transient computer readable storage medium - Google Patents


Info

Publication number
CN109215647A
Authority
CN
China
Prior art keywords
recognition model
audio features
audio
voice
voice recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811004154.2A
Other languages
Chinese (zh)
Inventor
李深
胡亚光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chumen Wenwen Information Technology Co Ltd
Original Assignee
Chumen Wenwen Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chumen Wenwen Information Technology Co Ltd
Priority to CN201811004154.2A
Publication of CN109215647A
Legal status: Pending

Abstract

The embodiment of the invention provides a voice wake-up method, an electronic device and a non-transitory computer readable storage medium, applied to the technical field of voice recognition. The method comprises: sequentially inputting audio features extracted from a voice signal into a first voice recognition model; when it is determined that the confidence of the audio features reaches a first confidence threshold but does not reach a first wake-up threshold, sequentially inputting second audio features into the first voice recognition model, and sequentially inputting the determined first audio features into a second voice recognition model; and when a first preset condition is met, determining to execute a wake-up operation. The first preset condition includes: the confidence being detected to reach the first wake-up threshold while the first voice recognition model detects the second audio features, and/or the confidence being detected to reach the second wake-up threshold while the second voice recognition model detects the first audio features. The embodiment of the invention thereby realizes voice wake-up.

Description

Voice wake-up method, electronic device and non-transitory computer readable storage medium
Technical Field
The embodiment of the invention relates to the technical field of voice recognition, in particular to a voice awakening method, electronic equipment and a non-transitory computer readable storage medium.
Background
With the development of information technology, voice recognition technology has also developed, and products using voice recognition, such as conversation assistants, smart robots, smart watches, and the like, are increasing. These products all enhance the user experience and improve the level of natural human-computer interaction through speech recognition.
In speech recognition, one very important technique is keyword detection, which may also be generally referred to as voice wake-up, where a user typically activates a device using a specific voice wake-up word for subsequent voice interaction.
Therefore, how to detect whether the user voice contains a specific voice wake-up word to perform voice wake-up becomes a key issue.
Disclosure of Invention
The embodiment of the invention provides a voice awakening method, electronic equipment and a non-transitory computer readable storage medium, which can solve the problem of how to perform voice awakening according to user voice. The technical scheme is as follows:
in a first aspect, a voice wake-up method is provided, where the method includes:
sequentially inputting audio features extracted from the voice signals into the first voice recognition model;
when the confidence degree of the audio features reaches a first confidence degree threshold value and does not reach a first awakening threshold value, sequentially inputting second audio features to the first voice recognition model, and sequentially inputting first audio features to the second voice recognition model, wherein the first audio features are the audio features of a first preset frame number just before the first voice recognition model detects the first confidence degree threshold value, the first confidence degree threshold value is the minimum value of the confidence degrees of the audio features which need to be input to the second voice recognition model for voice recognition, and the second audio features are the audio features of a second preset frame number after the first audio features;
when a first preset condition is met, determining to execute a wakeup operation;
the first preset condition includes at least one of:
when the first voice recognition model detects the second audio frequency characteristic, detecting that the confidence coefficient of the audio frequency characteristic reaches a first awakening threshold value;
and when the second voice recognition model detects the first audio features, detecting that the confidence coefficient of the audio features reaches a second awakening threshold value.
In a second aspect, a voice wake-up apparatus is provided, the apparatus comprising:
the first input module is used for sequentially inputting the audio features extracted from the voice signals into the first voice recognition model;
the second input module is used for sequentially inputting second audio features to the first voice recognition model and sequentially inputting first audio features to the second voice recognition model when the confidence coefficient of the audio features reaches a first confidence coefficient threshold and does not reach a first awakening threshold, wherein the first audio features are the audio features of a first preset frame number just before the first confidence coefficient threshold is detected by the first voice recognition model, the first confidence coefficient threshold is the minimum value of the confidence coefficient of the audio features which need to be input to the second voice recognition model for voice recognition, and the second audio features are the audio features of a second preset frame number after the first audio features;
the second determining module is used for determining to execute the awakening operation when the first preset condition is met;
the first preset condition includes at least one of:
when the first voice recognition model detects the second audio frequency characteristic, detecting that the confidence coefficient of the audio frequency characteristic reaches a first awakening threshold value;
and when the second voice recognition model detects the first audio features, detecting that the confidence coefficient of the audio features reaches a second awakening threshold value.
In a third aspect, an electronic device is provided, which includes:
at least one processor;
and at least one memory and a bus connected with the processor; wherein,
the processor and the memory complete mutual communication through the bus;
the processor is configured to call the program instructions in the memory to perform the voice wake-up method shown in the first aspect.
In a fourth aspect, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of voice wake-up of the first aspect.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
the embodiment of the invention provides a voice awakening method, an electronic device and a non-transitory computer readable storage medium, wherein audio features extracted from a voice signal are sequentially input into a first voice recognition model, when the confidence coefficient of the audio features reaches a first confidence coefficient threshold and does not reach the first awakening threshold, second audio features are sequentially input into the first voice recognition model, first audio features are sequentially input into a second voice recognition model, the first audio features are audio features of a first preset frame number just before the first confidence coefficient threshold is detected by the first voice recognition model, the first confidence coefficient threshold is the minimum value of the confidence coefficient of the audio features needing to be input into the second voice recognition model for voice recognition, the second audio features are audio features of a second preset frame number after the first audio features, and when a first preset condition is met, determining to execute a wake-up operation, wherein the first preset condition comprises: when the first voice recognition model detects the second audio frequency characteristic, the confidence coefficient is detected to reach a first awakening threshold value, and/or when the second voice recognition model detects the first audio frequency characteristic, the confidence coefficient is detected to reach a second awakening threshold value. In other words, in the embodiment of the present invention, the confidence of the audio feature may be detected through the first speech recognition model and/or the second speech recognition model to determine whether the user speech includes the preset wake-up word, and further determine whether to execute the wake-up operation, so that whether the user speech includes the specific speech wake-up word may be detected to perform the speech wake-up.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly described below.
Fig. 1 is a schematic flowchart of a voice wake-up method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a voice wake-up apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of another voice wake-up apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device awakened by voice according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the embodiments of the present invention, and are not to be construed as limiting the invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
There are two voice wake-up modes in the prior art: a first mode and a second mode.
The first mode uses single-step verification to determine whether to perform the voice wake-up operation. Specifically, the input audio signal is processed in a "streaming" manner, whether the audio signal input by the user contains the preset wake-up word is judged in real time, and once the confidence of the audio signal is detected to reach the wake-up threshold, the wake-up operation is determined to be executed.
The second mode uses two-step verification to determine whether to execute the voice wake-up operation. Specifically, the input audio signal is processed in a streaming manner with real-time calculation and judgment, and the audio signal of a preset time period is cached; when the confidence reaches a preset threshold, the cached audio signal of the preset time period is verified with a second algorithm or a second model to detect whether it contains the specific voice wake-up word, and when the confidence verified by the second algorithm or model reaches the wake-up threshold, the wake-up operation is determined to be executed.
In order to ensure the real-time performance of voice wake-up and reduce the computing power consumption of the device, the first voice wake-up mode in the prior art detects whether the voice signal contains a specific voice wake-up word only through the algorithm and model used in single-step verification before determining to execute the wake-up operation; as a result, the accuracy of voice wake-up is low, the false wake-up rate is high, and it is difficult to strike a balance between improving the wake-up rate and reducing the false wake-up rate.
In the second voice wake-up mode in the prior art, two-step verification is adopted in order to improve the accuracy of voice wake-up, but the second verification step usually uses an algorithm or model of higher complexity and larger calculation amount. After the user finishes speaking the wake-up word and the first verification step passes, the second verification step still has to wait for the verification model to finish before it can be determined whether to execute the voice wake-up operation; a relatively long period therefore elapses between the user finishing the wake-up word and the wake-up decision, so the voice wake-up delay is large and the user experience is poor.
The voice wake-up method, the electronic device and the non-transitory computer readable storage medium provided by the embodiments of the present invention are provided to solve the above technical problems in the prior art.
The following describes in detail the technical solutions of the embodiments of the present invention and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
Example one
An embodiment of the present invention provides a voice wake-up method, which is applied to an electronic device with a microphone, and as shown in fig. 1, the method includes:
step S101, sequentially inputting the audio features extracted from the voice signals into the first voice recognition model.
For the embodiment of the invention, the electronic device acquires sound signals in the environment in real time through the microphone, performs analog-to-digital conversion, noise reduction and other processing on the acquired sound signals, encodes the processed sound signals according to a specific format, and inputs the encoded audio data into the voice wake-up software module. In the embodiment of the invention, the voice wake-up software module divides the streaming audio data input in real time into frames at a certain time interval and extracts the corresponding audio features for each frame.
For example, the specific format may be 16-bit samples at a 16 kHz sampling rate; the certain time interval may be 10 ms, 15 ms or 20 ms.
For embodiments of the present invention, the audio features extracted from the audio data may include at least one of: Mel-Frequency Cepstral Coefficients (MFCC) characteristic information; mel-scale Filter Bank (F-Bank) feature information; constant Q-value cepstrum coefficient (CQCC) characteristic information; perceptual linear prediction coefficient (PLP) feature information; linear Prediction Cepstrum Coefficient (LPCC) characteristic information.
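As an illustrative sketch only (not part of the patent text), the frame splitting and feature extraction described above might look as follows in Python; the 16 kHz sample rate, the 10 ms frame interval, the 13 MFCC dimensions and the use of librosa are assumptions for illustration.

```python
# Illustrative sketch: frame a 16 kHz mono stream at a 10 ms interval and
# extract MFCC features per frame. Sample rate, hop size, MFCC dimension and
# the use of librosa are assumptions, not details taken from the patent.
import numpy as np
import librosa

SAMPLE_RATE = 16000                          # assumed 16 kHz, 16-bit mono input
HOP_LENGTH = SAMPLE_RATE * 10 // 1000        # assumed 10 ms frame interval

def extract_audio_features(audio: np.ndarray) -> np.ndarray:
    """Return one 13-dimensional MFCC vector per frame (shape: [n_frames, 13])."""
    mfcc = librosa.feature.mfcc(
        y=audio.astype(np.float32),
        sr=SAMPLE_RATE,
        n_mfcc=13,
        n_fft=400,                           # 25 ms analysis window
        hop_length=HOP_LENGTH,
    )
    return mfcc.T                            # frames along the first axis
```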
For the embodiment of the invention, the audio features extracted from the speech signal are sequentially input into the first speech recognition model.
And S102, when the confidence coefficient of the audio features reaches the first confidence coefficient threshold value and does not reach the first awakening threshold value, sequentially inputting the second audio features to the first voice recognition model, and sequentially inputting the first audio features to the second voice recognition model.
The first audio characteristic is the audio characteristic of a first preset frame number just before the first speech recognition model detects the first confidence threshold value, the first confidence threshold value is the minimum value of the confidence of the audio characteristic which needs to be input to the second speech recognition model for speech recognition, and the second audio characteristic is the audio characteristic of a second preset frame number after the first audio characteristic.
Step S102 may include step S1021 (not shown) and step S1022 (not shown), wherein,
step S1021, when the confidence coefficient of the audio features is determined to reach the first confidence coefficient threshold value and not reach the first awakening threshold value, determining that the first audio features to be input into the second voice recognition model are the audio features of a first preset frame number just before the first confidence coefficient threshold value is detected.
The first confidence threshold is the minimum value of the confidence of the audio features which need to be input into the second speech recognition model for speech recognition.
For the embodiment of the invention, when the first speech recognition model detects the confidence of the audio features and the detected confidence reaches the first confidence threshold but does not reach the first wake-up threshold, the audio features of the first preset frame number just before the first confidence threshold is reached need to be detected again by the second speech recognition model to determine whether to execute the voice wake-up operation; therefore, the audio features of the first preset frame number just before the first confidence threshold is reached are used as the first audio features to be input into the second speech recognition model.
For the embodiment of the present invention, the first preset frame number may be set by a user, may be set by an electronic device, or may be set by an electronic device manufacturer. The present invention is not limited to the embodiments.
For the embodiment of the present invention, the first preset frame number may be set based on the byte length of the preset wakeup word.
For example, the first preset number of frames may be 150 frames, 120 frames, or 180 frames.
For the embodiment of the present invention, if the number of frames of audio features available right before the first confidence threshold is reached is smaller than the first preset number of frames, all the audio features available right before that point are used as the first audio features to be input into the second speech recognition model.
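A minimal sketch, under an assumed buffer length of 150 frames and hypothetical names, of how the first audio features could be selected from a rolling buffer once the first confidence threshold is reached:

```python
# Illustrative sketch: keep a rolling buffer of recent feature frames; once the
# first model's confidence reaches the first confidence threshold without
# reaching the first wake-up threshold, hand the buffered frames to the second
# model. FIRST_PRESET_FRAMES = 150 and all names here are hypothetical.
from collections import deque

FIRST_PRESET_FRAMES = 150
frame_buffer = deque(maxlen=FIRST_PRESET_FRAMES)

def on_new_frame(feature, confidence, first_conf_threshold, first_wake_threshold):
    frame_buffer.append(feature)
    if first_conf_threshold <= confidence < first_wake_threshold:
        # If fewer than FIRST_PRESET_FRAMES frames are available, all buffered
        # frames are used, mirroring the fallback described above.
        return list(frame_buffer)   # first audio features for the second model
    return None
```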
Step S1022, sequentially inputting the second audio features to the first speech recognition model, and sequentially inputting the determined first audio features to be input into the second speech recognition model to the second speech recognition model.
And the second audio characteristic is the audio characteristic which is behind the first audio characteristic by a second preset frame number.
For the embodiment of the invention, the second audio features are sequentially input into the first voice recognition model, and the determined first audio features to be input into the second voice recognition model are sequentially input into the second voice recognition model; that is, the first voice recognition model and the second voice recognition model perform confidence detection on the audio features simultaneously, instead of sequentially inputting all the voice features into the second voice recognition model for secondary verification only after all of them have been detected by the first voice recognition model, so that the voice wake-up delay can be reduced and the user experience further improved.
For the embodiment of the present invention, the second preset frame number may be set by a user, may be set by an electronic device, or may be set by an electronic device manufacturer. The present invention is not limited to the embodiments.
For the embodiment of the present invention, the second preset frame number may be set based on the byte length of the preset wakeup word.
For example, the second preset number of frames may be 50 frames, 80 frames, or 20 frames.
For the embodiment of the present invention, both the first speech recognition model and the second speech recognition model can be neural networks.
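The patent only states that both models may be neural networks; the following PyTorch sketch merely illustrates one plausible pairing, a lightweight streaming model and a heavier verification model, with layer sizes and structures that are assumptions:

```python
# Illustrative sketch: a small per-frame first model and a heavier sequence-level
# second model. All layer sizes and structures are assumptions for illustration.
import torch
import torch.nn as nn

class FirstRecognitionModel(nn.Module):
    """Lightweight per-frame wake-word confidence estimator."""
    def __init__(self, feat_dim: int = 13):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),      # confidence in [0, 1]
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.net(frame)

class SecondRecognitionModel(nn.Module):
    """Heavier verifier run over the buffered first audio features."""
    def __init__(self, feat_dim: int = 14, hidden: int = 128):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: [batch, n_frames, feat_dim]
        _, h = self.gru(frames)
        return self.head(h[-1])
```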
And step S103, when a first preset condition is met, determining to execute a wakeup operation.
Wherein the first preset condition comprises at least one of the following:
when the first voice recognition model detects the second audio frequency characteristic, detecting that the confidence coefficient reaches a first awakening threshold value;
and when the second voice recognition model detects the first audio features, detecting that the confidence coefficient reaches a second awakening threshold value.
For the embodiment of the invention, when the second audio frequency characteristic is detected by the first voice recognition model and the confidence coefficient reaches the first awakening threshold value, the awakening operation is determined to be executed, and if the first audio frequency characteristic is detected by the second voice recognition model at the moment, the second voice recognition model is instructed to stop detecting the first audio frequency characteristic; and when the first audio features are detected through the second voice recognition model and the confidence coefficient reaches a second awakening threshold value, determining to execute awakening operation, and if the first voice recognition model detects the second audio features at the moment, indicating the first voice recognition model to stop detecting the second audio features.
For the embodiment of the invention, when the first speech recognition model finishes its detection without the confidence being detected to reach the wake-up threshold, and at this moment the second speech recognition model is still detecting the first audio features and has not yet detected the wake-up threshold, the second audio features are sent to the second speech recognition model; if the confidence detected by the second speech recognition model reaches the wake-up threshold while it is still detecting the first audio features, the sent second audio features are not detected any more.
For the embodiment of the present invention, the first wake-up threshold is the minimum confidence of the audio features at which the first voice recognition model, while detecting the audio features, determines to execute the voice wake-up operation, and the second wake-up threshold is the minimum confidence of the audio features at which the second voice recognition model, while detecting the audio features, determines that the voice wake-up operation can be executed.
For example, the first wake-up threshold may be 0.9, the second wake-up threshold may be 0.8, or both the first and second wake-up thresholds may be 0.8.
For the embodiment of the present invention, the first wake-up threshold and the second wake-up threshold may be the same or different. The present invention is not limited to the embodiments.
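The decision rule of step S103 can be sketched as below; the example thresholds 0.9 and 0.8 follow the values given above, while the stop callbacks and function names are assumptions about how an implementation might be organised:

```python
# Illustrative sketch of the first preset condition: wake up as soon as either
# model reaches its own wake-up threshold, and stop the other model's detection.
FIRST_WAKE_THRESHOLD = 0.9    # example value from the text
SECOND_WAKE_THRESHOLD = 0.8   # example value from the text

def check_first_preset_condition(first_model_conf, second_model_conf,
                                 stop_first_model, stop_second_model):
    """Confidences are the latest values from each model, or None if that model is idle."""
    if first_model_conf is not None and first_model_conf >= FIRST_WAKE_THRESHOLD:
        stop_second_model()   # second model no longer needs to verify
        return True           # determine to execute the wake-up operation
    if second_model_conf is not None and second_model_conf >= SECOND_WAKE_THRESHOLD:
        stop_first_model()    # first model no longer needs to detect
        return True
    return False
```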
The embodiment of the invention provides a voice awakening method, which comprises the steps of sequentially inputting audio features extracted from voice signals into a first voice recognition model, sequentially inputting second audio features into the first voice recognition model when the confidence coefficient of the audio features reaches a first confidence coefficient threshold and does not reach the first awakening threshold, sequentially inputting first audio features into a second voice recognition model, wherein the first audio features are the audio features of a first preset frame number just before the first confidence coefficient threshold is detected by the first voice recognition model, the first confidence coefficient threshold is the minimum value of the confidence coefficient of the audio features needing to be input into the second voice recognition model for voice recognition, the second audio features are the audio features of a second preset frame number after the first audio features, and determining to execute awakening operation when a first preset condition is met, wherein, the first preset condition comprises: when the first voice recognition model detects the second audio frequency characteristic, the confidence coefficient is detected to reach a first awakening threshold value, and/or when the second voice recognition model detects the first audio frequency characteristic, the confidence coefficient is detected to reach a second awakening threshold value. In other words, in the embodiment of the present invention, the confidence of the audio feature may be detected through the first speech recognition model and/or the second speech recognition model to determine whether the user speech includes the preset wake-up word, and further determine whether to execute the wake-up operation, so that whether the user speech includes the specific speech wake-up word may be detected to perform the speech wake-up.
Example two
Another possible implementation manner of the embodiment of the present invention further includes, on the basis of the operation shown in the first embodiment, the operation shown in the second embodiment, wherein,
step S102 further includes a step Sa (not shown) and a step Sb (not shown), wherein,
and step Sa, if a second preset condition is met, sequentially inputting the second audio features into a second voice recognition model.
Wherein the second preset condition comprises: when the first voice recognition model detects the second audio frequency characteristic, the confidence coefficient is not detected to reach the first awakening threshold value, and when the second voice recognition model detects the first audio frequency characteristic, the confidence coefficient is detected to not reach the second awakening threshold value.
And Sb, if the confidence coefficient reaches a second awakening threshold value when the second voice recognition model detects the second audio frequency characteristic, determining to execute voice awakening operation.
For the embodiment of the invention, when the first voice recognition model does not detect the confidence reaching the first wake-up threshold while detecting the second audio features, and the second voice recognition model does not detect the confidence reaching the second wake-up threshold while detecting the first audio features, the second audio features are sequentially input into the second voice recognition model; when the second voice recognition model detects that the confidence of the second audio features reaches the second wake-up threshold, the voice wake-up operation can be executed. That is, whether to execute the wake-up operation is determined only after the audio features have been verified by both the first voice recognition model and the second voice recognition model, so the accuracy of voice wake-up can be improved and the user experience improved.
For the embodiment of the invention, if the third preset condition is met, the second audio features are sent to the second voice recognition model. Then, when the second voice recognition model has finished detecting the first audio features without the confidence of the audio features reaching the second wake-up threshold, the second audio features are sequentially input to the second voice recognition model to verify the corresponding confidence; when the second voice recognition model detects the second audio features and the confidence of the detected audio features reaches the second wake-up threshold, the wake-up operation is determined to be executed, and when the confidence of the audio features does not reach the second wake-up threshold, it is determined not to execute the voice wake-up operation. Further, when the confidence of the detected audio features reaches the second wake-up threshold while the second speech recognition model is still detecting the first audio features, the voice wake-up operation is determined to be executed and the second speech recognition model does not need to detect the second audio features.
Wherein the third preset condition comprises: when the first voice recognition model detects the second audio frequency characteristic, the confidence coefficient is not detected to reach the first awakening threshold value, and at the moment, the second voice recognition model does not detect that the confidence coefficient of the audio frequency characteristic reaches the second awakening threshold value.
For the embodiment of the invention, if the confidence detected by the second voice recognition model reaches the second wake-up threshold while it is detecting the first audio features, the second voice recognition model does not calculate the second audio features even though they have already been sent to it, so the computing pressure of the electronic device can be reduced, the voice wake-up delay is reduced, and the user experience can be further improved.
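A minimal sketch of the fallback verification in this embodiment, assuming a hypothetical detect method on the second model and the example threshold of 0.8:

```python
# Illustrative sketch of embodiment two: if the first model finished the second
# audio features below the first wake-up threshold and the second model finished
# the first audio features below the second wake-up threshold, feed the second
# audio features to the second model; skip them if it has already woken up.
def fallback_verification(first_model_done_below_threshold: bool,
                          second_model_conf_on_first: float,
                          second_audio_features,
                          second_model,
                          second_wake_threshold: float = 0.8) -> bool:
    if second_model_conf_on_first >= second_wake_threshold:
        # The second model already decided to wake up; the already-sent second
        # audio features are simply not computed.
        return True
    if not first_model_done_below_threshold:
        return False
    # Second preset condition met: verify the second audio features as well.
    conf = second_model.detect(second_audio_features)   # hypothetical API
    return conf >= second_wake_threshold
```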
Example three
Another possible implementation manner of the embodiment of the present invention further includes, on the basis of the operation shown in the first embodiment or the second embodiment, the operation shown in the third embodiment, wherein,
step S101 is followed by step Sc (not shown), wherein,
and step Sc, if the confidence coefficient of the detected audio frequency characteristic reaches a first awakening threshold value when the first voice recognition model detects the first audio frequency characteristic, determining to execute awakening operation.
For the embodiment of the invention, the audio features extracted from the user voice are sequentially input into the first voice recognition model for voice recognition, and if the first voice recognition model detects the first audio features and detects that the confidence coefficient of the audio features reaches the first awakening threshold value, the awakening operation is directly executed without starting the second voice recognition model.
For the embodiment of the invention, the audio features are sequentially input into the first voice recognition model, and if the first voice recognition model detects the first audio features and detects that the confidence coefficient of the audio features reaches the first awakening threshold value, the awakening operation is directly executed without verification through the second voice recognition model, so that the voice awakening delay can be reduced, the computing pressure of the electronic equipment is reduced, and the user experience can be improved.
Example four
Another possible implementation manner of the embodiment of the present invention further includes the operation shown in the fourth embodiment on the basis of the operations shown in the first to third embodiments, wherein,
the method further comprises a step Sd (not shown), wherein,
and Sd, if the confidence coefficient of the audio features which are not detected by the second voice recognition model reaches a second awakening threshold value within the preset time, stopping running the second voice recognition model.
For the embodiment of the invention, if the second voice recognition model is used for secondary verification within the preset time but the confidence of the audio features is not detected to reach the second wake-up threshold, the second voice recognition model stops running, and the voice features are recognized only through the first voice recognition model.
For the embodiment of the invention, the preset time can be set by the electronic equipment, can also be configured by a user, or can be set by a voice recognition model manufacturer. The present invention is not limited to the embodiments.
For example, the preset time may be 5 seconds(s), 8s, or 10 s.
For the embodiment of the invention, because the network structure of the second speech recognition model is complex and the calculation mode is complex, the calculation pressure for calculating the audio features through the second speech recognition model is large, so that if the confidence coefficient of the audio features which is not detected through the second speech recognition model reaches the second awakening threshold value within the preset time, the second speech recognition model stops running, thereby reducing the calculation overhead and the time delay of the speech awakening, and further improving the user experience.
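Embodiment four can be pictured as a simple watchdog: if the second model has not reached the second wake-up threshold within a preset time (5 s in this sketch, one of the example values above), it is shut down and only the first model keeps running; the timer mechanism itself is an assumption:

```python
# Illustrative sketch of embodiment four: stop the second model if it has not
# reached the second wake-up threshold within a preset time window.
import time

PRESET_SECONDS = 5.0   # example value from the text (5 s, 8 s or 10 s)

class SecondModelWatchdog:
    def __init__(self):
        self.started_at = None

    def start(self):
        self.started_at = time.monotonic()

    def should_stop(self, reached_second_wake_threshold: bool) -> bool:
        """True once the preset time has elapsed without the threshold being reached."""
        if reached_second_wake_threshold or self.started_at is None:
            return False
        return (time.monotonic() - self.started_at) > PRESET_SECONDS
```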
Example five
Another possible implementation manner of the embodiment of the present invention further includes the operation shown in the fifth embodiment on the basis of any one of the first to fourth embodiments, wherein,
step S101 further includes a step Se (not shown), wherein,
and step Se, training the first voice recognition model and the second voice recognition model.
Specifically, the training of the first speech recognition model in step Se includes step Se1 (not shown), wherein,
step Se1, training a first speech recognition model based on the plurality of first training samples.
Wherein the first training sample comprises: a first audio feature carrying first label information, the first label information being used to represent whether the first audio feature is an audio feature corresponding to the preset wake-up word, and the first audio feature being Mel Frequency Cepstral Coefficients (MFCC).
For the embodiment of the present invention, the first audio feature may be MFCC, and may also be Mel-scale Filter Bank (Mel-scale Filter Bank, F-Bank), Constant Q-value Cepstral Coefficients (CQCC), Perceptual Linear Prediction Coefficients (PLP), Linear Prediction Cepstral Coefficients (LPCC), and any combination thereof.
Specifically, the training of the second speech recognition model in step Se includes step Se2 (not shown), wherein,
step Se2, training a second speech recognition model based on the plurality of second training samples.
Wherein the second training sample comprises: a second audio feature carrying second label information, the second label information being used to represent whether the second audio feature is an audio feature corresponding to the preset wake-up word, and the second audio feature including: Mel-frequency cepstral coefficients and pitch.
For the embodiments of the present invention, the MFCCs in the second audio feature may or may not be the same as the MFCCs in the first audio feature. The embodiments of the present invention are not limited.
For embodiments of the present invention, the second audio feature includes pitch, and may further include: MFCC, F-Bank, CQCC, PLP, LPCC, and any combination thereof.
For the embodiment of the invention, the first voice recognition model and the second voice recognition model are respectively trained through different audio features, so that the first voice recognition model and the second voice recognition model can calculate the confidence degrees of the input audio features from different dimensions to determine whether to execute the awakening operation, thereby improving the accuracy of voice awakening and further improving the user experience.
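A sketch of how the two models could be trained on their different feature sets (MFCC for the first model, MFCC plus pitch for the second), reusing the hypothetical model classes from the earlier sketch; the loss, optimizer and data-loading details are assumptions:

```python
# Illustrative sketch of embodiment five: each model is trained on features
# carrying binary wake-word labels. Optimizer, loss and the contents of the
# data loaders are assumptions for illustration.
import torch
import torch.nn as nn

def train_model(model, data_loader, epochs: int = 10, lr: float = 1e-3):
    """data_loader yields (features, label) pairs; label is 1 for the wake-up word."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    model.train()
    for _ in range(epochs):
        for features, label in data_loader:
            optimizer.zero_grad()
            pred = model(features).squeeze(-1)
            loss = loss_fn(pred, label.float())
            loss.backward()
            optimizer.step()
    return model

# Hypothetical usage, with mfcc_loader / mfcc_pitch_loader supplying the two
# different training sample types described above:
# first_model  = train_model(FirstRecognitionModel(feat_dim=13), mfcc_loader)
# second_model = train_model(SecondRecognitionModel(feat_dim=14), mfcc_pitch_loader)
```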
Example six
Another possible implementation manner of the embodiment of the present invention further includes, on the basis of the first embodiment, the steps shown in the sixth embodiment, wherein,
step S101 further includes step Sf (not shown in the figure) -step Sg (not shown in the figure), wherein,
and step Sf, determining the frame number of the audio features to be cached or the time length of the voice signals to be cached based on the byte number of the preset awakening word.
For the embodiment of the invention, the number of bytes of the preset awakening word is in direct proportion to the number of frames of the audio features to be cached (or the time length of the voice signals to be cached).
For example, if the preset wake-up word is "hello XX" and its number of bytes is 8, the number of frames of the audio features to be cached may be determined to be 200, or the time length of the voice signal to be cached may be 110 ms.
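As a sketch of the proportionality described in step Sf, the cache size could be derived from the byte count of the wake-up word; the factor of 25 frames per byte is a purely hypothetical constant chosen only so that the 8-byte example above maps to 200 frames:

```python
# Illustrative sketch: derive the number of frames to cache from the byte count
# of the preset wake-up word. FRAMES_PER_BYTE = 25 is a hypothetical constant.
FRAMES_PER_BYTE = 25

def frames_to_cache(wake_word: str, encoding: str = "utf-8") -> int:
    n_bytes = len(wake_word.encode(encoding))
    return n_bytes * FRAMES_PER_BYTE

# frames_to_cache("hello XX")  ->  8 bytes * 25 = 200 frames
```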
And step Sg, caching the audio features corresponding to the frame number of the audio features to be cached, or caching the voice signals of the time length.
For the embodiment of the invention, the electronic device may cache the audio features of the preset frame number, or cache the voice signal of the preset time length, while inputting the audio features into the first voice recognition model; alternatively, the electronic device may cache the audio features of the preset frame number, or the voice signal of the preset time length, after the audio features are input into the first voice recognition model. The present invention is not limited to the embodiments. In the embodiment of the present invention, the preset frame number and the preset time length are determined in step Sf.
For the embodiment of the present invention, when the first voice recognition model detects the audio features and the confidence reaches the first confidence threshold without reaching the first wake-up threshold, the audio features of the first preset frame number just before the first confidence threshold is reached need to be input to the second voice recognition model for secondary verification; likewise, when the first voice recognition model does not detect the wake-up threshold while detecting the second audio features and the second voice recognition model has not detected the wake-up threshold while detecting the first audio features, the second audio features also need to be input to the second voice recognition model. Therefore, the audio features corresponding to the number of frames of the audio features to be cached, or the voice signal of the corresponding time length, need to be cached.
For the embodiment of the invention, based on the number of bytes of the preset awakening word, the number of frames of the audio features to be cached or the time length of the voice signals to be cached is determined, and the audio features corresponding to the number of frames of the audio features to be cached or the voice signals of the time length are cached, i.e. the audio features or the voice signals are cached according to the number of bytes of the preset awakening word, so that on the premise of verifying voice awakening, the caching pressure can be reduced, the voice awakening time delay is reduced, and the user experience is improved.
Example seven
Another possible implementation manner of the embodiment of the present invention further includes, on the basis of the step shown in the first embodiment, the step shown in the seventh embodiment, wherein,
determining that the confidence of the audio features reaches the first confidence threshold and does not reach the first wake-up threshold comprises a step Si (not shown in the figure), wherein,
and step Si, when a preset output item output by the first voice recognition model is detected, determining that the audio feature confidence coefficient reaches a first confidence coefficient threshold value and does not reach a first awakening threshold value.
And the preset output item is used for representing that the audio features can be input into the second speech recognition model for verification.
For the embodiment of the present invention, the preset output item is a confidence level that the voice feature needs to be input into the second voice recognition model for verification, that is, a probability that the voice feature needs to be input into the second voice recognition model for verification; or the preset output item may also be a preset identification, for example, "0" or "1", for identifying whether the audio feature is currently input to the second speech recognition model for verification.
For the embodiment of the invention, the first voice recognition model is added with the preset output item, and whether the voice characteristics are required to be input into the second voice recognition model for secondary verification can be directly determined through the preset output item, so that the voice awakening time delay can be reduced.
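One conceivable realisation of the preset output item, sketched below with an assumed two-head network structure, is an extra output on the first model alongside the wake-up confidence that signals whether the current features should be routed to the second model:

```python
# Illustrative sketch of example seven: the first model exposes a "preset output
# item" in addition to the wake-up confidence, here a probability that the
# current features should be sent to the second model for secondary verification.
# The two-head structure is an assumption, not the patent's stated design.
import torch
import torch.nn as nn

class FirstModelWithRouteOutput(nn.Module):
    def __init__(self, feat_dim: int = 13):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU())
        self.wake_head = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())
        self.route_head = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, frame: torch.Tensor):
        h = self.backbone(frame)
        wake_confidence = self.wake_head(h)     # compared against the first wake-up threshold
        route_probability = self.route_head(h)  # the "preset output item"; could equally be a 0/1 flag
        return wake_confidence, route_probability
```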
Example eight
As shown in fig. 2, a schematic structural diagram of a voice wake-up apparatus 20 according to an embodiment of the present invention may include: a first input module 201, a second input module 202, a first determination module 203, wherein,
a first input module 201, configured to sequentially input the audio features extracted from the speech signal into the first speech recognition model.
The second input module 202 is configured to, when it is determined that the confidence level of the audio feature reaches the first confidence level threshold and does not reach the first wake-up threshold, sequentially input a second audio feature to the first speech recognition model, and sequentially input the first audio feature to the second speech recognition model, where the first audio feature is an audio feature in which the first speech recognition model detects a first preset frame number just before reaching the first confidence level threshold, the first confidence level threshold is a minimum value of the confidence level of the audio feature that needs to be input to the second speech recognition model for speech recognition, and the second audio feature is an audio feature in which the second preset frame number is later than the first audio feature.
For the embodiment of the present invention, the first input module 201 and the second input module 202 may be the same input module or different input modules. The present invention is not limited to the embodiments.
A first determining module 203, configured to determine to perform a wake-up operation when the first preset condition is met.
Wherein the first preset condition comprises at least one of the following:
when the first voice recognition model detects the second audio frequency characteristic, detecting that the confidence coefficient of the audio frequency characteristic reaches a first awakening threshold value;
and when the second voice recognition model detects the first audio features, detecting that the confidence coefficient of the audio features reaches a second awakening threshold value.
The embodiment of the invention provides a voice awakening device, which is characterized in that audio features extracted from voice signals are sequentially input into a first voice recognition model, when the confidence coefficient of the audio features reaches a first confidence coefficient threshold and does not reach the first awakening threshold, second audio features are sequentially input into the first voice recognition model, first audio features are sequentially input into a second voice recognition model, the first audio features are the audio features of a first preset frame number just before the first confidence coefficient threshold is detected by the first voice recognition model, the first confidence coefficient threshold is the minimum value of the confidence coefficient of the audio features needing to be input into the second voice recognition model for voice recognition, the second audio features are the audio features of a second preset frame number after the first audio features, and when a first preset condition is met, the awakening operation is determined to be executed, wherein, the first preset condition comprises: when the first voice recognition model detects the second audio frequency characteristic, the confidence coefficient is detected to reach a first awakening threshold value, and/or when the second voice recognition model detects the first audio frequency characteristic, the confidence coefficient is detected to reach a second awakening threshold value. In other words, in the embodiment of the present invention, the confidence of the audio feature may be detected through the first speech recognition model and/or the second speech recognition model to determine whether the user speech includes the preset wake-up word, and further determine whether to execute the wake-up operation, so that whether the user speech includes the specific speech wake-up word may be detected to perform the speech wake-up.
The voice wake-up apparatus of the embodiment of the present invention can execute the voice wake-up method provided in the first embodiment of the present invention, and the implementation principle is similar, which is not described herein again.
Example nine
As shown in fig. 3, a schematic structural diagram of another voice wake-up apparatus provided in the embodiment of the present invention, a device 30 in the embodiment of the present invention may include: a first input module 301, a second input module 302, a first determination module 303, wherein,
a first input module 301, configured to sequentially input the audio features extracted from the speech signal into the first speech recognition model.
The first input module 301 in fig. 3 has the same or similar function as the first input module 201 in fig. 2.
The second input module 302 is configured to, when it is determined that the confidence level of the audio feature reaches the first confidence level threshold and does not reach the first wake-up threshold, sequentially input the second audio feature to the first speech recognition model, and sequentially input the first audio feature to the second speech recognition model.
The first audio feature is an audio feature of a first preset frame number just before the first speech recognition model detects the first confidence threshold, the first confidence threshold is the minimum value of the confidence of the audio feature which needs to be input to the second speech recognition model for speech recognition, and the second audio feature is an audio feature of a second preset frame number after the first audio feature.
Wherein the second input module 302 in fig. 3 has the same or similar function as the second input module 202 in fig. 2.
A first determining module 303, configured to determine to perform a wake-up operation when a first preset condition is met.
Wherein the first preset condition comprises at least one of the following:
when the first voice recognition model detects the second audio frequency characteristic, detecting that the confidence coefficient of the audio frequency characteristic reaches a first awakening threshold value;
and when the second voice recognition model detects the first audio features, detecting that the confidence coefficient of the audio features reaches a second awakening threshold value.
The first determining module 303 in fig. 3 has the same or similar function as the first determining module 203 in fig. 2.
Further, as shown in fig. 3, the apparatus 30 further includes: a third input module 304, a second determination module 305, wherein,
and the third input module 304 is configured to sequentially input the second audio features to the second speech recognition model when a second preset condition is met.
Wherein the second preset condition comprises: when the first voice recognition model detects the second audio frequency characteristic, the confidence coefficient is not detected to reach the first awakening threshold value, and when the second voice recognition model detects the first audio frequency characteristic, the confidence coefficient is detected to not reach the second awakening threshold value.
For the embodiment of the present invention, the first input module 301, the second input module 302, and the third input module 304 may be different input modules, or may be the same input module, or any two of them may be the same input module. The present invention is not limited to the embodiments.
A second determining module 305, configured to determine to perform a voice wakeup operation when the second speech recognition model detects that the confidence level reaches a second wakeup threshold when detecting the second audio feature.
For the embodiment of the invention, when the first voice recognition model does not detect the confidence reaching the first wake-up threshold while detecting the second audio features, and the second voice recognition model does not detect the confidence reaching the second wake-up threshold while detecting the first audio features, the second audio features are sequentially input into the second voice recognition model; when the second voice recognition model detects that the confidence of the second audio features reaches the second wake-up threshold, the voice wake-up operation can be executed. That is, whether to execute the wake-up operation is determined only after the audio features have been verified by both the first voice recognition model and the second voice recognition model, so the accuracy of voice wake-up can be improved and the user experience improved.
Further, as shown in fig. 3, the apparatus 30 further includes: a third determination module 306, wherein,
the third determining module 306 is configured to determine to perform a wake-up operation when the first speech recognition model detects that the confidence level of the audio feature reaches the confidence level threshold while detecting the first audio feature.
For the embodiment of the invention, the audio features are sequentially input into the first voice recognition model, and if the first voice recognition model detects the first audio features and detects that the confidence coefficient of the audio features reaches the first awakening threshold value, the awakening operation is directly executed without verification through the second voice recognition model, so that the voice awakening delay can be reduced, the computing pressure of the electronic equipment is reduced, and the user experience can be improved.
Further, as shown in fig. 3, the apparatus 30 further includes: a running module 307, wherein,
the running module 307 is configured to stop running the second speech recognition model when, within a preset time, the second speech recognition model has not detected that the confidence of the audio features reaches the second wake-up threshold.
For the embodiment of the invention, because the network structure of the second speech recognition model is complex and the calculation mode is complex, the calculation pressure for calculating the audio features through the second speech recognition model is large, so that if the confidence coefficient of the audio features which is not detected through the second speech recognition model reaches the second awakening threshold value within the preset time, the second speech recognition model stops running, thereby reducing the calculation overhead and the time delay of the speech awakening, and further improving the user experience.
Further, the apparatus 30 further comprises: a training module 308, wherein,
a training module 308 for training the first speech recognition model and the second speech recognition model.
Specifically, as shown in fig. 3, the training module 308 includes: a first training unit 3081 and a second training unit 3082, wherein,
the first training unit 3081 is configured to train a first speech recognition model based on a plurality of first training samples.
Wherein the first training sample comprises: a first audio feature carrying first label information, the first label information being used to represent whether the first audio feature is an audio feature corresponding to the preset wake-up word, and the first audio feature being a Mel frequency cepstrum coefficient.
The second training unit 3082 is configured to train a second speech recognition model based on a plurality of second training samples.
Wherein the second training sample comprises: a second audio feature carrying second label information, the second label information being used to represent whether the second audio feature is an audio feature corresponding to the preset wake-up word, and the second audio feature including: Mel-frequency cepstral coefficients and pitch.
For the embodiment of the present invention, the first training unit 3081 and the second training unit 3082 may be the same training unit or different training units. The present invention is not limited to the embodiments.
For the embodiment of the invention, the first voice recognition model and the second voice recognition model are respectively trained through different audio features, so that the first voice recognition model and the second voice recognition model can calculate the confidence degrees of the input audio features from different dimensions to determine whether to execute the awakening operation, thereby improving the accuracy of voice awakening and further improving the user experience.
Further, as shown in fig. 3, the apparatus 30 further includes: a fourth determining module 309 and a caching module 310, wherein,
the fourth determining module 309 is configured to determine, based on the number of bytes of the preset wake-up word, the number of frames of audio features to be cached or the time length of the speech signal to be cached.
The caching module 310 is configured to cache the determined number of frames of audio features, or to cache the speech signal of the determined time length.
In this embodiment of the invention, the number of frames of audio features to be cached, or the time length of the speech signal to be cached, is determined based on the number of bytes of the preset wake-up word, and the corresponding audio features or speech signal are then cached. In other words, the amount of audio that is cached is matched to the length of the preset wake-up word, so that wake-up verification remains possible while the caching load and the voice wake-up latency are reduced, which improves the user experience.
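The sketch below illustrates one way such a cache budget could be derived from the byte length of the preset wake-up word; the per-byte speech duration, the 10 ms feature hop, and the 16 kHz sample rate are arbitrary assumptions, not values taken from this embodiment.

```python
import math

def cache_budget(preset_wakeup_word: str, sec_per_byte=0.12, hop_s=0.010, sr=16000):
    """Derive how many feature frames, or how many speech samples, to cache."""
    n_bytes = len(preset_wakeup_word.encode("utf-8"))   # number of bytes of the wake-up word
    duration_s = n_bytes * sec_per_byte                 # time length of speech to cache
    frames_to_cache = math.ceil(duration_s / hop_s)     # frames of audio features to cache
    samples_to_cache = math.ceil(duration_s * sr)       # or raw speech samples to cache
    return frames_to_cache, samples_to_cache

frames, samples = cache_budget("hey assistant")         # hypothetical wake-up word
```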
Further, as shown in fig. 3, the apparatus 30 further includes: a fifth determining module 311, wherein,
The fifth determining module 311 is configured to determine, when a preset output item output by the first speech recognition model is detected, that the confidence of the audio features reaches the first confidence threshold and does not reach the first wake-up threshold.
The preset output item indicates that the audio features can be input into the second speech recognition model for verification.
In this embodiment of the present invention, the first determining module 303, the second determining module 305, the third determining module 306, the fourth determining module 309, and the fifth determining module 311 may be the same determining module or different determining modules, or some of them may be the same determining module while others are different; the embodiment of the present invention is not limited in this respect.
In this embodiment of the invention, the preset output item is added to the first speech recognition model, so that whether the audio features need to be input into the second speech recognition model for secondary verification can be determined directly from the preset output item, which reduces voice wake-up latency.
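As an illustrative reading of this idea, the first model could expose an extra output class whose activation means "hand these features to the second model"; the class layout and the numeric output format below are assumptions, not the patented design.

```python
import numpy as np

BACKGROUND, WAKE, VERIFY = 0, 1, 2   # assumed output classes; VERIFY plays the role of the "preset output item"

def route(first_model_posteriors: np.ndarray) -> str:
    """Decide what to do with the current audio features from the first model's output."""
    cls = int(np.argmax(first_model_posteriors))
    if cls == WAKE:
        return "wake"           # confidence reached the first wake-up threshold
    if cls == VERIFY:
        return "second_model"   # send the cached features to the second model for verification
    return "ignore"             # confidence below the first confidence threshold
```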
The embodiment of the invention provides a voice wake-up apparatus. Audio features extracted from a speech signal are sequentially input into the first speech recognition model. When the confidence of the audio features reaches the first confidence threshold but does not reach the first wake-up threshold, second audio features are sequentially input into the first speech recognition model and first audio features are sequentially input into the second speech recognition model, where the first audio features are the audio features of a first preset number of frames immediately before the first speech recognition model detected the first confidence threshold, the first confidence threshold is the minimum confidence at which audio features need to be input into the second speech recognition model for speech recognition, and the second audio features are the audio features of a second preset number of frames after the first audio features. When a first preset condition is satisfied, the wake-up operation is determined to be executed, where the first preset condition includes: the first speech recognition model detects, while processing the second audio features, that the confidence reaches the first wake-up threshold, and/or the second speech recognition model detects, while processing the first audio features, that the confidence reaches the second wake-up threshold. In other words, in the embodiment of the present invention, the confidence of the audio features can be evaluated by the first speech recognition model and/or the second speech recognition model to determine whether the user's speech contains the preset wake-up word and hence whether to execute the wake-up operation, so that voice wake-up is performed by detecting whether the user's speech contains the specific wake-up word.
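To make the flow above concrete, here is a simplified end-to-end sketch; the model scoring APIs, threshold values, and frame counts are all assumptions, and the per-frame scoring is a deliberate simplification of the models' actual behaviour.

```python
import itertools
from collections import deque

def wakeup_pipeline(first_model, second_model, feature_stream,
                    first_conf_threshold=0.5, first_wakeup_threshold=0.9,
                    second_wakeup_threshold=0.95,
                    first_preset_frames=40, second_preset_frames=40):
    """Two-stage wake-up: fast path on the first model, otherwise cross-check with the second."""
    feature_stream = iter(feature_stream)
    history = deque(maxlen=first_preset_frames)          # recent frames: the "first audio features"
    for frame in feature_stream:
        history.append(frame)
        conf = first_model.score(frame)                  # hypothetical confidence API
        if conf >= first_wakeup_threshold:
            return True                                  # fast path: wake directly
        if conf >= first_conf_threshold:
            first_audio = list(history)                  # frames just before the threshold was hit
            second_audio = list(itertools.islice(feature_stream, second_preset_frames))
            cond_a = any(first_model.score(f) >= first_wakeup_threshold for f in second_audio)
            cond_b = any(second_model.score(f) >= second_wakeup_threshold for f in first_audio)
            return cond_a or cond_b                      # first preset condition met?
    return False
```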
The voice wake-up apparatus of the embodiment of the present invention can execute the voice wake-up method shown in any one of the first to seventh embodiments of the present invention, and the implementation principles thereof are similar, and are not described herein again.
Example ten
An embodiment of the present invention provides an electronic device. As shown in fig. 4, the electronic device 4000 includes a processor 4001 and a memory 4003. The processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may also include a communication interface 4004. Note that in practical applications the number of communication interfaces 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiment of the present invention.
The processor 4001 is used in this embodiment of the present invention to implement the functions of the first input module, the first determining module, the second input module, and the second determining module shown in fig. 2 or fig. 3, and of the third input module, the third determining module, the fourth determining module, the running module, the training module, the fifth determining module, the caching module, and the sixth determining module shown in fig. 3.
The processor 4001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the disclosure of the embodiments. The processor 4001 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 4002 may include a path that carries information between the above components. The bus 4002 may be a PCI bus, an EISA bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 4, but this does not mean that there is only one bus or one type of bus.
Memory 4003 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disk storage, an optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 4003 is used to store the application program code for executing the solutions of the embodiments of the present invention, and execution is controlled by the processor 4001. The processor 4001 is configured to execute the application program code stored in the memory 4003 to implement the actions of the voice wake-up apparatus provided by the embodiment shown in fig. 2 or fig. 3.
The embodiment of the invention provides an electronic device. Audio features extracted from a speech signal are sequentially input into the first speech recognition model. When the confidence of the audio features reaches the first confidence threshold but does not reach the first wake-up threshold, second audio features are sequentially input into the first speech recognition model and first audio features are sequentially input into the second speech recognition model, where the first audio features are the audio features of a first preset number of frames immediately before the first speech recognition model detected the first confidence threshold, the first confidence threshold is the minimum confidence at which audio features need to be input into the second speech recognition model for speech recognition, and the second audio features are the audio features of a second preset number of frames after the first audio features. When a first preset condition is satisfied, the wake-up operation is determined to be executed, where the first preset condition includes: the first speech recognition model detects, while processing the second audio features, that the confidence reaches the first wake-up threshold, and/or the second speech recognition model detects, while processing the first audio features, that the confidence reaches the second wake-up threshold. In other words, in the embodiment of the present invention, the confidence of the audio features can be evaluated by the first speech recognition model and/or the second speech recognition model to determine whether the user's speech contains the preset wake-up word and hence whether to execute the wake-up operation, so that voice wake-up is performed by detecting whether the user's speech contains the specific wake-up word.
The electronic device provided by the embodiment of the present invention is applicable to any of the foregoing method embodiments, and details are not repeated here.
Example eleven
An embodiment of the present invention provides a non-transitory computer-readable storage medium that stores computer instructions which cause a computer to execute the voice wake-up method shown in any one of the first to seventh embodiments.
The embodiment of the present invention provides a non-transitory computer-readable storage medium. Audio features extracted from a speech signal are sequentially input into the first speech recognition model. When it is determined that the confidence of the audio features reaches the first confidence threshold but does not reach the first wake-up threshold, second audio features are sequentially input into the first speech recognition model and first audio features are sequentially input into the second speech recognition model, where the first audio features are the audio features of a first preset number of frames immediately before the first speech recognition model detected the first confidence threshold, the first confidence threshold is the minimum confidence at which audio features need to be input into the second speech recognition model for speech recognition, and the second audio features are the audio features of a second preset number of frames after the first audio features. When a first preset condition is satisfied, the wake-up operation is determined to be executed, where the first preset condition includes: the first speech recognition model detects, while processing the second audio features, that the confidence reaches the first wake-up threshold, and/or the second speech recognition model detects, while processing the first audio features, that the confidence reaches the second wake-up threshold. In other words, in the embodiment of the present invention, the confidence of the audio features can be evaluated by the first speech recognition model and/or the second speech recognition model to determine whether the user's speech contains the preset wake-up word and hence whether to execute the wake-up operation, so that voice wake-up is performed by detecting whether the user's speech contains the specific wake-up word.
The non-transitory computer-readable storage medium provided by the embodiment of the present invention is applicable to any of the foregoing method embodiments, and details are not repeated here.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the figures may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be executed at different times, and their execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
The foregoing is only a partial embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention.

Claims (10)

when the confidence of the audio features reaches a first confidence threshold and does not reach a first wake-up threshold, sequentially inputting second audio features to the first voice recognition model, and sequentially inputting first audio features to the second voice recognition model, wherein the first audio features are the audio features of a first preset number of frames immediately before the first voice recognition model detects the first confidence threshold, the first confidence threshold is the minimum value of the confidence of audio features that need to be input to the second voice recognition model for voice recognition, and the second audio features are the audio features of a second preset number of frames after the first audio features;
the second input module is configured to, when the confidence of the audio features reaches a first confidence threshold and does not reach a first wake-up threshold, sequentially input second audio features to the first voice recognition model and sequentially input first audio features to the second voice recognition model, wherein the first audio features are the audio features of a first preset number of frames immediately before the first voice recognition model detects the first confidence threshold, the first confidence threshold is the minimum value of the confidence of audio features that need to be input to the second voice recognition model for voice recognition, and the second audio features are the audio features of a second preset number of frames after the first audio features;
CN201811004154.2A | 2018-08-30 | 2018-08-30 | Voice awakening method, electronic equipment and non-transient computer readable storage medium | Pending | CN109215647A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201811004154.2A | CN109215647A (en) | 2018-08-30 | 2018-08-30 | Voice awakening method, electronic equipment and non-transient computer readable storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201811004154.2A | CN109215647A (en) | 2018-08-30 | 2018-08-30 | Voice awakening method, electronic equipment and non-transient computer readable storage medium

Publications (1)

Publication Number | Publication Date
CN109215647A (en) | 2019-01-15

Family

ID=64986394

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201811004154.2A | Pending | CN109215647A (en) | 2018-08-30 | 2018-08-30

Country Status (1)

Country | Link
CN (1) | CN109215647A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102314884A (en)* | 2011-08-16 | 2012-01-11 | 捷思锐科技(北京)有限公司 | Voice-activation detecting method and device
CN104144377A (en)* | 2013-05-09 | 2014-11-12 | Dsp集团有限公司 | Low power activation of voice activated device
US20150235651A1 (en)* | 2014-02-14 | 2015-08-20 | Google Inc. | Reference signal suppression in speech recognition
CN105679316A (en)* | 2015-12-29 | 2016-06-15 | 深圳微服机器人科技有限公司 | Voice keyword identification method and apparatus based on deep neural network
CN105654949A (en)* | 2016-01-07 | 2016-06-08 | 北京云知声信息技术有限公司 | Voice wake-up method and device
CN106448663A (en)* | 2016-10-17 | 2017-02-22 | 海信集团有限公司 | Voice wakeup method and voice interaction device
US20180182388A1 (en)* | 2016-12-23 | 2018-06-28 | Intel Corporation | Linear scoring for low power wake on voice
CN106653021A (en)* | 2016-12-27 | 2017-05-10 | 上海智臻智能网络科技股份有限公司 | Voice wake-up control method and device and terminal
CN107256707A (en)* | 2017-05-24 | 2017-10-17 | 深圳市冠旭电子股份有限公司 | A kind of audio recognition method, system and terminal device
CN107134279A (en)* | 2017-06-30 | 2017-09-05 | 百度在线网络技术(北京)有限公司 | A kind of voice awakening method, device, terminal and storage medium
CN107622770A (en)* | 2017-09-30 | 2018-01-23 | 百度在线网络技术(北京)有限公司 | voice awakening method and device
CN108198548A (en)* | 2018-01-25 | 2018-06-22 | 苏州奇梦者网络科技有限公司 | A kind of voice awakening method and its system
CN108335696A (en)* | 2018-02-09 | 2018-07-27 | 百度在线网络技术(北京)有限公司 | Voice awakening method and device

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110223687A (en)* | 2019-06-03 | 2019-09-10 | Oppo广东移动通信有限公司 | Instruction execution method, device, storage medium and electronic device
CN110223687B (en)* | 2019-06-03 | 2021-09-28 | Oppo广东移动通信有限公司 | Instruction execution method and device, storage medium and electronic equipment
CN110534099A (en)* | 2019-09-03 | 2019-12-03 | 腾讯科技(深圳)有限公司 | Voice wakes up processing method, device, storage medium and electronic equipment
CN110706691A (en)* | 2019-10-12 | 2020-01-17 | 出门问问信息科技有限公司 | Voice verification method and device, electronic equipment and computer readable storage medium
CN110970016A (en)* | 2019-10-28 | 2020-04-07 | 苏宁云计算有限公司 | Awakening model generation method, intelligent terminal awakening method and device
CN111429911A (en)* | 2020-03-11 | 2020-07-17 | 云知声智能科技股份有限公司 | Method and device for reducing power consumption of speech recognition engine in noise scene
CN111768783A (en)* | 2020-06-30 | 2020-10-13 | 北京百度网讯科技有限公司 | Voice interactive control method, device, electronic device, storage medium and system
CN111768783B (en)* | 2020-06-30 | 2024-04-02 | 北京百度网讯科技有限公司 | Voice interaction control method, device, electronic equipment, storage medium and system
CN113870855A (en)* | 2021-09-29 | 2021-12-31 | 联想(北京)有限公司 | A wake-up method for a device and its electronic device
CN114333794A (en)* | 2021-12-21 | 2022-04-12 | 科大讯飞股份有限公司 | Voice wake-up method, device, electronic device and storage medium
CN114333794B (en)* | 2021-12-21 | 2025-08-22 | 中国科学技术大学 | Voice wake-up method, device, electronic device, and storage medium

Similar Documents

Publication | Publication Date | Title
CN109215647A (en) | Voice awakening method, electronic equipment and non-transient computer readable storage medium
CN110364143B (en) | Voice awakening method and device and intelligent electronic equipment
CN108305617B (en) | Method and device for identifying speech keywords
CN108010515B (en) | A voice endpoint detection and wake-up method and device
CN110047481B (en) | Method and apparatus for speech recognition
CN110706690A (en) | Speech recognition method and device
CN112652306B (en) | Voice wakeup method, voice wakeup device, computer equipment and storage medium
CN108320733A (en) | Voice data processing method and device, storage medium, electronic equipment
CN108305634A (en) | Decoding method, decoder and storage medium
CN103065629A (en) | Speech recognition system of humanoid robot
CN112071308A (en) | Awakening word training method based on speech synthesis data enhancement
CN111445900A (en) | Front-end processing method and device for voice recognition and terminal equipment
KR20140031790A (en) | Robust voice activity detection in adverse environments
CN111276124B (en) | Keyword recognition method, device, equipment and readable storage medium
CN111179944B (en) | Voice awakening and age detection method and device and computer readable storage medium
CN111599371A (en) | Voice adding method, system, device and storage medium
CN113823264A (en) | Speech recognition method, apparatus, computer readable storage medium and computer equipment
CN102945673A (en) | Continuous speech recognition method with speech command range changed dynamically
WO2021063101A1 (en) | Speech breakpoint detection method, apparatus and device based on artificial intelligence
CN115831109A (en) | Voice awakening method and device, storage medium and electronic equipment
CN109215634A (en) | Method and system for multi-word voice control on-off device
CN117037772A (en) | Voice audio segmentation method, device, computer equipment and storage medium
CN111128174A (en) | Voice information processing method, device, equipment and medium
CN114678040B (en) | Voice consistency detection method, device, equipment and storage medium
CN112750469B (en) | Method for detecting music in speech, method for optimizing speech communication and corresponding device

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190115
