CN110580908A - command word detection method and device supporting different languages - Google Patents

Command word detection method and device supporting different languages

Info

Publication number
CN110580908A
CN110580908A
Authority
CN
China
Prior art keywords
classification prediction
command word
probability
audio features
command
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910932340.0A
Other languages
Chinese (zh)
Inventor
匡方军
李深
雷欣
李志飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chumen Wenwen Information Technology Co Ltd
Original Assignee
Chumen Wenwen Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chumen Wenwen Information Technology Co Ltd
Priority to CN201910932340.0A
Publication of CN110580908A
Legal status: Pending

Abstract

The invention discloses a command word detection method and device supporting different languages. The method first collects a speech signal containing at least two different languages; then extracts audio features from the speech signal; next performs classification prediction on the extracted audio features through a deep learning network to obtain a classification prediction result; and, if the classification prediction result shows that a command word has the highest probability, takes that command word as the output result.

Description

Command word detection method and device supporting different languages
Technical Field
The invention relates to speech recognition technology, and in particular to a method and device for detecting command words in different languages.
Background
With the continuous development of science and technology, voice interaction has been widely applied to embedded devices such as mobile phones, watches, smart speakers, and earphones. To reduce a device's operating power consumption, a specific command word is generally used to wake it up, such as "Hey Siri" on Apple phones or "Xiao Ai Tong Xue" on the Xiaomi AI speaker.
In the related art, a command word detection system usually performs feature extraction on the input speech signal and feeds the result to a deep learning network. The network outputs the probability of each word of the command phrase, and a post-processing module then derives the probability that the input speech signal contains the command word. When this probability exceeds a given threshold, the system judges that the input speech signal contains the command word; otherwise, it determines that no command word was detected. Clearly, such a method supports only a single language and the detection of only a single command word.
Disclosure of Invention
In order to overcome the defects of the current command word detection system, the embodiment of the invention creatively provides a command word detection method and device supporting different languages.
According to a first aspect of the present invention, there is provided a command word detection method supporting different languages, the method comprising: collecting a speech signal containing at least two different languages; extracting audio features of the speech signal; performing classification prediction on the extracted audio features through a deep learning network to obtain a classification prediction result; and, if the classification prediction result shows that a command word has the highest probability, taking that command word as the output result.
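The four operations can be sketched as a minimal pipeline. This is only an illustrative outline: `extract_features` and `classify` are hypothetical stand-ins for the Fbank/MFCC front end and the deep learning network described below, not part of the patent itself.

```python
def detect_command_word(speech_frames, extract_features, classify):
    """Run the four operations end to end: feature extraction, classification
    prediction through a network, and an argmax decision over the result.

    `classify` returns a dict mapping each supported command word, plus the
    special label "<non-command>", to a probability.
    """
    features = [extract_features(frame) for frame in speech_frames]  # operation 202
    probs = classify(features)                                       # operation 203
    best = max(probs, key=probs.get)                                 # operation 204
    return best if best != "<non-command>" else None
```

With a stub classifier, a non-command result yields `None` while a command result yields the winning word.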
According to an embodiment of the present invention, performing classification prediction on the extracted audio features through a deep learning network to obtain a classification prediction result includes: classifying the extracted audio features through a deep learning network supporting multiple languages to obtain the probabilities of command words and non-command words; and determining the category of the word with the highest probability among the obtained probabilities to obtain the classification prediction result.
According to an embodiment of the invention, the method further comprises: if the non-command class has the highest probability in the classification prediction result, judging that the speech signal does not contain a command word.
According to an embodiment of the present invention, extracting the audio feature of the speech signal includes: extracting Fbank characteristics of the voice signal; or, extracting MFCC features of the speech signal.
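To illustrate the Fbank option, the sketch below computes simplified log mel filterbank energies with NumPy. It is schematic rather than the exact front end of the invention: the framing parameters, filter count, FFT size, and triangular mel filters are common defaults assumed here.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Build triangular mel-spaced filters over the rFFT bins."""
    hz_to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    mel_to_hz = lambda mel: 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, center):           # rising slope
            fb[i - 1, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):          # falling slope
            fb[i - 1, j] = (right - j) / max(right - center, 1)
    return fb

def fbank_features(signal, sample_rate=16000, frame_len=400, frame_step=160,
                   n_filters=40, n_fft=512):
    """Frame the signal, window it, take the power spectrum, apply the mel
    filterbank, and return log energies (one row per frame)."""
    frames = np.array([signal[i:i + frame_len]
                       for i in range(0, len(signal) - frame_len + 1, frame_step)])
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    fb = mel_filterbank(n_filters, n_fft, sample_rate)
    return np.log(np.maximum(power @ fb.T, 1e-10))
```

MFCC features would add a DCT over these log energies; established libraries provide both variants ready-made.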
According to an embodiment of the present invention, performing classification prediction on the extracted audio features through a deep learning network includes: performing classification prediction by means of a recurrent neural network (RNN), a convolutional neural network (CNN), or a time delay neural network (TDNN).
According to an embodiment of the present invention, when performing classification prediction on the extracted audio features by means of a CNN, the method further includes: caching the results of intermediate nodes from the previous N rounds of classification prediction, where N is a positive integer; correspondingly, performing classification prediction on the extracted audio features by means of the CNN includes: taking the cached intermediate-node results of the previous N rounds together with the audio features of the current round as the input of the CNN to perform classification prediction.
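A minimal sketch of this caching scheme, assuming a streaming setting in which the network runs once per round; `run_cnn` is a hypothetical stand-in that returns both the class probabilities and the intermediate-node results to cache.

```python
from collections import deque

class StreamingCNNCache:
    """Keep the intermediate-node outputs of the previous N rounds so each
    new round feeds only the current features plus the cache to the CNN."""

    def __init__(self, n_rounds, run_cnn):
        self.cache = deque(maxlen=n_rounds)  # previous N rounds' intermediate results
        self.run_cnn = run_cnn               # hypothetical network callable

    def predict(self, current_features):
        # Cached intermediate results + this round's features form the input.
        probs, intermediate = self.run_cnn(list(self.cache), current_features)
        self.cache.append(intermediate)      # remember for the next round
        return probs
```

The `deque(maxlen=N)` automatically evicts the oldest round, keeping memory bounded regardless of how long the stream runs.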
According to an embodiment of the invention, the method further comprises: controlling execution of the operation corresponding to the command word with the highest probability.
According to a second aspect of the present invention, there is also provided a command word detection apparatus supporting different languages, the apparatus including: an acquisition module for collecting a speech signal containing at least two different languages; a feature extraction module for extracting audio features of the speech signal; a classification prediction module for performing classification prediction on the extracted audio features through a deep learning network to obtain a classification prediction result; and an output module for taking the command word with the highest probability as the output result if the classification prediction result shows that a command word has the highest probability.
According to an embodiment of the present invention, the classification prediction module includes: a classification unit for classifying the extracted audio features through a deep learning network supporting multiple languages to obtain the probabilities of command words and non-command words; and a judging unit for determining the category of the word with the highest probability among the obtained probabilities to obtain the classification prediction result.
According to an embodiment of the present invention, the output module is further configured to: if the non-command class has the highest probability in the classification prediction result, judge that the speech signal does not contain a command word.
According to an embodiment of the present invention, the feature extraction module is specifically configured to extract Fbank features of the speech signal, or to extract MFCC features of the speech signal.
According to an embodiment of the present invention, the classification prediction module is specifically configured to perform classification prediction on the extracted audio features by using a recurrent neural network RNN, a convolutional neural network CNN, or a time-delay neural network TDNN.
According to an embodiment of the invention, the apparatus further comprises a storage module for caching the results of intermediate nodes from the previous N rounds of classification prediction when classification prediction is performed on the extracted audio features by means of a CNN, where N is a positive integer; correspondingly, the classification prediction module is specifically configured to take the cached intermediate-node results of the previous N rounds together with the audio features of the current round as the input of the CNN to perform classification prediction.
According to an embodiment of the invention, the apparatus further comprises a control execution module for controlling execution of the operation corresponding to the command word with the highest probability.
According to an embodiment of the invention, the device is a smart headset or a microphone.
The embodiment of the invention provides a command word detection method and device supporting different languages: first, a speech signal containing at least two different languages is collected; then audio features of the speech signal are extracted; next, classification prediction is performed on the extracted audio features through a deep learning network to obtain a classification prediction result; and, if the classification prediction result shows that a command word has the highest probability, that command word is taken as the output result. By constructing a single deep learning network supporting multiple languages and multiple command words, the invention can classify the extracted audio features and directly obtain the probabilities of the command words and non-command words, thereby overcoming the limitation that existing command word detection methods support only a single language and only a single command word; moreover, storage and computational resources are saved to a great extent, and scalability is good.
It is to be understood that the teachings of the present invention need not achieve all of the above-described benefits, but rather that specific embodiments may achieve specific technical results, and that other embodiments of the present invention may achieve benefits not mentioned above.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Fig. 1 is a schematic diagram showing a basic principle of a command word detection method in the related art;
FIG. 2 is a first flowchart illustrating an implementation of a command word detection method supporting different languages according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a basic principle of a command word detection method supporting different languages according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a specific implementation of a command word detection method for supporting different languages according to an exemplary application of the present invention;
Fig. 5 is a schematic diagram illustrating a configuration of a command word detection device supporting different languages according to an embodiment of the present invention.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given only to enable those skilled in the art to better understand and to implement the present invention, and do not limit the scope of the present invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The technical solution of the present invention is further elaborated below with reference to the drawings and the specific embodiments.
Fig. 1 is a schematic diagram illustrating the basic principle of a command word detection method in the related art. Referring to Fig. 1, the existing method mainly performs feature extraction on the input speech signal and feeds the result to a deep learning network. The network outputs the probability of each word of the command phrase, and a post-processing module then derives the probability that the input speech signal contains the command word. When this probability exceeds a given threshold, the system judges that the input speech signal contains the command word; otherwise, it determines that no command word was detected. Clearly, such a method supports only a single language and the detection of only a single command word.
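The related-art post-processing step can be sketched as follows. The combination rule (multiplying per-word probabilities into one phrase score) and the threshold value are illustrative assumptions, not taken from the text.

```python
def related_art_detect(word_probs, threshold=0.5):
    """Single-command post-processing sketch: combine the network's per-word
    probabilities into one score for the fixed command phrase and compare it
    against a threshold. Returns True when the command word is detected."""
    score = 1.0
    for p in word_probs:   # probability of each word of the command phrase
        score *= p
    return score > threshold
```

Note there is one fixed phrase and one threshold: supporting a second language or a second command word requires a second model, which is exactly the limitation the invention addresses.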
In order to add support for multiple command words in different languages to the existing detection method, a simple and direct approach is to run multiple models simultaneously, each detecting the wake-up words of one language. However, this consumes more memory and computational resources and scales poorly.
Therefore, in order to solve the problem that the conventional command word detection method supports only a single language and only the detection of a single command word, the present invention provides a command word detection method supporting different languages, as shown in Fig. 2. Referring to Fig. 2, the method according to the embodiment of the present invention includes: operation 201, collecting a speech signal containing at least two different languages; operation 202, extracting audio features of the speech signal; operation 203, performing classification prediction on the extracted audio features through a deep learning network to obtain a classification prediction result; and operation 204, if the classification prediction result shows that a command word has the highest probability, taking that command word as the output result.
In operation 201, a speech signal containing at least two different languages is collected. The speech signal may, for example, be in wav format with 16-bit samples, a 16000 Hz sampling rate, and a single channel. Of course, those skilled in the art should understand that these parameters are only an example; the embodiments of the present invention do not limit the specific parameters of the speech signal.
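The example format above can be checked with the standard-library `wave` module; a small sketch (the function name is illustrative, and the hard assertions stand in for whatever error handling a real system would use):

```python
import wave

def load_mono_16k(path):
    """Read a wav file, checking it matches the example format from the text:
    16-bit samples, 16000 Hz sampling rate, single channel."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2       # 16-bit samples
        assert wf.getframerate() == 16000   # 16 kHz sampling rate
        assert wf.getnchannels() == 1       # single channel
        return wf.readframes(wf.getnframes())
```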
In operation 202, feature extraction may be performed on the collected speech signal in units of frames. Specifically, Fbank feature extraction may be performed on the acquired speech signal, or MFCC feature extraction may be performed. Of course, the feature extraction method is not limited to Fbank and MFCC; any other suitable feature extraction method, whether existing now or developed in the future, may be used.
In one application example, feature extraction may be performed on the collected speech signal frame by frame, e.g. on the first frame, the second frame, the third frame, and so on, in the order in which the speech frames are acquired in real time. This ensures the completeness of feature extraction, which in turn ensures the accuracy of subsequent operations and ultimately the accuracy of the command word detection method.
In yet another application example, feature extraction may be performed on the acquired speech signal by skipping frames with a certain step size, e.g. on the first frame, the third frame, the fifth frame, and so on. Because adjacent frames overlap, a reasonably chosen frame-skipping step reduces the overall computational cost of the method while still preserving the completeness of feature extraction, and thus the accuracy of subsequent operations and of the detection method as a whole.
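Both extraction schedules reduce to one small helper; the per-frame extractor is a hypothetical stand-in for the Fbank/MFCC computation.

```python
def extract_with_step(frames, extract, step=2):
    """Feature extraction with frame skipping: step=1 processes every frame,
    step=2 processes the first, third, fifth frame, and so on. Because
    consecutive frames overlap, a moderate step trades little completeness
    for a large reduction in computation."""
    return [extract(frames[i]) for i in range(0, len(frames), step)]
```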
According to an embodiment of the present invention, in operation 203, classification prediction may be performed on the extracted audio features, typically by means of an RNN, CNN, or TDNN; the most suitable network structure may be chosen according to the results of training the deep learning network. When performing classification prediction by means of a CNN, the method further includes: caching the results of intermediate nodes from the previous N rounds of classification prediction, where N is a positive integer; correspondingly, performing classification prediction on the extracted audio features by means of the CNN includes: taking the cached intermediate-node results of the previous N rounds together with the audio features of the current round as the input of the CNN to perform classification prediction.
In operation 203, the extracted audio features may first be classified through a deep learning network supporting multiple languages, yielding the probabilities of the command words and of the non-command class; the category of the word with the highest probability is then determined to obtain the classification prediction result.
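Reading the network's output layer as one distribution over all supported command words plus a non-command class can be sketched like this; the softmax readout is an assumption, since the patent does not specify the output activation.

```python
import math

def classify_output(logits):
    """Convert raw network outputs (one per command word in each supported
    language, plus a non-command class) into probabilities with a softmax,
    then pick the highest-probability class."""
    peak = max(logits.values())
    exps = {k: math.exp(v - peak) for k, v in logits.items()}  # numerically stable
    total = sum(exps.values())
    probs = {k: e / total for k, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs
```

Because every command word in every language is just another output class, adding a new wake-up phrase extends the output layer rather than requiring a new model.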
Referring to the basic principle shown in fig. 3, the embodiment of the present invention can perform classification prediction on the audio features of the extracted speech signal by constructing a deep learning network supporting multiple language multiple command words, and directly obtain the probabilities of multiple command words, such as the probability of command word 1, the probability of command word 2 …, and the probability of non-command word.
According to an embodiment of the invention, the method further comprises: if the non-command class has the highest probability in the classification prediction result, judging that the speech signal does not contain a command word.
Referring to the application example shown in Fig. 4, after the probabilities of the command words and the non-command class are obtained, operation 204 determines the category of the word with the highest probability to obtain the classification prediction result. If the highest-probability category is a command word, that command word is output; otherwise no command word is detected, i.e. the speech signal does not contain a command word.
According to an embodiment of the present invention, after operation 204 the method further comprises: controlling execution of the operation corresponding to the command word with the highest probability.
For example, if the command word with the highest output probability is one that wakes up the device, such as "Hey Siri" on an Apple phone or "Xiao Ai Tong Xue" on the Xiaomi AI speaker, the device can directly execute the corresponding wake-up operation.
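The control-execution step amounts to dispatching the detected word to a registered device operation; the mapping and function name below are hypothetical.

```python
def execute_command(word, actions):
    """Dispatch the detected highest-probability command word to the device
    operation registered for it; `actions` maps command words to callables.
    Returns True if an operation was executed."""
    action = actions.get(word)
    if action is None:
        return False          # no operation registered for this word
    action()
    return True
```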
The embodiment of the invention provides a command word detection method and device supporting different languages: first, a speech signal containing at least two different languages is collected; then audio features of the speech signal are extracted; next, classification prediction is performed on the extracted audio features through a deep learning network to obtain a classification prediction result; and, if the classification prediction result shows that a command word has the highest probability, that command word is taken as the output result. By constructing a single deep learning network supporting multiple languages and multiple command words, the invention can classify the extracted audio features and directly obtain the probabilities of the command words and non-command words, thereby overcoming the limitation that existing command word detection methods support only a single language and only a single command word; moreover, storage and computational resources are saved to a great extent, and scalability is good.
Based on the above command word detection method supporting different languages, an embodiment of the present invention further provides a command word detection device supporting different languages. As shown in Fig. 5, the device 50 includes: an acquisition module 501 configured to collect a speech signal containing at least two different languages; a feature extraction module 502 configured to extract audio features of the speech signal; a classification prediction module 503 configured to perform classification prediction on the extracted audio features through a deep learning network to obtain a classification prediction result; and an output module 504 configured to take the command word with the highest probability as the output result if the classification prediction result shows that a command word has the highest probability.
According to an embodiment of the present invention, the classification prediction module 503 includes: a classification unit for classifying the extracted audio features through a deep learning network supporting multiple languages to obtain the probabilities of command words and non-command words; and a judging unit for determining the category of the word with the highest probability among the obtained probabilities to obtain the classification prediction result.
According to an embodiment of the present invention, the output module 504 is further configured to: if the non-command class has the highest probability in the classification prediction result, judge that the speech signal does not contain a command word.
According to an embodiment of the present invention, the feature extraction module 502 is specifically configured to extract Fbank features of the speech signal, or to extract MFCC features of the speech signal.
According to an embodiment of the present invention, the classification prediction module 503 is specifically configured to perform classification prediction on the extracted audio features by means of a recurrent neural network (RNN), a convolutional neural network (CNN), or a time delay neural network (TDNN).
According to an embodiment of the invention, the apparatus 50 further comprises a storage module for caching the results of intermediate nodes from the previous N rounds of classification prediction when classification prediction is performed on the extracted audio features by means of a CNN, where N is a positive integer; correspondingly, the classification prediction module 503 is specifically configured to take the cached intermediate-node results of the previous N rounds together with the audio features of the current round as the input of the CNN to perform classification prediction.
According to an embodiment of the invention, the apparatus 50 further comprises a control execution module for controlling execution of the operation corresponding to the command word with the highest probability.
The device 50 may be a smart headset or a microphone according to an embodiment of the present invention.
Also, based on the command word detection method supporting different languages as described above, an embodiment of the present invention further provides a computer-readable storage medium storing a program, which, when executed by a processor, causes the processor to perform at least the following operation steps: operation 201, collecting voice signals at least including two different languages; operation 202, extracting an audio feature of the speech signal; operation 203, performing classification prediction on the extracted audio features through a deep learning network to obtain a classification prediction result; in operation 204, if the classification prediction result is that the probability of the command word is the maximum, the command word with the maximum probability is taken as an output result.
Here, it should be noted that the above description of the embodiments of the command word detection device and the computer storage medium supporting different languages is similar to the description of the method embodiments shown in Figs. 2 to 4, with similar beneficial effects. For technical details not disclosed in the device embodiment, please refer to the description of the method embodiments shown in Figs. 2 to 4; for brevity, they are not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be completed by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium includes various media that can store program code, such as a removable memory device, a read-only memory (ROM), a magnetic disk, or an optical disk.
Alternatively, if the integrated unit of the present invention is implemented in the form of a software functional module and sold or used as a separate product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. The aforementioned storage medium includes a removable storage device, a ROM, a magnetic or optical disk, or any other medium that can store program code.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

CN201910932340.0A · 2019-09-29 (priority) · 2019-09-29 (filed) · Command word detection method and device supporting different languages · Pending · CN110580908A (en)

Priority Applications (1)

Application Number: CN201910932340.0A · Priority Date: 2019-09-29 · Filing Date: 2019-09-29 · Title: Command word detection method and device supporting different languages

Publications (1)

Publication Number: CN110580908A (en) · Publication Date: 2019-12-17

Family

ID=68813983

Family Applications (1)

Application Number: CN201910932340.0A · Status: Pending · Publication: CN110580908A (en) · Priority Date: 2019-09-29 · Filing Date: 2019-09-29 · Title: Command word detection method and device supporting different languages

Country Status (1)

Country: CN (1) · Link: CN110580908A (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number · Priority date · Publication date · Assignee · Title
CN102543072A (en) * · 2010-12-07 · 2012-07-04 · 王诚本 · System and method for real-time detection of fatigue
CN103077714A (en) * · 2013-01-29 · 2013-05-01 · 华为终端有限公司 · Information identification method and apparatus
CN103400577A (en) * · 2013-08-01 · 2013-11-20 · 百度在线网络技术(北京)有限公司 · Acoustic model building method and device for multi-language voice identification
CN103559289A (en) * · 2013-11-08 · 2014-02-05 · 安徽科大讯飞信息科技股份有限公司 · Language-independent keyword search method and system
CN103839545A (en) * · 2012-11-23 · 2014-06-04 · 三星电子株式会社 · Apparatus and method for constructing multilingual acoustic model
CN104103272A (en) * · 2014-07-15 · 2014-10-15 · 无锡中星微电子有限公司 · Voice recognition method and device, and Bluetooth earphone
CN105139864A (en) * · 2015-08-17 · 2015-12-09 · 北京天诚盛业科技有限公司 · Voice recognition method and voice recognition device
CN105229725A (en) * · 2013-03-11 · 2016-01-06 · 微软技术许可有限责任公司 · Multilingual deep neural network
CN105741838A (en) * · 2016-01-20 · 2016-07-06 · 百度在线网络技术(北京)有限公司 · Voice wakeup method and voice wakeup device
WO2016110068A1 (en) * · 2015-01-07 · 2016-07-14 · 中兴通讯股份有限公司 · Voice switching method and apparatus for voice recognition device
CN106847281A (en) * · 2017-02-26 · 2017-06-13 · 上海新柏石智能科技股份有限公司 · Intelligent household voice control system and method based on fuzzy voice recognition technology
CN107767863A (en) * · 2016-08-22 · 2018-03-06 · 科大讯飞股份有限公司 · Voice wake-up method, system, and intelligent terminal
CN108510976A (en) * · 2017-02-24 · 2018-09-07 · 芋头科技(杭州)有限公司 · Multilingual mixed speech recognition method
CN109065043A (en) * · 2018-08-21 · 2018-12-21 · 广州市保伦电子有限公司 · Command word recognition method and computer storage medium
CN109065020A (en) * · 2018-07-28 · 2018-12-21 · 重庆柚瓣家科技有限公司 · Multilingual classification recognition library matching method and system


Similar Documents

Publication | Title
EP3611663B1 (en) | Image recognition method, terminal and storage medium
CN110570840B (en) | Intelligent device wake-up method and device based on artificial intelligence
CN111880856B (en) | Voice wake-up method, device, electronic equipment and storage medium
CN112825248B (en) | Voice processing method, model training method, interface display method and equipment
CN111192590B (en) | Voice wake-up method, device, equipment and storage medium
CN113035231B (en) | Keyword detection method and device
CN111128134A (en) | Acoustic model training method, voice wake-up method, device and electronic equipment
KR102688236B1 (en) | Voice synthesizer using artificial intelligence, operating method of voice synthesizer, and computer-readable recording medium
US11030994B2 (en) | Selective activation of smaller-resource-footprint automatic speech recognition engines by predicting a domain topic based on a time since a previous communication
CN111105786B (en) | Multi-sampling-rate voice recognition method, device, system and storage medium
CN114627863A (en) | Speech recognition method and device based on artificial intelligence
CN115798459B (en) | Audio processing method and device, storage medium and electronic equipment
CN114360510B (en) | Speech recognition method and related device
CN111312233A (en) | Voice data identification method, device and system
CN112037772A (en) | Multi-mode-based response obligation detection method, system and device
CN113838462A (en) | Voice wake-up method and device, electronic equipment and computer-readable storage medium
CN111326146A (en) | Method and device for acquiring a voice wake-up template, electronic equipment and computer-readable storage medium
CN113782014A (en) | Voice recognition method and device
WO2024093578A1 (en) | Voice recognition method and apparatus, electronic device, storage medium and computer program product
CN112948763B (en) | Piece quantity prediction method and device, electronic equipment and storage medium
CN113225624B (en) | Method and device for determining the time consumption of voice recognition
KR102642617B1 (en) | Voice synthesizer using artificial intelligence, operating method of voice synthesizer, and computer-readable recording medium
US20200193981A1 (en) | Personalized phrase spotting during automatic speech recognition
CN110580908A (en) | Command word detection method and device supporting different languages
CN110556099B (en) | Command word control method and device

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication (application publication date: 2019-12-17)
