Disclosure of Invention
The inventors have found that current sound source localization techniques can be broadly divided into three categories according to their localization principle: steerable beamforming based on maximum output power, techniques based on time difference of arrival (TDOA), and localization based on high-resolution spectral estimation. The performance of all three kinds of sound source localization algorithms drops sharply in environments with severe reverberation and noise interference, so the angle and direction of the sound source cannot be located accurately, which directly affects subsequent speech recognition and, in turn, the voice wake-up result.
One technical problem to be solved by the present disclosure is: how to improve the accuracy of voice wake-up and thereby improve the user experience.
According to some embodiments of the present disclosure, there is provided a voice wake-up method, including: performing beamforming on a voice signal in a plurality of predetermined directions to obtain a plurality of beams; inputting the beams into a pre-trained keyword recognition model to obtain the probability that each beam contains a keyword; determining the beam pointing in the sound source direction as the sound source beam according to the probability that each beam contains the keyword and the signal quality of each beam; and determining whether to wake up the system according to the feature matching results of the sound source beams at a plurality of consecutive time instants.
In some embodiments, inputting the beams into a pre-trained keyword recognition model comprises: selecting partial beams according to the signal quality of the beams and inputting the selected partial beams into the pre-trained keyword recognition model.
In some embodiments, selecting the partial beams according to the signal quality of the beams includes: determining the signal quality of each beam according to at least one of the energy and the signal-to-noise ratio of the beam within a fixed time window; and selecting the partial beams whose signal quality is higher than a signal quality threshold.
In some embodiments, determining the beam pointing in the sound source direction as the sound source beam according to the probability that each beam contains the keyword and the signal quality of each beam comprises: performing weighted summation on the probability that a beam contains the keyword and the signal quality of the beam to obtain the importance degree of the beam; and selecting the beam with the highest importance degree as the sound source beam, and determining the direction pointed to by the sound source beam as the sound source direction.
In some embodiments, determining whether to wake up the system according to the feature matching results of the sound source beams at a plurality of consecutive time instants comprises: matching the sound source directions pointed to by the sound source beams at the plurality of consecutive time instants, and determining whether the sound source beams at those instants all contain the keyword; and waking up the system when the sound source directions pointed to by the sound source beams at the plurality of consecutive time instants are consistent and the sound source beams at those instants all contain the keyword.
In some embodiments, performing beamforming on the voice signal in a plurality of predetermined directions to obtain a plurality of beams comprises: determining the weight of each path of voice signal received by the microphones relative to a predetermined direction according to the direction of the point source noise, the proportions of the point source noise and the white noise, and the directional vector of the predetermined direction; and performing weighted summation on the paths of voice signals received by the microphones according to these weights to determine the beam in the predetermined direction.
In some embodiments, the weight of each path of speech signal received by the microphones relative to the predetermined direction is calculated according to the following formulas:

$$W_m(k)=\frac{\Phi_m^{-1}(k)\,d_m(k)}{d_m^H(k)\,\Phi_m^{-1}(k)\,d_m(k)},\qquad \Phi_m(k)=\alpha_{psn}\,d_{psn,m}(k)\,d_{psn,m}^H(k)+(1-\alpha_{psn})\,I$$

where $W_m(k)$ is the weight vector, relative to the predetermined direction, of the paths of voice signals received by the microphones during the mth beam processing; $k$ is the index of the frequency bands of the signals received by the microphones; $\Phi_m(k)$ is the covariance matrix of the noise during the mth beam processing and $\Phi_m^{-1}(k)$ is its inverse; $d_m(k)$ is the microphone array pointing vector in the predetermined direction during the mth beam processing and $d_m^H(k)$ is its conjugate transpose; $\alpha_{psn}$ is the proportion of point source interference noise at the predetermined azimuth in the noise and $1-\alpha_{psn}$ is the proportion of white noise in the noise; and $d_{psn,m}(k)$ is the pointing vector of the predetermined-azimuth point source interference noise during the mth beam processing, with $d_{psn,m}^H(k)$ its conjugate transpose.
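A minimal numerical sketch of this MVDR-style weight computation can be written as follows; the uniform linear array geometry, element count, frequency, and α value are illustrative assumptions, not taken from the disclosure:

```python
import numpy as np

def steering_vector(freq_hz, angle_rad, n_mics=4, spacing_m=0.05, c=343.0):
    # Pointing vector d(k) of a uniform linear array for one frequency band;
    # the ULA geometry here is an illustrative assumption.
    delays = np.arange(n_mics) * spacing_m * np.cos(angle_rad) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

def mvdr_weights(freq_hz, look_angle, noise_angle, alpha_psn=0.3, n_mics=4):
    # W_m(k) = Phi^{-1} d / (d^H Phi^{-1} d), with the noise covariance
    # modeled as alpha_psn * d_psn d_psn^H + (1 - alpha_psn) * I.
    d = steering_vector(freq_hz, look_angle, n_mics)
    d_psn = steering_vector(freq_hz, noise_angle, n_mics)
    phi = alpha_psn * np.outer(d_psn, d_psn.conj()) + (1 - alpha_psn) * np.eye(n_mics)
    phi_inv = np.linalg.inv(phi)
    return phi_inv @ d / (d.conj() @ phi_inv @ d)

# The distortionless constraint W^H d = 1 holds in the look direction.
w = mvdr_weights(1000.0, np.deg2rad(90), np.deg2rad(30))
```

Because the covariance is Hermitian, these weights pass the look direction with unit gain while attenuating the modeled point-source interference direction.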
In some embodiments, the method further comprises: performing a beamforming process on the voice signal in a plurality of predetermined directions to obtain a plurality of beams; labeling the plurality of beams with keywords to serve as training beams; and inputting the training beams into the keyword recognition model for training to obtain the pre-trained keyword recognition model.
In some embodiments, before the beamforming of the voice signal in the plurality of predetermined directions, the method further comprises: performing echo cancellation on the voice signal received through the microphones.
In some embodiments, the keyword recognition model comprises: a deep learning model or a hidden Markov model.
According to other embodiments of the present disclosure, there is provided a voice wake-up apparatus including: a beamforming module configured to perform beamforming on a voice signal in a plurality of predetermined directions to obtain a plurality of beams; a keyword recognition module configured to input the beams into a pre-trained keyword recognition model to obtain the probability that each beam contains a keyword; a sound source determining module configured to determine the beam pointing in the sound source direction as the sound source beam according to the probability that each beam contains the keyword and the signal quality of each beam; and a voice wake-up module configured to determine whether to wake up the system according to the feature matching results of the sound source beams at a plurality of consecutive time instants.
In some embodiments, the apparatus further comprises: a beam selection module configured to select partial beams according to the signal quality of the beams and send the selected beams to the keyword recognition module, so that the keyword recognition module inputs the received beams into the pre-trained keyword recognition model.
In some embodiments, the beam selection module is configured to determine the signal quality of a beam based on at least one of the energy and the signal-to-noise ratio of the beam within a fixed time window, and to select the partial beams whose signal quality is higher than the signal quality threshold.
In some embodiments, the sound source determining module is configured to perform weighted summation on the probability that the beam includes the keyword and the signal quality of the beam to obtain the importance degree of the beam, select the beam with the highest importance degree as the sound source beam, and determine the direction pointed by the sound source beam as the sound source direction.
In some embodiments, the voice wake-up module is configured to match sound source directions pointed by sound source beams at multiple consecutive time instances, and determine whether the sound source beams at the multiple consecutive time instances all contain a keyword, and wake up the system if the sound source directions pointed by the sound source beams at the multiple consecutive time instances are consistent, and the sound source beams at the multiple consecutive time instances all contain the keyword.
In some embodiments, the beam forming module is configured to determine a weight of each path of voice signals received by the microphone with respect to a predetermined direction according to a direction of the point source noise, a ratio of the point source noise to the white noise, and a directional vector of the predetermined direction, and perform weighted summation on each path of voice signals received by the microphone according to the weight of each path of voice signals received by the microphone with respect to the predetermined direction to determine a beam in the predetermined direction.
In some embodiments, the weight of each path of speech signal received by the microphones relative to the predetermined direction is calculated according to the following formulas:

$$W_m(k)=\frac{\Phi_m^{-1}(k)\,d_m(k)}{d_m^H(k)\,\Phi_m^{-1}(k)\,d_m(k)},\qquad \Phi_m(k)=\alpha_{psn}\,d_{psn,m}(k)\,d_{psn,m}^H(k)+(1-\alpha_{psn})\,I$$

where $W_m(k)$ is the weight vector, relative to the predetermined direction, of the paths of voice signals received by the microphones during the mth beam processing; $k$ is the index of the frequency bands of the signals received by the microphones; $\Phi_m(k)$ is the covariance matrix of the noise during the mth beam processing and $\Phi_m^{-1}(k)$ is its inverse; $d_m(k)$ is the microphone array pointing vector in the predetermined direction during the mth beam processing and $d_m^H(k)$ is its conjugate transpose; $\alpha_{psn}$ is the proportion of point source interference noise at the predetermined azimuth in the noise and $1-\alpha_{psn}$ is the proportion of white noise in the noise; and $d_{psn,m}(k)$ is the pointing vector of the predetermined-azimuth point source interference noise during the mth beam processing, with $d_{psn,m}^H(k)$ its conjugate transpose.
In some embodiments, the apparatus further comprises: a model training module configured to perform a beamforming process on the voice signal in a plurality of predetermined directions to obtain a plurality of beams, label the plurality of beams with keywords to serve as training beams, and input the training beams into the keyword recognition model for training to obtain the pre-trained keyword recognition model.
In some embodiments, the apparatus further comprises: an echo cancellation module configured to perform echo cancellation on the voice signal received through the microphones.
In some embodiments, the keyword recognition model comprises: a deep learning model or a hidden Markov model.
According to still other embodiments of the present disclosure, there is provided a voice wake-up apparatus including: a memory; and a processor coupled to the memory, the processor configured to perform the voice wake-up method of any of the preceding embodiments based on instructions stored in the memory.
According to still further embodiments of the present disclosure, there is provided a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the voice wake-up method of any of the preceding embodiments.
According to the method, the voice signal is beamformed in a plurality of directions to obtain a plurality of beams; the beams are input into a keyword recognition model to obtain the probability that each beam contains the keyword; a sound source beam is then selected based on that probability and the signal quality of each beam; and whether to wake up the system is determined according to the feature matching results of the sound source beams at a plurality of time instants. The method does not follow the existing localize-then-wake process: it decouples the beamforming algorithm from the sound source localization algorithm, so the beamforming directions are not affected by the accuracy of sound source localization, which improves the wake-up accuracy of the voice system and the user experience.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
The present disclosure provides a voice wake-up method, and some embodiments of the voice wake-up method of the present disclosure are described below in conjunction with fig. 1.
Fig. 1 is a flow chart of some embodiments of a voice wake-up method of the present disclosure. As shown in fig. 1, the method of this embodiment includes: steps S102 to S108.
In step S102, a voice signal is beamformed in a plurality of predetermined directions, resulting in a plurality of beams.
A plurality of microphones, i.e., a microphone array, can be arranged on the voice recognition system to be woken up by voice, in order to receive the voice signals of users. The voice signal may first be pre-processed; for example, echo cancellation may be performed on the voice signals received by the microphone array, e.g., by an acoustic echo cancellation (AEC) algorithm.
The preprocessed voice signals may be beamformed in a plurality of predetermined directions using a phased-array beamforming algorithm. Phased-array processing here means that M directions are preset (for example, evenly distributed on a circle) and the multiple paths of voice signals received by the microphone array are weighted and summed M times, forming M paths of voice signals each enhanced toward its own specific direction; that is, the formed beams point in the M predetermined directions. The beamforming algorithm may employ, for example, MVDR (Minimum Variance Distortionless Response), GSC (Generalized Sidelobe Canceller), TF-GSC (Transfer-Function Generalized Sidelobe Canceller), and the like. Beamforming in a plurality of predetermined directions can be achieved by existing algorithms, which are not described herein.
The present disclosure also provides an improved beamforming algorithm, described below.
In some embodiments, the weights of the paths of voice signals received by the microphones relative to the predetermined direction are determined according to the direction of the point source noise, the proportions of the point source noise and the white noise, and the directional vector of the predetermined direction; the paths of voice signals received by the microphones are then weighted and summed according to these weights to determine the beam in the predetermined direction. Beamforming may be performed according to the following formulas.

$$X_n(k,l)=\mathrm{FFT}\big(x_n(t)\big) \tag{1}$$

In formula (1), $x_n(t)$ is the speech signal received by the nth microphone, and $\mathrm{FFT}(\cdot)$ denotes the fast Fourier transform. $X_n(k,l)$ is the short-time spectral value of $x_n(t)$ in the kth frequency band of the lth time period, where $l$ indicates that the speech signal is windowed into time periods processed separately, and $k$ indexes the frequency bands of each speech signal after the FFT.

$$Y_m(k,l)=\sum_{n=1}^{N}w_{m,n}(k)\,X_n(k,l) \tag{2}$$

$$y_m(t)=\mathrm{IFFT}\big(Y_m(k,l)\big) \tag{3}$$

In formulas (2) and (3), $y_m(t)$ is the output signal of the beam formed in the mth predetermined azimuth by the phased array, $\mathrm{IFFT}(\cdot)$ denotes the inverse fast Fourier transform, $Y_m(k,l)$ is the short-time spectral value of $y_m(t)$ in the kth frequency band of the lth time period, and $w_{m,n}(k)$ is the weight of the voice signal received by the nth microphone in the kth frequency band during the mth beam processing. As the formulas show, once the weights $w_{m,n}(k)$ are determined, the signal of beam m in the predetermined direction can be determined.

$$W_m(k)=\frac{\Phi_m^{-1}(k)\,d_m(k)}{d_m^H(k)\,\Phi_m^{-1}(k)\,d_m(k)} \tag{4}$$

In formula (4), $W_m(k)=[w_{m,1}(k),\dots,w_{m,N}(k)]^T$ is the weight vector, relative to the predetermined direction, of the voice signals received by the microphones during the mth beam processing. It is an N-dimensional vector, and the weight vector can be considered the same for every time period, so knowing $W_m(k)$ yields each $w_{m,n}(k)$. $\Phi_m(k)$ is the covariance matrix of the noise during the mth beam processing, and $\Phi_m^{-1}(k)$ is its inverse. $d_m(k)$ is the microphone array pointing vector of the azimuth expected to be enhanced (i.e., the predetermined azimuth) during the mth beam processing; it is an N-dimensional column vector set by the predetermined direction, and $d_m^H(k)$ is its conjugate transpose.

The covariance matrix can further be obtained according to formula (5):

$$\Phi_m(k)=\alpha_{psn}\,d_{psn,m}(k)\,d_{psn,m}^H(k)+(1-\alpha_{psn})\,I \tag{5}$$

In formula (5), $\alpha_{psn}$ is the proportion of fixed-azimuth point source interference noise in the noise, and $1-\alpha_{psn}$ is the proportion of white noise in the noise; $\alpha_{psn}$ may be obtained from testing or experience. $d_{psn,m}(k)$ is the pointing vector of the fixed-azimuth point source interference noise during the mth beam processing, and $d_{psn,m}^H(k)$ is its conjugate transpose; $d_{psn,m}(k)$ may likewise be obtained from testing or experience.

The beam signal in each predetermined direction can be calculated by the above formulas, and the multiple beamforming processes can be executed in parallel to obtain the multiple beams.
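The frequency-domain pipeline of formulas (1)–(3) can be sketched as below, with framing simplified (windowing and overlap-add omitted) and the weights assumed time-invariant:

```python
import numpy as np

def beamform_frames(x, weights):
    # Formulas (1)-(3): FFT each channel per frame, weighted sum over
    # microphones in each frequency band, then inverse FFT.
    # x: (n_mics, n_frames, frame_len) real frames; windowing omitted.
    # weights: (n_mics, n_bins) complex weights W_m(k), applied directly
    # as in formula (2).
    X = np.fft.rfft(x, axis=-1)                      # X_n(k, l), formula (1)
    Y = np.einsum('nk,nlk->lk', weights, X)          # Y_m(k, l), formula (2)
    return np.fft.irfft(Y, n=x.shape[-1], axis=-1)   # y_m(t), formula (3)

# Sanity check: a unit weight on one channel passes that channel through.
frames = np.random.default_rng(0).normal(size=(4, 10, 256))
w = np.zeros((4, 129), dtype=complex)
w[0] = 1.0
out = beamform_frames(frames, w)
```

In practice the weight matrix would come from formulas (4)–(5) per frequency bin, and one such call would run per predetermined direction, which is what allows the M beamforming processes to execute in parallel.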
In step S104, the beam is input to a keyword recognition model trained in advance, and the probability that the beam includes the keyword is obtained.
The voice system determines whether to record subsequent speech and perform speech recognition by recognizing keywords in the speech; that is, whether the voice system is subsequently woken up is determined by detecting keywords in the speech. The keyword recognition model is, for example, a deep learning model or a hidden Markov model. Examples of deep learning models include DNNs (Deep Neural Networks), RNNs (Recurrent Neural Networks), CRNNs (Convolutional Recurrent Neural Networks), and the like. These are all existing models and are not described in detail herein. When training the keyword recognition model, a plurality of beams may be generated according to the embodiment of step S102 and labeled with whether they contain the keyword, to serve as training beams; the training beams are input into the keyword recognition model for offline training to obtain the pre-trained keyword recognition model. A beam input into the pre-trained keyword recognition model thus yields the probability that the beam contains the keyword.
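Since the disclosure relies on existing keyword models, the per-beam inference step can only be sketched with a stand-in scorer; the feature extractor and the random-weight logistic layer below are hypothetical placeholders for a trained DNN/RNN/HMM:

```python
import numpy as np

rng = np.random.default_rng(0)

def beam_features(beam, frame_len=256):
    # Toy feature extractor: log-magnitude spectrum averaged over frames.
    frames = beam[: len(beam) // frame_len * frame_len].reshape(-1, frame_len)
    spec = np.abs(np.fft.rfft(frames, axis=-1))
    return np.log1p(spec).mean(axis=0)

class KeywordModelStub:
    # Stand-in for the pre-trained keyword recognition model; a real system
    # would load trained network weights instead of random ones.
    def __init__(self, n_features):
        self.w = rng.normal(size=n_features)
        self.b = 0.0

    def predict_proba(self, feats):
        # Sigmoid output: probability that the beam contains the keyword.
        return 1.0 / (1.0 + np.exp(-(feats @ self.w + self.b)))

beams = [rng.normal(size=4096) for _ in range(6)]   # one waveform per beam
model = KeywordModelStub(n_features=129)            # 129 = rfft bins of 256
probs = [model.predict_proba(beam_features(b)) for b in beams]
```

The point of the interface is only that every beam is scored independently, producing one keyword probability per predetermined direction for the selection step that follows.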
In step S106, a beam pointing to the sound source direction is determined as a sound source beam according to the probability that the beam contains the keyword and the signal quality of the beam.
In some embodiments, the signal quality of the beam is determined based on at least one of an energy and a signal-to-noise ratio of the beam within a fixed time window. The higher the energy of the beam within a fixed time window, the higher the signal-to-noise ratio, and the better the signal quality. For example, the energy and signal-to-noise ratio of a beam within a fixed time window may be calculated, and the weighted sum of the two parameters may be used to determine the signal quality of the beam. The weights of energy and signal-to-noise ratio can be set according to actual requirements. The energy and signal-to-noise ratio may be normalized and weighted.
In some embodiments, the probability that a beam contains the keyword and the signal quality of the beam are weighted and summed to obtain the importance degree of the beam; the beam with the highest importance degree is selected as the sound source beam, and the direction pointed to by the sound source beam is determined as the sound source direction. The beam in the sound source direction has better signal quality and is recognized as containing the keyword with higher probability, so the sound source beam can be selected according to both quantities. For example, the energy $\mathrm{power}_k$ and the signal-to-noise ratio $\mathrm{SNR}_k$ of each of the K beams within a fixed time window are calculated and normalized to obtain $\mathrm{power}'_k$ and $\mathrm{SNR}'_k$. With $\mathrm{NNscore}_k$ denoting the keyword recognition probability output by the keyword recognition model for the kth beam, the importance of the kth beam is then computed as a weighted sum, e.g.

$$\mathrm{Score}_k=a\cdot\mathrm{power}'_k+b\cdot\mathrm{SNR}'_k+c\cdot\mathrm{NNscore}_k$$

where the weights $a$, $b$ and $c$ can be set according to actual requirements.
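The importance computation and sound-source-beam selection can be sketched as follows; the weight values and sum-normalization are illustrative:

```python
import numpy as np

def beam_importance(power, snr, nn_score, w=(0.3, 0.3, 0.4)):
    # Weighted sum of normalized energy, normalized SNR, and keyword
    # probability per beam; the weights are illustrative, not from the source.
    p = power / power.sum()
    s = snr / snr.sum()
    return w[0] * p + w[1] * s + w[2] * nn_score

power = np.array([1.0, 4.0, 2.0])    # energy per beam in the time window
snr = np.array([2.0, 8.0, 3.0])      # SNR per beam
nn = np.array([0.1, 0.9, 0.3])       # keyword probability per beam
scores = beam_importance(power, snr, nn)
source_beam = int(np.argmax(scores)) # beam pointing in the sound source direction
```

Here the second beam wins on all three criteria, so it is chosen as the sound source beam and its predetermined direction becomes the estimated sound source direction.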
in step S108, it is determined whether to wake up the system according to the result of feature matching of the sound source beam at a plurality of consecutive time instances.
Whether to wake up the system could be determined directly according to whether the keyword probability of the sound source beam exceeds a threshold. However, the wake-up accuracy can be further improved through feature matching of the sound source beams at a plurality of consecutive time instants.
In some embodiments, the sound source directions pointed to by the sound source beams at the current time instant and a preset number of preceding consecutive time instants are matched, and it is determined whether the sound source beams at these consecutive instants all contain the keyword; when the sound source directions pointed to by the sound source beams at the consecutive instants are consistent and the sound source beams at these instants all contain the keyword, the system is woken up. Otherwise, the system is not woken up. That is, wake-up is confirmed according to the consistency of the keyword recognition and localization results at the instants t−p, t−p+1, …, t−1, t. If the keyword recognition and localization results are consistent across these instants, the system is woken up; otherwise, it cannot be woken up.
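The consecutive-instant matching can be sketched as follows; the window length p and the keyword threshold are illustrative:

```python
from collections import deque

def should_wake(history, p=4, keyword_threshold=0.5):
    # Wake only if the sound source direction is identical and the keyword
    # probability stays above threshold over the last p+1 instants.
    # history: sequence of (direction_index, keyword_probability) tuples.
    if len(history) < p + 1:
        return False
    recent = list(history)[-(p + 1):]
    dirs = {d for d, _ in recent}
    return len(dirs) == 1 and all(prob >= keyword_threshold for _, prob in recent)

h = deque(maxlen=8)
for obs in [(3, 0.9), (3, 0.8), (3, 0.95), (3, 0.7), (3, 0.85)]:
    h.append(obs)
# Direction 3 is consistent and all probabilities exceed the threshold.
```

A single inconsistent instant (a direction change or a sub-threshold probability) vetoes the wake-up, which is what suppresses spurious single-frame detections.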
In the method of the above embodiment, the voice signal is beamformed in a plurality of directions to obtain a plurality of beams; the beams are input into the keyword recognition model to obtain the probability that each beam contains the keyword; a sound source beam is then selected based on that probability and the signal quality of each beam; and whether to wake up the system is determined according to the feature matching results of the sound source beams at a plurality of time instants. The method does not follow the existing localize-then-wake process: it decouples the beamforming algorithm from the sound source localization algorithm, so the beamforming directions are not affected by the accuracy of sound source localization, which improves the wake-up accuracy of the voice system and the user experience.
Further embodiments of the disclosed voice wake-up method are described below in conjunction with fig. 2.
Fig. 2 is a flowchart of another embodiment of a voice wake-up method according to the present disclosure. As shown in fig. 2, the method of this embodiment includes: steps S202 to S214.
In step S202, a speech signal of a user is received through a microphone array.
In step S204, echo cancellation is performed on the multi-path speech signals received by the microphone array.
In step S206, the received voice signal is beamformed in a plurality of predetermined directions, resulting in a plurality of beams.
In step S208, partial beams are selected according to the signal quality of the beams.
In some embodiments, the signal quality of a beam is determined based on at least one of the energy and the signal-to-noise ratio of the beam within a fixed time window, and the partial beams whose signal quality is higher than the signal quality threshold are selected. The weights of energy and signal-to-noise ratio can be set according to actual requirements. For example, the energy $\mathrm{power}_k$ and the signal-to-noise ratio $\mathrm{SNR}_k$ of each beam within a fixed time window are calculated and normalized to obtain $\mathrm{power}'_k$ and $\mathrm{SNR}'_k$, and a signal quality score is then computed for each beam, e.g.

$$Q_k=a\cdot\mathrm{power}'_k+b\cdot\mathrm{SNR}'_k,\qquad k=1,2,\dots,M$$

The beams whose score $Q_k$ is higher than the signal quality threshold are selected, or the beams whose signal quality ranks within a predetermined number of top places are selected.
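The quality-based pre-selection can be sketched as follows; the weights and the sum-normalization scheme are illustrative:

```python
import numpy as np

def select_beams(power, snr, a=0.5, b=0.5, top_n=None, threshold=None):
    # Per-beam quality score: weighted sum of normalized energy and
    # normalized SNR (weights a, b are illustrative). Returns indices of
    # beams above the threshold, or the top_n best beams, best first.
    q = a * power / power.sum() + b * snr / snr.sum()
    order = np.argsort(-q)
    if threshold is not None:
        return [int(i) for i in order if q[i] > threshold]
    return [int(i) for i in order[:top_n]]

power = np.array([5.0, 1.0, 3.0, 0.5])
snr = np.array([10.0, 2.0, 6.0, 1.0])
picked = select_beams(power, snr, top_n=2)
```

Only the selected indices are forwarded to the keyword recognition model, which is what cuts the inference cost of the subsequent steps.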
Selecting beams of better quality in this way reduces the amount of computation in the subsequent steps and improves system efficiency and wake-up accuracy.
In step S210, the selected partial beams are input into a pre-trained keyword recognition model, so as to obtain the probability of the beams including the keywords.
In step S212, a beam pointing to the sound source direction is determined as a sound source beam according to the probability that the beam contains the keyword and the signal quality of the beam.
In step S214, it is determined whether to wake up the system according to the result of feature matching of the sound source beam at a plurality of consecutive time instances.
The present disclosure also provides a voice wake-up apparatus, which is described below with reference to fig. 3.
Fig. 3 is a block diagram of some embodiments of the disclosed voice wake-up apparatus. As shown in fig. 3, the apparatus 30 of this embodiment includes: a beamforming module 302, a keyword recognition module 304, a sound source determination module 306, and a voice wake-up module 308.
The beamforming module 302 is configured to perform beamforming on the voice signal in a plurality of predetermined directions, so as to obtain a plurality of beams.
In some embodiments, the beamforming module 302 is configured to determine the weight of each path of voice signal received by the microphones relative to a predetermined direction according to the direction of the point source noise, the proportions of the point source noise and the white noise, and the directional vector of the predetermined direction, and to perform weighted summation on the paths of voice signals received by the microphones according to these weights to determine the beam in the predetermined direction.
In some embodiments, beamforming may be performed according to the following formulas, the same as in the previous method embodiment.

$$X_n(k,l)=\mathrm{FFT}\big(x_n(t)\big) \tag{1}$$

$$Y_m(k,l)=\sum_{n=1}^{N}w_{m,n}(k)\,X_n(k,l) \tag{2}$$

$$y_m(t)=\mathrm{IFFT}\big(Y_m(k,l)\big) \tag{3}$$

$$W_m(k)=\frac{\Phi_m^{-1}(k)\,d_m(k)}{d_m^H(k)\,\Phi_m^{-1}(k)\,d_m(k)} \tag{4}$$

$$\Phi_m(k)=\alpha_{psn}\,d_{psn,m}(k)\,d_{psn,m}^H(k)+(1-\alpha_{psn})\,I \tag{5}$$

Here $x_n(t)$ is the speech signal received by the nth microphone; $\mathrm{FFT}(\cdot)$ and $\mathrm{IFFT}(\cdot)$ denote the fast Fourier transform and its inverse; and $X_n(k,l)$ is the short-time spectral value of $x_n(t)$ in the kth frequency band of the lth time period, where the speech signal is windowed into time periods indexed by $l$ and $k$ indexes the frequency bands after the FFT. $y_m(t)$ is the output signal of the beam formed in the mth predetermined azimuth, and $w_{m,n}(k)$ is the weight of the voice signal received by the nth microphone in the kth frequency band during the mth beam processing; once these weights are determined, the signal of beam m in the predetermined direction can be determined. $W_m(k)=[w_{m,1}(k),\dots,w_{m,N}(k)]^T$ is the N-dimensional weight vector relative to the predetermined direction, considered the same for every time period. $\Phi_m(k)$ is the covariance matrix of the noise during the mth beam processing, and $\Phi_m^{-1}(k)$ is its inverse. $d_m(k)$ is the microphone array pointing vector of the azimuth expected to be enhanced (i.e., the predetermined azimuth); it is an N-dimensional column vector set by the predetermined direction, and $d_m^H(k)$ is its conjugate transpose. $\alpha_{psn}$ is the proportion of fixed-azimuth point source interference noise in the noise, and $1-\alpha_{psn}$ is the proportion of white noise in the noise; $d_{psn,m}(k)$ is the pointing vector of the fixed-azimuth point source interference noise, and $d_{psn,m}^H(k)$ is its conjugate transpose. Both $\alpha_{psn}$ and $d_{psn,m}(k)$ may be obtained from testing or experience.
The keyword recognition module 304 is configured to input the beams into the pre-trained keyword recognition model to obtain the probability that each beam contains the keyword.
In some embodiments, the keyword recognition model comprises: a deep learning model or a hidden Markov model.
The sound source determining module 306 is configured to determine the beam pointing in the sound source direction as the sound source beam according to the probability that the beam contains the keyword and the signal quality of the beam.
In some embodiments, the sound source determining module 306 is configured to perform weighted summation on the probability that the beam contains the keyword and the signal quality of the beam to obtain the importance degree of the beam, select the beam with the highest importance degree as the sound source beam, and determine the direction pointed to by the sound source beam as the sound source direction.
The voice wake-up module 308 is configured to determine whether to wake up the system according to the feature matching results of the sound source beams at a plurality of consecutive time instants.
In some embodiments, the voice wake-up module 308 is configured to match the sound source directions pointed to by the sound source beams at a plurality of consecutive time instants, determine whether the sound source beams at those instants all contain the keyword, and wake up the system if the sound source directions are consistent and the sound source beams at those instants all contain the keyword.
Further embodiments of the disclosed voice wake-up apparatus are described below in conjunction with fig. 4.
Fig. 4 is a block diagram of another embodiment of a voice wake-up apparatus according to the present disclosure. As shown in fig. 4, the apparatus 40 of this embodiment includes: an echo cancellation module 402, a beamforming module 404, a beam selection module 406, a keyword recognition module 408, a sound source determination module 410, a voice wake-up module 412, and a model training module 414.
The echo cancellation module 402 is configured to perform echo cancellation on the voice signal received through the microphones.
The beamforming module 404 is configured to perform beamforming on the voice signal in a plurality of predetermined directions, so as to obtain a plurality of beams. The beamforming module 404 functions the same as the beamforming module 302.
The beam selection module 406 is configured to select a part of the beams according to the signal quality of the beams and send the selected beams to the keyword recognition module 408, so that the keyword recognition module 408 inputs the received beams into a pre-trained keyword recognition model.
In some embodiments, the beam selection module 406 is configured to determine the signal quality of a beam according to at least one of the energy and the signal-to-noise ratio of the beam within a fixed time window, and to select the partial beams whose signal quality is higher than a signal quality threshold.
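A sketch of this selection step under assumed parameters (the noise floor constant and threshold value below are illustrative, not taken from the disclosure), using windowed energy as the quality measure with an SNR estimate derived from it:

```python
import numpy as np

def beam_signal_quality(beam, noise_floor=1e-3):
    """Signal quality of a beam over a fixed time window: mean energy,
    plus an SNR estimate against an assumed constant noise floor."""
    window = np.asarray(beam, dtype=float)
    energy = float(np.mean(window ** 2))
    snr_db = 10.0 * np.log10(energy / noise_floor + 1e-12)
    return energy, snr_db

def select_beams(beams, quality_threshold):
    """Keep indices of beams whose energy exceeds the threshold."""
    return [i for i, b in enumerate(beams)
            if beam_signal_quality(b)[0] > quality_threshold]

# The low-amplitude first beam falls below the threshold.
beams = [np.full(16, 0.01), np.full(16, 0.5), np.full(16, 0.9)]
kept = select_beams(beams, quality_threshold=0.1)
```

Only the retained beams would be forwarded to the keyword recognition model, reducing its computational load.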
The keyword recognition module 408 is configured to input the beam into a pre-trained keyword recognition model to obtain the probability that the beam contains the keyword. The keyword recognition module 408 has the same function as the keyword recognition module 304.
The sound source determination module 410 is configured to determine the beam pointing in the sound source direction as the sound source beam according to the probability that the beam contains the keyword and the signal quality of the beam. The sound source determination module 410 has the same function as the sound source determination module 306.
The voice wake-up module 412 is configured to determine whether to wake up the system according to the feature matching results of the sound source beams at a plurality of consecutive time instances. The voice wake-up module 412 has the same function as the voice wake-up module 308.
The model training module 414 is configured to perform a beamforming process on the voice signal in a plurality of predetermined directions to obtain a plurality of beams, perform keyword labeling on the plurality of beams to obtain training beams, and input the training beams into the keyword recognition model for training to obtain a pre-trained keyword recognition model.
The model training module 414 may also be configured to receive the plurality of beams obtained by the beamforming module 404 or by the beam selection module 406, perform keyword labeling on the plurality of beams to obtain training beams, and input the training beams into the keyword recognition model for training to obtain a pre-trained keyword recognition model.
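The keyword labeling step above amounts to pairing each beamformed signal with a keyword/non-keyword label; a minimal illustrative sketch (the data layout is an assumption, not the disclosed training format):

```python
def label_training_beams(beams, keyword_flags):
    """Pair each beamformed signal with a binary keyword label,
    forming the training set for the keyword recognition model."""
    assert len(beams) == len(keyword_flags)
    return [(beam, 1 if flag else 0)
            for beam, flag in zip(beams, keyword_flags)]

# One beam containing the keyword, one without.
training_set = label_training_beams([[0.1, 0.2], [0.3, 0.4]],
                                    [True, False])
```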
The voice wake-up apparatuses in the embodiments of the present disclosure may each be implemented by various computing devices or computer systems, which are described below in conjunction with figs. 5 and 6.
Fig. 5 is a block diagram of some embodiments of the disclosed voice wake-up apparatus. As shown in fig. 5, the apparatus 50 of this embodiment includes: a memory 510 and a processor 520 coupled to the memory 510, the processor 520 being configured to perform the voice wake-up method in any of the embodiments of the present disclosure based on instructions stored in the memory 510.
The memory 510 may include, for example, system memory, a fixed non-volatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader (Boot Loader), a database, and other programs.
Fig. 6 is a block diagram of another embodiment of a voice wake-up apparatus according to the present disclosure. As shown in fig. 6, the apparatus 60 of this embodiment includes a memory 610 and a processor 620, which are similar to the memory 510 and the processor 520, respectively. The apparatus 60 may also include an input/output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, the memory 610, and the processor 620 may be connected, for example, via a bus 660. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 640 provides a connection interface for various networking devices, for example, a database server or a cloud storage server. The storage interface 650 provides a connection interface for external storage devices such as an SD card and a USB flash drive.
The present disclosure also provides a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the voice wake-up method of any of the foregoing embodiments.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only exemplary of the present disclosure and is not intended to limit the present disclosure, so that any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.