Disclosure of Invention
The inventors have found that current sound source localization techniques can be broadly divided into three categories according to their localization principle: steerable beamforming based on maximum output power, techniques based on time difference of arrival (TDOA), and localization based on high-resolution spectral estimation. The performance of all three kinds of sound source localization algorithms drops sharply in environments with severe reverberation and noise interference, so the angle and direction of the sound source cannot be located accurately, which directly affects subsequent speech recognition and, in turn, the voice wake-up result.
One technical problem to be solved by the present disclosure is: how to improve the accuracy of voice wake-up and thereby improve the user experience.
According to some embodiments of the present disclosure, there is provided a voice wake-up method, including: performing beamforming on a voice signal in a plurality of predetermined directions to obtain a plurality of beams; inputting the beams into a pre-trained keyword recognition model to obtain the probability that each beam contains a keyword; determining the beam pointing in the sound source direction as the sound source beam according to the probability that each beam contains the keyword and the signal quality of each beam; and determining whether to wake up the system according to the feature matching results of the sound source beams at a plurality of consecutive time instants.
In some embodiments, inputting the beams into a pre-trained keyword recognition model comprises: selecting partial beams according to the signal quality of the beams and inputting the selected partial beams into the pre-trained keyword recognition model.
In some embodiments, selecting the partial beams according to the signal quality of the beams includes: determining the signal quality of each beam according to at least one of the energy and the signal-to-noise ratio of the beam within a fixed time window; and selecting the partial beams whose signal quality is higher than a signal quality threshold.
In some embodiments, determining the beam pointing in the sound source direction as the sound source beam according to the probability that each beam contains the keyword and the signal quality of each beam comprises: performing weighted summation on the probability that a beam contains the keyword and the signal quality of the beam to obtain the importance degree of the beam; and selecting the beam with the highest importance degree as the sound source beam, and determining the direction pointed to by the sound source beam as the sound source direction.
In some embodiments, determining whether to wake up the system according to the feature matching results of the sound source beams at a plurality of consecutive time instants comprises: matching the sound source directions pointed to by the sound source beams at the plurality of consecutive time instants, and determining whether the sound source beams at those instants all contain the keyword; and waking up the system when the sound source directions pointed to by the sound source beams at the plurality of consecutive time instants are consistent and the sound source beams at those instants all contain the keyword.
In some embodiments, performing beamforming on the voice signal in a plurality of predetermined directions to obtain a plurality of beams comprises: determining the weight of each path of voice signal received by the microphones relative to a predetermined direction according to the direction of the point source noise, the proportions of the point source noise and the white noise, and the directional vector of the predetermined direction; and performing weighted summation on the paths of voice signals received by the microphones according to these weights to determine the beam in the predetermined direction.
In some embodiments, the weight of each path of speech signal received by the microphones relative to the predetermined direction is calculated according to the following formulas:

$$W_m(k)=\frac{\Phi_m^{-1}(k)\,d_m(k)}{d_m^H(k)\,\Phi_m^{-1}(k)\,d_m(k)},\qquad \Phi_m(k)=\alpha_{psn}\,d_{psn,m}(k)\,d_{psn,m}^H(k)+(1-\alpha_{psn})\,I$$

where $W_m(k)$ is the weight vector, relative to the predetermined direction, of the paths of voice signals received by the microphones during the mth beam processing; $k$ is the index of the frequency bands of the signals received by the microphones; $\Phi_m(k)$ is the covariance matrix of the noise during the mth beam processing and $\Phi_m^{-1}(k)$ is its inverse; $d_m(k)$ is the microphone array pointing vector in the predetermined direction during the mth beam processing and $d_m^H(k)$ is its conjugate transpose; $\alpha_{psn}$ is the proportion of point source interference noise at the predetermined azimuth in the noise and $1-\alpha_{psn}$ is the proportion of white noise in the noise; and $d_{psn,m}(k)$ is the pointing vector of the predetermined-azimuth point source interference noise during the mth beam processing, with $d_{psn,m}^H(k)$ its conjugate transpose.
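A minimal numerical sketch of this MVDR-style weight computation can be written as follows; the uniform linear array geometry, element count, frequency, and α value are illustrative assumptions, not taken from the disclosure:

```python
import numpy as np

def steering_vector(freq_hz, angle_rad, n_mics=4, spacing_m=0.05, c=343.0):
    # Pointing vector d(k) of a uniform linear array for one frequency band;
    # the ULA geometry here is an illustrative assumption.
    delays = np.arange(n_mics) * spacing_m * np.cos(angle_rad) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

def mvdr_weights(freq_hz, look_angle, noise_angle, alpha_psn=0.3, n_mics=4):
    # W_m(k) = Phi^{-1} d / (d^H Phi^{-1} d), with the noise covariance
    # modeled as alpha_psn * d_psn d_psn^H + (1 - alpha_psn) * I.
    d = steering_vector(freq_hz, look_angle, n_mics)
    d_psn = steering_vector(freq_hz, noise_angle, n_mics)
    phi = alpha_psn * np.outer(d_psn, d_psn.conj()) + (1 - alpha_psn) * np.eye(n_mics)
    phi_inv = np.linalg.inv(phi)
    return phi_inv @ d / (d.conj() @ phi_inv @ d)

# The distortionless constraint W^H d = 1 holds in the look direction.
w = mvdr_weights(1000.0, np.deg2rad(90), np.deg2rad(30))
```

Because the covariance is Hermitian, these weights pass the look direction with unit gain while attenuating the modeled point-source interference direction.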
In some embodiments, the method further comprises: performing a beamforming process on the voice signal in a plurality of predetermined directions to obtain a plurality of beams; labeling the plurality of beams with keywords to serve as training beams; and inputting the training beams into the keyword recognition model for training to obtain the pre-trained keyword recognition model.
In some embodiments, before the beamforming of the voice signal in the plurality of predetermined directions, the method further comprises: performing echo cancellation on the voice signal received through the microphones.
In some embodiments, the keyword recognition model comprises: a deep learning model or a hidden Markov model.
According to other embodiments of the present disclosure, there is provided a voice wake-up apparatus including: a beamforming module configured to perform beamforming on a voice signal in a plurality of predetermined directions to obtain a plurality of beams; a keyword recognition module configured to input the beams into a pre-trained keyword recognition model to obtain the probability that each beam contains a keyword; a sound source determining module configured to determine the beam pointing in the sound source direction as the sound source beam according to the probability that each beam contains the keyword and the signal quality of each beam; and a voice wake-up module configured to determine whether to wake up the system according to the feature matching results of the sound source beams at a plurality of consecutive time instants.
In some embodiments, the apparatus further comprises: a beam selection module configured to select partial beams according to the signal quality of the beams and send the selected beams to the keyword recognition module, so that the keyword recognition module inputs the received beams into the pre-trained keyword recognition model.
In some embodiments, the beam selection module is configured to determine the signal quality of a beam based on at least one of the energy and the signal-to-noise ratio of the beam within a fixed time window, and to select the partial beams whose signal quality is higher than the signal quality threshold.
In some embodiments, the sound source determining module is configured to perform weighted summation on the probability that the beam includes the keyword and the signal quality of the beam to obtain the importance degree of the beam, select the beam with the highest importance degree as the sound source beam, and determine the direction pointed by the sound source beam as the sound source direction.
In some embodiments, the voice wake-up module is configured to match sound source directions pointed by sound source beams at multiple consecutive time instances, and determine whether the sound source beams at the multiple consecutive time instances all contain a keyword, and wake up the system if the sound source directions pointed by the sound source beams at the multiple consecutive time instances are consistent, and the sound source beams at the multiple consecutive time instances all contain the keyword.
In some embodiments, the beam forming module is configured to determine a weight of each path of voice signals received by the microphone with respect to a predetermined direction according to a direction of the point source noise, a ratio of the point source noise to the white noise, and a directional vector of the predetermined direction, and perform weighted summation on each path of voice signals received by the microphone according to the weight of each path of voice signals received by the microphone with respect to the predetermined direction to determine a beam in the predetermined direction.
In some embodiments, the weight of each path of speech signal received by the microphones relative to the predetermined direction is calculated according to the following formulas:

$$W_m(k)=\frac{\Phi_m^{-1}(k)\,d_m(k)}{d_m^H(k)\,\Phi_m^{-1}(k)\,d_m(k)},\qquad \Phi_m(k)=\alpha_{psn}\,d_{psn,m}(k)\,d_{psn,m}^H(k)+(1-\alpha_{psn})\,I$$

where $W_m(k)$ is the weight vector, relative to the predetermined direction, of the paths of voice signals received by the microphones during the mth beam processing; $k$ is the index of the frequency bands of the signals received by the microphones; $\Phi_m(k)$ is the covariance matrix of the noise during the mth beam processing and $\Phi_m^{-1}(k)$ is its inverse; $d_m(k)$ is the microphone array pointing vector in the predetermined direction during the mth beam processing and $d_m^H(k)$ is its conjugate transpose; $\alpha_{psn}$ is the proportion of point source interference noise at the predetermined azimuth in the noise and $1-\alpha_{psn}$ is the proportion of white noise in the noise; and $d_{psn,m}(k)$ is the pointing vector of the predetermined-azimuth point source interference noise during the mth beam processing, with $d_{psn,m}^H(k)$ its conjugate transpose.
In some embodiments, the apparatus further comprises: a model training module configured to perform a beamforming process on the voice signal in a plurality of predetermined directions to obtain a plurality of beams, label the plurality of beams with keywords to serve as training beams, and input the training beams into the keyword recognition model for training to obtain the pre-trained keyword recognition model.
In some embodiments, the apparatus further comprises: an echo cancellation module configured to perform echo cancellation on the voice signal received through the microphones.
In some embodiments, the keyword recognition model comprises: a deep learning model or a hidden Markov model.
According to still other embodiments of the present disclosure, there is provided a voice wake-up apparatus including: a memory; and a processor coupled to the memory, the processor configured to perform the voice wake-up method of any of the preceding embodiments based on instructions stored in the memory.
According to still further embodiments of the present disclosure, there is provided a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the voice wake-up method of any of the preceding embodiments.
According to the method, the voice signal is beamformed in a plurality of directions to obtain a plurality of beams; the beams are input into a keyword recognition model to obtain the probability that each beam contains the keyword; a sound source beam is then selected based on that probability and the signal quality of each beam; and whether to wake up the system is determined according to the feature matching results of the sound source beams at a plurality of time instants. The method does not follow the existing localize-then-wake process: it decouples the beamforming algorithm from the sound source localization algorithm, so the beamforming directions are not affected by the accuracy of sound source localization, which improves the wake-up accuracy of the voice system and the user experience.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
The present disclosure provides a voice wake-up method, and some embodiments of the voice wake-up method of the present disclosure are described below in conjunction with fig. 1.
Fig. 1 is a flow chart of some embodiments of a voice wake-up method of the present disclosure. As shown in fig. 1, the method of this embodiment includes: steps S102 to S108.
In step S102, a voice signal is beamformed in a plurality of predetermined directions, resulting in a plurality of beams.
A plurality of microphones, i.e., a microphone array, can be arranged on the voice recognition system to be woken up by voice, in order to receive the voice signals of users. The voice signal may first be pre-processed; for example, echo cancellation may be performed on the voice signals received by the microphone array, e.g., by an acoustic echo cancellation (AEC) algorithm.
The preprocessed voice signals may be beamformed in a plurality of predetermined directions using a phased-array beamforming algorithm. Phased-array processing here means that M directions are preset (for example, evenly distributed on a circle) and the multiple paths of voice signals received by the microphone array are weighted and summed M times, forming M paths of voice signals each enhanced toward its own specific direction; that is, the formed beams point in the M predetermined directions. The beamforming algorithm may employ, for example, MVDR (Minimum Variance Distortionless Response), GSC (Generalized Sidelobe Canceller), TF-GSC (Transfer-Function Generalized Sidelobe Canceller), and the like. Beamforming in a plurality of predetermined directions can be achieved by existing algorithms, which are not described herein.
The present disclosure also provides an improved beamforming algorithm, described below.
In some embodiments, the weights of the paths of voice signals received by the microphones relative to the predetermined direction are determined according to the direction of the point source noise, the proportions of the point source noise and the white noise, and the directional vector of the predetermined direction; the paths of voice signals received by the microphones are then weighted and summed according to these weights to determine the beam in the predetermined direction. Beamforming may be performed according to the following formulas.

$$X_n(k,l)=\mathrm{FFT}\big(x_n(t)\big) \tag{1}$$

In formula (1), $x_n(t)$ is the speech signal received by the nth microphone, and $\mathrm{FFT}(\cdot)$ denotes the fast Fourier transform. $X_n(k,l)$ is the short-time spectral value of $x_n(t)$ in the kth frequency band of the lth time period, where $l$ indicates that the speech signal is windowed into time periods processed separately, and $k$ indexes the frequency bands of each speech signal after the FFT.

$$Y_m(k,l)=\sum_{n=1}^{N}w_{m,n}(k)\,X_n(k,l) \tag{2}$$

$$y_m(t)=\mathrm{IFFT}\big(Y_m(k,l)\big) \tag{3}$$

In formulas (2) and (3), $y_m(t)$ is the output signal of the beam formed in the mth predetermined azimuth by the phased array, $\mathrm{IFFT}(\cdot)$ denotes the inverse fast Fourier transform, $Y_m(k,l)$ is the short-time spectral value of $y_m(t)$ in the kth frequency band of the lth time period, and $w_{m,n}(k)$ is the weight of the voice signal received by the nth microphone in the kth frequency band during the mth beam processing. As the formulas show, once the weights $w_{m,n}(k)$ are determined, the signal of beam m in the predetermined direction can be determined.

$$W_m(k)=\frac{\Phi_m^{-1}(k)\,d_m(k)}{d_m^H(k)\,\Phi_m^{-1}(k)\,d_m(k)} \tag{4}$$

In formula (4), $W_m(k)=[w_{m,1}(k),\dots,w_{m,N}(k)]^T$ is the weight vector, relative to the predetermined direction, of the voice signals received by the microphones during the mth beam processing. It is an N-dimensional vector, and the weight vector can be considered the same for every time period, so knowing $W_m(k)$ yields each $w_{m,n}(k)$. $\Phi_m(k)$ is the covariance matrix of the noise during the mth beam processing, and $\Phi_m^{-1}(k)$ is its inverse. $d_m(k)$ is the microphone array pointing vector of the azimuth expected to be enhanced (i.e., the predetermined azimuth) during the mth beam processing; it is an N-dimensional column vector set by the predetermined direction, and $d_m^H(k)$ is its conjugate transpose.

The covariance matrix can further be obtained according to formula (5):

$$\Phi_m(k)=\alpha_{psn}\,d_{psn,m}(k)\,d_{psn,m}^H(k)+(1-\alpha_{psn})\,I \tag{5}$$

In formula (5), $\alpha_{psn}$ is the proportion of fixed-azimuth point source interference noise in the noise, and $1-\alpha_{psn}$ is the proportion of white noise in the noise; $\alpha_{psn}$ may be obtained from testing or experience. $d_{psn,m}(k)$ is the pointing vector of the fixed-azimuth point source interference noise during the mth beam processing, and $d_{psn,m}^H(k)$ is its conjugate transpose; $d_{psn,m}(k)$ may likewise be obtained from testing or experience.

The beam signal in each predetermined direction can be calculated by the above formulas, and the multiple beamforming processes can be executed in parallel to obtain the multiple beams.
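The frequency-domain pipeline of formulas (1)–(3) can be sketched as below, with framing simplified (windowing and overlap-add omitted) and the weights assumed time-invariant:

```python
import numpy as np

def beamform_frames(x, weights):
    # Formulas (1)-(3): FFT each channel per frame, weighted sum over
    # microphones in each frequency band, then inverse FFT.
    # x: (n_mics, n_frames, frame_len) real frames; windowing omitted.
    # weights: (n_mics, n_bins) complex weights W_m(k), applied directly
    # as in formula (2).
    X = np.fft.rfft(x, axis=-1)                      # X_n(k, l), formula (1)
    Y = np.einsum('nk,nlk->lk', weights, X)          # Y_m(k, l), formula (2)
    return np.fft.irfft(Y, n=x.shape[-1], axis=-1)   # y_m(t), formula (3)

# Sanity check: a unit weight on one channel passes that channel through.
frames = np.random.default_rng(0).normal(size=(4, 10, 256))
w = np.zeros((4, 129), dtype=complex)
w[0] = 1.0
out = beamform_frames(frames, w)
```

In practice the weight matrix would come from formulas (4)–(5) per frequency bin, and one such call would run per predetermined direction, which is what allows the M beamforming processes to execute in parallel.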
In step S104, the beam is input to a keyword recognition model trained in advance, and the probability that the beam includes the keyword is obtained.
The voice system determines whether to record subsequent speech and perform speech recognition by recognizing keywords in the speech; that is, whether the voice system is subsequently woken up is determined by detecting keywords in the speech. The keyword recognition model is, for example, a deep learning model or a hidden Markov model. Examples of deep learning models include DNNs (Deep Neural Networks), RNNs (Recurrent Neural Networks), CRNNs (Convolutional Recurrent Neural Networks), and the like. These are all existing models and are not described in detail herein. When training the keyword recognition model, a plurality of beams may be generated according to the embodiment of step S102 and labeled with whether they contain the keyword, to serve as training beams; the training beams are input into the keyword recognition model for offline training to obtain the pre-trained keyword recognition model. A beam input into the pre-trained keyword recognition model thus yields the probability that the beam contains the keyword.
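Since the disclosure relies on existing keyword models, the per-beam inference step can only be sketched with a stand-in scorer; the feature extractor and the random-weight logistic layer below are hypothetical placeholders for a trained DNN/RNN/HMM:

```python
import numpy as np

rng = np.random.default_rng(0)

def beam_features(beam, frame_len=256):
    # Toy feature extractor: log-magnitude spectrum averaged over frames.
    frames = beam[: len(beam) // frame_len * frame_len].reshape(-1, frame_len)
    spec = np.abs(np.fft.rfft(frames, axis=-1))
    return np.log1p(spec).mean(axis=0)

class KeywordModelStub:
    # Stand-in for the pre-trained keyword recognition model; a real system
    # would load trained network weights instead of random ones.
    def __init__(self, n_features):
        self.w = rng.normal(size=n_features)
        self.b = 0.0

    def predict_proba(self, feats):
        # Sigmoid output: probability that the beam contains the keyword.
        return 1.0 / (1.0 + np.exp(-(feats @ self.w + self.b)))

beams = [rng.normal(size=4096) for _ in range(6)]   # one waveform per beam
model = KeywordModelStub(n_features=129)            # 129 = rfft bins of 256
probs = [model.predict_proba(beam_features(b)) for b in beams]
```

The point of the interface is only that every beam is scored independently, producing one keyword probability per predetermined direction for the selection step that follows.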
In step S106, a beam pointing to the sound source direction is determined as a sound source beam according to the probability that the beam contains the keyword and the signal quality of the beam.
In some embodiments, the signal quality of the beam is determined based on at least one of an energy and a signal-to-noise ratio of the beam within a fixed time window. The higher the energy of the beam within a fixed time window, the higher the signal-to-noise ratio, and the better the signal quality. For example, the energy and signal-to-noise ratio of a beam within a fixed time window may be calculated, and the weighted sum of the two parameters may be used to determine the signal quality of the beam. The weights of energy and signal-to-noise ratio can be set according to actual requirements. The energy and signal-to-noise ratio may be normalized and weighted.
In some embodiments, the probability that a beam contains the keyword and the signal quality of the beam are weighted and summed to obtain the importance degree of the beam; the beam with the highest importance degree is selected as the sound source beam, and the direction pointed to by the sound source beam is determined as the sound source direction. The beam in the sound source direction has better signal quality and is recognized as containing the keyword with higher probability, so the sound source beam can be selected according to both quantities. For example, the energy $\mathrm{power}_k$ and the signal-to-noise ratio $\mathrm{SNR}_k$ of each of the K beams within a fixed time window are calculated and normalized to obtain $\mathrm{power}'_k$ and $\mathrm{SNR}'_k$. With $\mathrm{NNscore}_k$ denoting the keyword recognition probability output by the keyword recognition model for the kth beam, the importance of the kth beam is then computed as a weighted sum, e.g.

$$\mathrm{Score}_k=a\cdot\mathrm{power}'_k+b\cdot\mathrm{SNR}'_k+c\cdot\mathrm{NNscore}_k$$

where the weights $a$, $b$ and $c$ can be set according to actual requirements.
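The importance computation and sound-source-beam selection can be sketched as follows; the weight values and sum-normalization are illustrative:

```python
import numpy as np

def beam_importance(power, snr, nn_score, w=(0.3, 0.3, 0.4)):
    # Weighted sum of normalized energy, normalized SNR, and keyword
    # probability per beam; the weights are illustrative, not from the source.
    p = power / power.sum()
    s = snr / snr.sum()
    return w[0] * p + w[1] * s + w[2] * nn_score

power = np.array([1.0, 4.0, 2.0])    # energy per beam in the time window
snr = np.array([2.0, 8.0, 3.0])      # SNR per beam
nn = np.array([0.1, 0.9, 0.3])       # keyword probability per beam
scores = beam_importance(power, snr, nn)
source_beam = int(np.argmax(scores)) # beam pointing in the sound source direction
```

Here the second beam wins on all three criteria, so it is chosen as the sound source beam and its predetermined direction becomes the estimated sound source direction.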
in step S108, it is determined whether to wake up the system according to the result of feature matching of the sound source beam at a plurality of consecutive time instances.
Whether to wake up the system could be determined directly according to whether the keyword probability of the sound source beam exceeds a threshold. However, the wake-up accuracy can be further improved through feature matching of the sound source beams at a plurality of consecutive time instants.
In some embodiments, the sound source directions pointed to by the sound source beams at the current time instant and a preset number of preceding consecutive time instants are matched, and it is determined whether the sound source beams at these consecutive instants all contain the keyword; when the sound source directions pointed to by the sound source beams at the consecutive instants are consistent and the sound source beams at these instants all contain the keyword, the system is woken up. Otherwise, the system is not woken up. That is, wake-up is confirmed according to the consistency of the keyword recognition and localization results at the instants t−p, t−p+1, …, t−1, t. If the keyword recognition and localization results are consistent across these instants, the system is woken up; otherwise, it cannot be woken up.
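The consecutive-instant matching can be sketched as follows; the window length p and the keyword threshold are illustrative:

```python
from collections import deque

def should_wake(history, p=4, keyword_threshold=0.5):
    # Wake only if the sound source direction is identical and the keyword
    # probability stays above threshold over the last p+1 instants.
    # history: sequence of (direction_index, keyword_probability) tuples.
    if len(history) < p + 1:
        return False
    recent = list(history)[-(p + 1):]
    dirs = {d for d, _ in recent}
    return len(dirs) == 1 and all(prob >= keyword_threshold for _, prob in recent)

h = deque(maxlen=8)
for obs in [(3, 0.9), (3, 0.8), (3, 0.95), (3, 0.7), (3, 0.85)]:
    h.append(obs)
# Direction 3 is consistent and all probabilities exceed the threshold.
```

A single inconsistent instant (a direction change or a sub-threshold probability) vetoes the wake-up, which is what suppresses spurious single-frame detections.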
In the method of the above embodiment, the voice signal is beamformed in a plurality of directions to obtain a plurality of beams; the beams are input into the keyword recognition model to obtain the probability that each beam contains the keyword; a sound source beam is then selected based on that probability and the signal quality of each beam; and whether to wake up the system is determined according to the feature matching results of the sound source beams at a plurality of time instants. The method does not follow the existing localize-then-wake process: it decouples the beamforming algorithm from the sound source localization algorithm, so the beamforming directions are not affected by the accuracy of sound source localization, which improves the wake-up accuracy of the voice system and the user experience.
Further embodiments of the disclosed voice wake-up method are described below in conjunction with fig. 2.
Fig. 2 is a flowchart of another embodiment of a voice wake-up method according to the present disclosure. As shown in fig. 2, the method of this embodiment includes: steps S202 to S214.
In step S202, a speech signal of a user is received through a microphone array.
In step S204, echo cancellation is performed on the multi-path speech signals received by the microphone array.
In step S206, the received voice signal is beamformed in a plurality of predetermined directions, resulting in a plurality of beams.
In step S208, partial beams are selected according to the signal quality of the beams.
In some embodiments, the signal quality of a beam is determined based on at least one of the energy and the signal-to-noise ratio of the beam within a fixed time window, and the partial beams whose signal quality is higher than the signal quality threshold are selected. The weights of energy and signal-to-noise ratio can be set according to actual requirements. For example, the energy $\mathrm{power}_k$ and the signal-to-noise ratio $\mathrm{SNR}_k$ of each beam within a fixed time window are calculated and normalized to obtain $\mathrm{power}'_k$ and $\mathrm{SNR}'_k$, and a signal quality score is then computed for each beam, e.g.

$$Q_k=a\cdot\mathrm{power}'_k+b\cdot\mathrm{SNR}'_k,\qquad k=1,2,\dots,M$$

The beams whose score $Q_k$ is higher than the signal quality threshold are selected, or the beams whose signal quality ranks within a predetermined number of top places are selected.
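The quality-based pre-selection can be sketched as follows; the weights and the sum-normalization scheme are illustrative:

```python
import numpy as np

def select_beams(power, snr, a=0.5, b=0.5, top_n=None, threshold=None):
    # Per-beam quality score: weighted sum of normalized energy and
    # normalized SNR (weights a, b are illustrative). Returns indices of
    # beams above the threshold, or the top_n best beams, best first.
    q = a * power / power.sum() + b * snr / snr.sum()
    order = np.argsort(-q)
    if threshold is not None:
        return [int(i) for i in order if q[i] > threshold]
    return [int(i) for i in order[:top_n]]

power = np.array([5.0, 1.0, 3.0, 0.5])
snr = np.array([10.0, 2.0, 6.0, 1.0])
picked = select_beams(power, snr, top_n=2)
```

Only the selected indices are forwarded to the keyword recognition model, which is what cuts the inference cost of the subsequent steps.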
Selecting beams of better quality in this way reduces the amount of computation in the subsequent steps and improves system efficiency and wake-up accuracy.
In step S210, the selected partial beams are input into a pre-trained keyword recognition model, so as to obtain the probability of the beams including the keywords.
In step S212, a beam pointing to the sound source direction is determined as a sound source beam according to the probability that the beam contains the keyword and the signal quality of the beam.
In step S214, it is determined whether to wake up the system according to the result of feature matching of the sound source beam at a plurality of consecutive time instances.
The present disclosure also provides a voice wake-up apparatus, which is described below with reference to fig. 3.
Fig. 3 is a block diagram of some embodiments of the disclosed voice wake-up apparatus. As shown in fig. 3, the apparatus 30 of this embodiment includes: a beamforming module 302, a keyword recognition module 304, a sound source determination module 306, and a voice wake-up module 308.
The beamforming module 302 is configured to perform beamforming on the voice signal in a plurality of predetermined directions, so as to obtain a plurality of beams.
In some embodiments, the beamforming module 302 is configured to determine the weight of each path of voice signal received by the microphones relative to a predetermined direction according to the direction of the point source noise, the proportions of the point source noise and the white noise, and the directional vector of the predetermined direction, and to perform weighted summation on the paths of voice signals received by the microphones according to these weights to determine the beam in the predetermined direction.
In some embodiments, beamforming may be performed according to the following formulas, the same as in the previous method embodiment.

$$X_n(k,l)=\mathrm{FFT}\big(x_n(t)\big) \tag{1}$$

$$Y_m(k,l)=\sum_{n=1}^{N}w_{m,n}(k)\,X_n(k,l) \tag{2}$$

$$y_m(t)=\mathrm{IFFT}\big(Y_m(k,l)\big) \tag{3}$$

$$W_m(k)=\frac{\Phi_m^{-1}(k)\,d_m(k)}{d_m^H(k)\,\Phi_m^{-1}(k)\,d_m(k)} \tag{4}$$

$$\Phi_m(k)=\alpha_{psn}\,d_{psn,m}(k)\,d_{psn,m}^H(k)+(1-\alpha_{psn})\,I \tag{5}$$

Here $x_n(t)$ is the speech signal received by the nth microphone; $\mathrm{FFT}(\cdot)$ and $\mathrm{IFFT}(\cdot)$ denote the fast Fourier transform and its inverse; and $X_n(k,l)$ is the short-time spectral value of $x_n(t)$ in the kth frequency band of the lth time period, where the speech signal is windowed into time periods indexed by $l$ and $k$ indexes the frequency bands after the FFT. $y_m(t)$ is the output signal of the beam formed in the mth predetermined azimuth, and $w_{m,n}(k)$ is the weight of the voice signal received by the nth microphone in the kth frequency band during the mth beam processing; once these weights are determined, the signal of beam m in the predetermined direction can be determined. $W_m(k)=[w_{m,1}(k),\dots,w_{m,N}(k)]^T$ is the N-dimensional weight vector relative to the predetermined direction, considered the same for every time period. $\Phi_m(k)$ is the covariance matrix of the noise during the mth beam processing, and $\Phi_m^{-1}(k)$ is its inverse. $d_m(k)$ is the microphone array pointing vector of the azimuth expected to be enhanced (i.e., the predetermined azimuth); it is an N-dimensional column vector set by the predetermined direction, and $d_m^H(k)$ is its conjugate transpose. $\alpha_{psn}$ is the proportion of fixed-azimuth point source interference noise in the noise, and $1-\alpha_{psn}$ is the proportion of white noise in the noise; $d_{psn,m}(k)$ is the pointing vector of the fixed-azimuth point source interference noise, and $d_{psn,m}^H(k)$ is its conjugate transpose. Both $\alpha_{psn}$ and $d_{psn,m}(k)$ may be obtained from testing or experience.
The keyword recognition module 304 is configured to input the beams into the pre-trained keyword recognition model to obtain the probability that each beam contains the keyword.
In some embodiments, the keyword recognition model comprises: a deep learning model or a hidden Markov model.
The sound source determining module 306 is configured to determine the beam pointing in the sound source direction as the sound source beam according to the probability that the beam contains the keyword and the signal quality of the beam.
In some embodiments, the sound source determining module 306 is configured to perform weighted summation on the probability that the beam contains the keyword and the signal quality of the beam to obtain the importance degree of the beam, select the beam with the highest importance degree as the sound source beam, and determine the direction pointed to by the sound source beam as the sound source direction.
The voice wake-up module 308 is configured to determine whether to wake up the system according to the feature matching results of the sound source beams at a plurality of consecutive time instants.
In some embodiments, the voice wake-up module 308 is configured to match the sound source directions pointed to by the sound source beams at a plurality of consecutive time instants, determine whether the sound source beams at those instants all contain the keyword, and wake up the system if the sound source directions are consistent and the sound source beams at those instants all contain the keyword.
Further embodiments of the disclosed voice wake-up apparatus are described below in conjunction with fig. 4.
Fig. 4 is a block diagram of another embodiment of a voice wake-up apparatus according to the present disclosure. As shown in fig. 4, the apparatus 40 of this embodiment includes: an echo cancellation module 402, a beamforming module 404, a beam selection module 406, a keyword recognition module 408, a sound source determination module 410, a voice wake-up module 412, and a model training module 414.
The echo cancellation module 402 is configured to perform echo cancellation on the voice signal received through the microphones.
The beamforming module 404 is configured to perform beamforming on the voice signal in a plurality of predetermined directions, so as to obtain a plurality of beams. The beamforming module 404 functions the same as the beamforming module 302.
The beam selection module 406 is configured to select a part of the beams according to the signal quality of the beams and send the selected beams to the keyword recognition module 408, so that the keyword recognition module 408 inputs the received beams into a pre-trained keyword recognition model.
In some embodiments, the beam selection module 406 is configured to determine the signal quality of a beam according to at least one of the energy and the signal-to-noise ratio of the beam within a fixed time window, and to select the partial beams whose signal quality is higher than a signal quality threshold.
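A sketch of this selection step under assumed parameters (the noise floor constant and threshold value below are illustrative, not taken from the disclosure), using windowed energy as the quality measure with an SNR estimate derived from it:

```python
import numpy as np

def beam_signal_quality(beam, noise_floor=1e-3):
    """Signal quality of a beam over a fixed time window: mean energy,
    plus an SNR estimate against an assumed constant noise floor."""
    window = np.asarray(beam, dtype=float)
    energy = float(np.mean(window ** 2))
    snr_db = 10.0 * np.log10(energy / noise_floor + 1e-12)
    return energy, snr_db

def select_beams(beams, quality_threshold):
    """Keep indices of beams whose energy exceeds the threshold."""
    return [i for i, b in enumerate(beams)
            if beam_signal_quality(b)[0] > quality_threshold]

# The low-amplitude first beam falls below the threshold.
beams = [np.full(16, 0.01), np.full(16, 0.5), np.full(16, 0.9)]
kept = select_beams(beams, quality_threshold=0.1)
```

Only the retained beams would be forwarded to the keyword recognition model, reducing its computational load.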
The keyword recognition module 408 is configured to input the beam into a pre-trained keyword recognition model to obtain the probability that the beam contains the keyword. The keyword recognition module 408 has the same function as the keyword recognition module 304.
The sound source determination module 410 is configured to determine the beam pointing in the sound source direction as the sound source beam according to the probability that the beam contains the keyword and the signal quality of the beam. The sound source determination module 410 has the same function as the sound source determination module 306.
The voice wake-up module 412 is configured to determine whether to wake up the system according to the feature matching results of the sound source beams at a plurality of consecutive time instances. The voice wake-up module 412 has the same function as the voice wake-up module 308.
The model training module 414 is configured to perform a beamforming process on the voice signal in a plurality of predetermined directions to obtain a plurality of beams, perform keyword labeling on the plurality of beams to obtain training beams, and input the training beams into the keyword recognition model for training to obtain a pre-trained keyword recognition model.
The model training module 414 may also be configured to receive the plurality of beams obtained by the beamforming module 404 or by the beam selection module 406, perform keyword labeling on the plurality of beams to obtain training beams, and input the training beams into the keyword recognition model for training to obtain a pre-trained keyword recognition model.
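The keyword labeling step above amounts to pairing each beamformed signal with a keyword/non-keyword label; a minimal illustrative sketch (the data layout is an assumption, not the disclosed training format):

```python
def label_training_beams(beams, keyword_flags):
    """Pair each beamformed signal with a binary keyword label,
    forming the training set for the keyword recognition model."""
    assert len(beams) == len(keyword_flags)
    return [(beam, 1 if flag else 0)
            for beam, flag in zip(beams, keyword_flags)]

# One beam containing the keyword, one without.
training_set = label_training_beams([[0.1, 0.2], [0.3, 0.4]],
                                    [True, False])
```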
The voice wake-up apparatuses in the embodiments of the present disclosure may each be implemented by various computing devices or computer systems, which are described below in conjunction with figs. 5 and 6.
Fig. 5 is a block diagram of some embodiments of the disclosed voice wake-up apparatus. As shown in fig. 5, the apparatus 50 of this embodiment includes: a memory 510 and a processor 520 coupled to the memory 510, the processor 520 being configured to perform the voice wake-up method in any of the embodiments of the present disclosure based on instructions stored in the memory 510.
The memory 510 may include, for example, system memory, a fixed non-volatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader (Boot Loader), a database, and other programs.
Fig. 6 is a block diagram of another embodiment of a voice wake-up apparatus according to the present disclosure. As shown in fig. 6, the apparatus 60 of this embodiment includes a memory 610 and a processor 620, which are similar to the memory 510 and the processor 520, respectively. The apparatus 60 may also include an input/output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, the memory 610, and the processor 620 may be connected, for example, via a bus 660. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 640 provides a connection interface for various networking devices, for example, a database server or a cloud storage server. The storage interface 650 provides a connection interface for external storage devices such as an SD card and a USB flash drive.
The present disclosure also provides a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the voice wake-up method of any of the foregoing embodiments.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only exemplary of the present disclosure and is not intended to limit the present disclosure, so that any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.