Disclosure of Invention
The invention provides a stable online multi-channel speech dereverberation method and system to solve the above technical problem.
A stable online multi-channel speech dereverberation method, comprising:
step 1: performing first preprocessing on an input voice signal, and converting the voice signal subjected to the first preprocessing into a frequency domain from a time domain to obtain a frequency domain signal; meanwhile, calculating a covariance matrix of the input voice signal;
the first preprocessing comprises framing;
step 2: calculating a regularization vector corresponding to each frame signal in the frequency domain signal;
step 3: estimating the filter coefficients corresponding to the frequency domain signal by a recursive least squares method, with each frequency band processed independently;
step 4: calculating an auxiliary covariance matrix of the covariance among channels, and correcting the auxiliary covariance matrix based on the regularization vector calculated in step 2;
step 5: updating the filter coefficients based on the covariance matrix and the corrected auxiliary covariance matrix to obtain new filter coefficients, and outputting the new filter coefficients to a filtering module;
step 6: filtering, by the filtering module, the frequency domain signal according to the new filter coefficients to obtain a dereverberated frequency domain signal, converting the dereverberated signal from the frequency domain to the time domain, and transmitting it to a voice recognition system.
Preferably, before step 1, the method further comprises: acquiring a voice signal with a microphone array, and converting the voice signal into a digital signal;
step 1 converts the voice signal after the first preprocessing from the time domain to the frequency domain by a short-time Fourier transform;
step 2 calculates the regularization vector corresponding to each frame of the signal according to the number of microphones and the filter length;
step 6 converts the dereverberated signal from the frequency domain to the time domain by an inverse short-time Fourier transform.
Preferably, the microphone array is a linear array or a circular array or a spherical array.
Preferably, in the framing processing in step 1, the frame length is 512 sampling points, and the frame shift is half of the frame length.
Preferably, the step 4 calculates an auxiliary covariance matrix of the covariance between the channels by using an auxiliary orthogonal transformation.
Preferably, the first preprocessing comprises, performed in sequence: pre-emphasis processing, framing processing, windowing processing, and endpoint detection, wherein the endpoint detection determines the effective signal of the digital signal and extracts the effective signal portion as the signal output after the first preprocessing.
Preferably, after the voice signal is acquired by the microphone array, a second preprocessing is performed before the voice signal is converted into a digital signal, the second preprocessing comprising denoising;
the denoising processing comprises the following steps:
calculating the similarity of adjacent voice signals in the voice signals, and judging whether noise exists according to the similarity;
when noise exists, acquiring characteristic parameters of the noise contained in the voice signal;
denoising the voice signal according to the characteristic parameters;
and storing the denoised voice signal.
Preferably, the second preprocessing further includes a speech enhancement process, and the speech enhancement process includes:
determining the position and direction of a voice source according to the position of the microphone and the strength of the voice signal;
enhancing the voice from the direction of the voice source while attenuating sound from other directions.
A system for use in a dereverberation method as claimed in any preceding claim, the system comprising:
the first preprocessing module is used for performing the first preprocessing;
a first transformation module, configured to transform the first preprocessed voice signal from a time domain to a frequency domain;
a first calculation module for performing the calculation of a covariance matrix of the input speech signal;
a second calculation module, configured to perform the step 2;
a recursion module for performing said step 3;
a third calculation module for performing the step 4;
a filter coefficient update module for performing the step 5;
a filtering module, configured to perform filtering processing on the frequency domain signal in step 6;
a second transform module for converting the dereverberated signal from the frequency domain to the time domain.
Preferably, the system comprises:
the microphone array is used for acquiring a voice signal;
the input end of the second preprocessing module is connected with the output end of the microphone array;
and the input end of the audio coding and decoding chip is connected with the output end of the second preprocessing module, and the output end of the audio coding and decoding chip is connected with the first preprocessing module.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
In addition, terms such as "first" and "second" in the present invention are used for description only. They do not denote any order or sequence and do not limit the invention; they merely distinguish components or operations described with the same technical terms, and are not to be understood as indicating or implying relative importance or the number of the technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. Furthermore, the technical solutions and technical features of the various embodiments may be combined with one another, provided that the combination can be implemented by a person skilled in the art; where a combination is contradictory or cannot be implemented, it should be regarded as non-existent and outside the protection scope of the present invention.
An embodiment of the present invention provides a stable online multi-channel speech dereverberation method, as shown in fig. 1, including:
step 1: performing first preprocessing on an input voice signal, and converting the voice signal subjected to the first preprocessing into a frequency domain from a time domain to obtain a frequency domain signal; meanwhile, calculating a covariance matrix of the input voice signal;
the first preprocessing comprises framing;
step 2: calculating a regularization vector corresponding to each frame signal in the frequency domain signal;
step 3: estimating the filter coefficients corresponding to the frequency domain signal by a recursive least squares method, with each frequency band processed independently;
step 4: calculating an auxiliary covariance matrix of the covariance among channels, and correcting the auxiliary covariance matrix based on the regularization vector calculated in step 2, wherein a regularization control factor is introduced to adjust the amount of regularization applied to the matrix;
step 5: updating the filter coefficients based on the covariance matrix and the corrected auxiliary covariance matrix to obtain new filter coefficients, and outputting the new filter coefficients to a filtering module;
step 6: filtering, by the filtering module, the frequency domain signal according to the new filter coefficients to obtain a dereverberated frequency domain signal, converting the dereverberated signal from the frequency domain to the time domain, and transmitting it to a voice recognition system.
Preferably, the step 4 calculates an auxiliary covariance matrix of the covariance between the channels by using an auxiliary orthogonal transformation.
The working principle of the technical scheme is as follows: at present, online speech dereverberation is usually implemented with recursive least squares filtering, in which solving the covariance matrix is a key step. The technical scheme uses the regularization vector corresponding to each frame of the signal to correct the auxiliary covariance matrix of the covariance among channels, calculates the covariance matrix from the corrected auxiliary covariance matrix and the signal, and updates the filter coefficients.
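As a point of reference for the recursive least squares step, the following is a minimal sketch of plain exponentially weighted RLS applied to delayed linear prediction in a single frequency band. It is not the method of the invention: the regularization of step 4 and any signal-power weighting are omitted, and the function name `rls_dereverb_bin` and the parameters `taps`, `delay`, `forgetting` and `init` are hypothetical choices made for illustration.

```python
import numpy as np

def rls_dereverb_bin(Y, taps=1, delay=2, forgetting=0.999, init=1e2):
    """Online delayed linear prediction for one frequency band.

    Y: complex array of shape (frames, mics). Returns the a-priori
    prediction error, used here as the dereverberated estimate.
    """
    T, M = Y.shape
    K = M * taps                                  # filter size: mics x taps
    G = np.zeros((K, M), dtype=complex)           # filter coefficients
    P = init * np.eye(K, dtype=complex)           # inverse covariance estimate
    out = np.zeros_like(Y)
    for t in range(T):
        # Stack the delayed past frames of all microphones into one vector.
        past = [Y[t - delay - l] if t - delay - l >= 0
                else np.zeros(M, dtype=complex) for l in range(taps)]
        ybar = np.concatenate(past)
        err = Y[t] - G.conj().T @ ybar            # a-priori prediction error
        Py = P @ ybar
        gain = Py / (forgetting + np.vdot(ybar, Py).real)
        P = (P - np.outer(gain, ybar.conj()) @ P) / forgetting
        G = G + np.outer(gain, err.conj())        # coefficient update
        out[t] = err
    return out

# Synthetic single-channel band: a white source plus one late echo.
rng = np.random.default_rng(0)
T = 3000
s = (rng.standard_normal(T) + 1j * rng.standard_normal(T)) / np.sqrt(2)
Y = np.zeros((T, 1), dtype=complex)
for t in range(T):
    Y[t, 0] = s[t] + (0.6 * Y[t - 2, 0] if t >= 2 else 0.0)

out = rls_dereverb_bin(Y)
tail = slice(T // 2, T)                           # evaluate after convergence
reverb_before = np.mean(np.abs(Y[tail, 0] - s[tail]) ** 2)
reverb_after = np.mean(np.abs(out[tail, 0] - s[tail]) ** 2)
```

In this toy setting the residual echo power `reverb_after` should end up far below `reverb_before`, because the echo is an exact delayed copy of the observation. The inverse covariance estimate `P` is the quantity whose conditioning the regularization described in the text is meant to protect.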
The beneficial effects of the above technical scheme are: by regularizing the covariance matrix, the eigenvalue range of the matrix can be controlled and the matrix is kept from becoming ill-conditioned, which improves the stability of the algorithm and makes it less prone to divergence without affecting its dereverberation performance, so that correctly processed speech is obtained and the accuracy of voice recognition is ensured.
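The effect of regularization on the eigenvalue range can be illustrated with a small numerical sketch. It assumes that the regularization vector is added to the diagonal of the covariance matrix and scaled by a control factor; the text does not give the exact formula, so the helper names, the constant `1e-3` and the way the vector is sized from the number of microphones `M` and the filter length `L` are hypothetical.

```python
import numpy as np

def update_covariance(R, ybar, forgetting=0.99):
    """Exponentially weighted covariance update of the kind used in RLS."""
    return forgetting * R + np.outer(ybar, ybar.conj())

def regularize(R, reg_vector, control=1.0):
    """Add control * diag(reg_vector) to R (assumed form of the correction)."""
    return R + control * np.diag(reg_vector)

M, L = 2, 3                        # assumed: 2 microphones, 3 filter taps each
K = M * L
rng = np.random.default_rng(0)
R = np.zeros((K, K), dtype=complex)
for _ in range(200):
    taps = rng.standard_normal(L) + 1j * rng.standard_normal(L)
    ybar = np.tile(taps, M)        # identical channels: deliberately degenerate
    R = update_covariance(R, ybar)

# Hypothetical regularization vector with one entry per filter coefficient.
reg_vector = np.full(K, 1e-3 * np.trace(R).real / K)
R_reg = regularize(R, reg_vector)

cond_raw = np.linalg.cond(R)       # rank-deficient, so extremely large
cond_reg = np.linalg.cond(R_reg)   # bounded by (trace + delta) / delta
```

With identical channels the raw matrix has rank `L` instead of `M * L`, so its condition number is enormous, while the loaded matrix has every eigenvalue at least as large as the loading term. Raising or lowering `control` trades conditioning against bias in the resulting filter.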
In one embodiment, step 1 is preceded by: acquiring a voice signal by adopting a microphone array, and converting the voice signal into a digital signal;
step 1 converts the voice signal after the first preprocessing from the time domain to the frequency domain by a short-time Fourier transform;
step 2 calculates the regularization vector corresponding to each frame of the signal according to the number of microphones and the filter length;
step 6 converts the dereverberated signal from the frequency domain to the time domain by an inverse short-time Fourier transform.
The microphone array is a linear array or a circular array or a spherical array, and preferably, the microphone array element spacing is 3.5 cm.
The beneficial effects of the above technical scheme are: the microphone array makes it convenient to collect voice signals from different spatial directions; compared with the Fourier transform, the short-time Fourier transform better reveals how the frequency content of the signal changes over time.
In the framing process of step 1, the frame length is 512 sampling points and the frame shift is half of the frame length (256 sampling points), so adjacent frames overlap by 50%.
The beneficial effects of the above technical scheme are: selecting the appropriate frame length and frame shift facilitates accurate signal processing.
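The framing parameters and the transform pair can be sketched as follows. The frame length of 512 and the half-frame shift come from the text; the periodic Hann window, the 16 kHz sampling rate of the test signal and the function names `stft` and `istft` are assumptions made for illustration.

```python
import numpy as np

FRAME_LEN = 512            # frame length in samples, as stated in the text
HOP = FRAME_LEN // 2       # frame shift: half of the frame length

def stft(x, frame_len=FRAME_LEN, hop=HOP):
    """Split x into overlapping frames, window them, and FFT each frame."""
    # Periodic Hann window: copies shifted by frame_len / 2 sum to exactly 1.
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1)   # (n_frames, frame_len//2 + 1)

def istft(spec, frame_len=FRAME_LEN, hop=HOP):
    """Inverse FFT of each frame followed by overlap-add."""
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)     # one second at an assumed 16 kHz rate
spec = stft(x)                     # one row per frame, one column per band
y = istft(spec)                    # time-domain signal after the round trip
```

Because the window copies sum to one at a 50% overlap, the round trip reproduces the input everywhere except the first and last half frame, which are covered by only one window. Each column of `spec` is one frequency band, which is the unit on which the filter coefficients are estimated independently.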
In one embodiment, the first pre-processing comprises sequentially: pre-emphasis processing, framing processing, windowing processing, and end point detection, wherein the end point detection is used for determining an effective signal of the digital signal, and extracting the effective signal part to serve as a signal output after the first pre-processing.
The voice signal end point detection technology accurately determines a starting point and an end point of voice from a segment of signal containing voice, and distinguishes a voice signal (i.e. the effective signal) from a non-voice signal (including a silence segment and a noise segment).
Effective endpoint detection not only reduces the amount of data acquired by the voice recognition system and saves processing time, but also removes interference from silence segments or noise segments and improves the performance of the voice recognition system.
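The text does not specify how the endpoints are found. One common approach, shown here only as a sketch, compares the short-time energy of each frame with a threshold; the function name `detect_endpoints`, the threshold ratio and the 16 kHz test signal are assumptions.

```python
import numpy as np

def detect_endpoints(x, frame_len=512, hop=256, threshold_ratio=0.1):
    """Return (start, end) sample indices of the span whose frame energy
    exceeds threshold_ratio times the maximum frame energy, or None if the
    signal is shorter than one frame or entirely silent."""
    n_frames = 1 + (len(x) - frame_len) // hop
    if n_frames < 1:
        return None
    energy = np.array([np.sum(x[i * hop:i * hop + frame_len] ** 2)
                       for i in range(n_frames)])
    if energy.max() == 0:
        return None
    active = np.flatnonzero(energy > threshold_ratio * energy.max())
    start = active[0] * hop                  # first sample of first active frame
    end = active[-1] * hop + frame_len       # one past the last active frame
    return start, end

fs = 16000                                   # assumed sampling rate
t = np.arange(8192) / fs
x = np.concatenate([np.zeros(4096),          # leading silence
                    np.sin(2 * np.pi * 440 * t),
                    np.zeros(4096)])         # trailing silence
start, end = detect_endpoints(x)
speech = x[start:end]                        # effective signal portion
```

The detected span is accurate to within one frame, since a frame that only partly overlaps the active region can still exceed the threshold. A practical detector would usually add a noise-floor estimate and hangover logic, which are left out here.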
The beneficial effects of the above technical scheme are: the pre-emphasis can be performed by a first-order high-pass digital filter; because the voice signal is short-time stationary, framing and windowing divide it into a number of short segments, which makes the signal easier to process; endpoint detection determines the effective signal of the digital signal, and the effective signal portion is extracted as the signal output after the first preprocessing; the technical scheme ensures the reliability of signal processing and facilitates the subsequent steps.
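A first-order high-pass pre-emphasis filter is commonly written as y[n] = x[n] - a * x[n-1]. The sketch below uses a coefficient of 0.97, which is a conventional value and not one given in the text; the function name `pre_emphasis` is likewise hypothetical.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order high-pass FIR filter: y[n] = x[n] - alpha * x[n-1]."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                       # no previous sample for the first output
    y[1:] = x[1:] - alpha * x[:-1]
    return y

low = np.ones(8)                      # constant signal (0 Hz)
high = np.array([1.0, -1.0] * 4)      # alternates at the Nyquist frequency
low_out = pre_emphasis(low)           # steady-state samples shrink to 0.03
high_out = pre_emphasis(high)         # steady-state magnitude grows to 1.97
```

The two test signals show the high-pass behaviour directly: the constant input is attenuated to 1 - 0.97 = 0.03, while the fastest-alternating input is boosted to 1 + 0.97 = 1.97.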
In one embodiment, after the voice signal is acquired by the microphone array, a second preprocessing is performed before the voice signal is converted into a digital signal, the second preprocessing comprising denoising;
the denoising processing comprises the following steps:
calculating the similarity of adjacent voice signals in the voice signals, and judging whether noise exists according to the similarity;
when noise exists, acquiring characteristic parameters of the noise contained in the voice signal;
denoising the voice signal according to the characteristic parameters;
and storing the denoised voice signal.
The working principle of the technical scheme is as follows: the denoising processing first calculates the similarity of adjacent voice signals in the voice signal and judges from the similarity whether noise exists; when noise exists, it acquires the characteristic parameters of the noise contained in the voice signal, denoises the voice signal according to those characteristic parameters, and finally stores the denoised voice signal.
The beneficial effects of the above technical scheme are: the technical scheme ensures effective noise handling, which in turn helps ensure the accuracy of the signal processing of the invention.
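The text defines neither the similarity measure nor the decision rule. The following sketch covers only the detection step and assumes one possible choice: the Pearson correlation between the magnitude spectra of two adjacent frames, with low similarity taken as a sign of noise. The function names, the threshold of 0.5 and the digital test signals are all hypothetical, and the extraction of noise parameters is not shown.

```python
import numpy as np

def adjacent_similarity(x, frame_len=512):
    """Pearson correlation between the magnitude spectra of the first two
    non-overlapping frames of x."""
    a = np.abs(np.fft.rfft(x[:frame_len]))
    b = np.abs(np.fft.rfft(x[frame_len:2 * frame_len]))
    return np.corrcoef(a, b)[0, 1]

def looks_noisy(x, threshold=0.5):
    """Assumed decision rule: dissimilar adjacent frames indicate noise."""
    return adjacent_similarity(x) < threshold

fs = 16000                                       # assumed sampling rate
n = np.arange(1024)
tone = np.sin(2 * np.pi * 440 * n / fs)          # stationary, highly similar
noise = np.random.default_rng(0).standard_normal(1024)

tone_sim = adjacent_similarity(tone)
noise_sim = adjacent_similarity(noise)
```

A stationary tone produces nearly identical spectra in adjacent frames, so its similarity is close to one, whereas independent noise frames are almost uncorrelated. Real speech lies between these extremes, so the threshold would need tuning on actual recordings.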
In one embodiment, the second pre-processing further comprises speech enhancement processing comprising:
determining the position and direction of a voice source according to the position of the microphone and the strength of the voice signal;
enhancing the voice from the direction of the voice source while attenuating sound from other directions.
The working principle of the technical scheme is as follows: the technical scheme determines the position and direction of the voice source according to the positions of the microphones and the strength of the voice signal, and then, based on the determined position and direction, enhances the voice from the direction of the voice source while attenuating sound from other directions.
The beneficial effects of the above technical scheme are: the voice from the direction of the voice source is enhanced, which helps ensure the quality of the voice signal processing.
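Directional enhancement of this kind is often realized with a delay-and-sum beamformer, which is used here as a stand-in because the text does not name a specific technique. The sketch assumes a uniform linear array of four microphones with the 3.5 cm spacing mentioned for the array, a single frequency of 2 kHz, a sound speed of 343 m/s and a source direction that is already known; the function names are hypothetical.

```python
import numpy as np

def steering_vector(angle_deg, n_mics=4, spacing=0.035, freq=2000.0, c=343.0):
    """Relative phase at each microphone of a uniform linear array for a
    plane wave arriving from angle_deg (0 degrees is broadside)."""
    delays = np.arange(n_mics) * spacing * np.sin(np.deg2rad(angle_deg)) / c
    return np.exp(-2j * np.pi * freq * delays)

def beam_gain(look_deg, arrival_deg, n_mics=4):
    """Magnitude response of a delay-and-sum beamformer steered to look_deg
    for a plane wave arriving from arrival_deg."""
    weights = steering_vector(look_deg, n_mics) / n_mics
    return abs(np.vdot(weights, steering_vector(arrival_deg, n_mics)))

gain_on = beam_gain(look_deg=0.0, arrival_deg=0.0)    # source direction
gain_off = beam_gain(look_deg=0.0, arrival_deg=60.0)  # interfering direction
```

Sound from the look direction adds coherently and passes with unit gain, while sound from 60 degrees off axis adds with mismatched phases and is attenuated. The attenuation depends on frequency and array size, so a small array like this one gives little rejection at low frequencies.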
A system for use in any of the above methods, as shown in fig. 2, comprising:
the first preprocessing module is used for performing the first preprocessing;
a first transformation module, configured to transform the first preprocessed voice signal from a time domain to a frequency domain;
a first calculation module for calculating the covariance matrix of the input voice signal;
a second calculation module, configured to perform the step 2;
a recursion module for performing said step 3;
a third calculation module for performing the step 4;
a filter coefficient update module for performing the step 5;
a filtering module, configured to perform the filtering processing on the frequency domain signal in step 6;
a second transform module for converting the dereverberated signal from the frequency domain to the time domain.
The working principle of the technical scheme is as follows: the first preprocessing module performs the first preprocessing and transmits the preprocessed voice signal to the first transformation module. The first transformation module converts the voice signal from the time domain to the frequency domain and transmits the frequency domain signal to the first calculation module, the second calculation module and the recursion module. The first calculation module calculates the covariance matrix of the input voice signal and transmits it to the filter coefficient update module. The second calculation module executes step 2 and transmits the result to the third calculation module, which executes step 4 and transmits the result to the filter coefficient update module. The recursion module executes step 3 and transmits the result to the filter coefficient update module. The filter coefficient update module executes step 5 to obtain the new filter coefficients and transmits them to the filtering module. The filtering module filters the frequency domain signal according to the updated filter coefficients to obtain the dereverberated frequency domain signal, which the second transform module converts from the frequency domain to the time domain and sends to the voice recognition system.
The beneficial effects of the above technical scheme are: by regularizing the covariance matrix, the eigenvalue range of the matrix can be controlled and the matrix is kept from becoming ill-conditioned, which improves the stability of the algorithm and makes it less prone to divergence without affecting its dereverberation performance, so that correctly processed speech is obtained and the accuracy of voice recognition is ensured.
In one embodiment, as shown in FIG. 2, the system comprises:
the microphone array is used for acquiring a voice signal;
the input end of the second preprocessing module is connected with the output end of the microphone array;
and the input end of the audio coding and decoding chip is connected with the output end of the second preprocessing module, and the output end of the audio coding and decoding chip is connected with the first preprocessing module.
The working principle of the technical scheme is as follows: analog voice signals are acquired by the microphone array and transmitted to the second preprocessing module for the second preprocessing; the second preprocessing module transmits the preprocessed analog voice signals to the audio coding and decoding chip, which converts them into digital signals and transmits the digital signals to the first preprocessing module.
The beneficial effects of the above technical scheme are: the microphone array and the audio coding and decoding chip acquire the voice signals and convert them into digital signals, which facilitates subsequent processing, and the second preprocessing performed by the second preprocessing module ensures the reliability of signal transmission.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.