Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a comprehensive decision method for distinguishing the voice and non-voice segments of speech, thereby solving the problems in the prior art.
The purpose of the invention is realized by the following technical scheme: a comprehensive decision method for the voice and non-voice segments of speech, the method comprising:
performing framing processing on input voice data to obtain first framed voice data and second framed voice data;
the processing method of the first framed voice data comprises the following steps:
preprocessing the first framed voice data by performing time-frequency conversion and cepstrum coefficient extraction on each frame of voice data, inputting the preprocessed data into a voice recognition network, and judging the proportion of the speech segments containing voice within the whole speech segment;
when the voice signal proportion is larger than a preset value, carrying out voice noise reduction processing by combining a short-time autocorrelation method and a spectral subtraction method;
detecting voice endpoints by combining the short-time autocorrelation method and the energy-entropy ratio method, marking the speech segments containing voice in the detected voice data as voice and the other speech segments as non-voice, and finally outputting the voice data;
the processing method of the second framed voice data comprises the following steps:
performing voice noise reduction on the second framed voice data by combining the short-time autocorrelation method and the spectral subtraction method;
and detecting voice endpoints by combining the short-time autocorrelation method and the energy-entropy ratio method, marking the speech segments containing voice in the detected voice data as voice and the other speech segments as non-voice, and finally outputting the voice data.
The preprocessing of the first framed voice data, in which time-frequency conversion and cepstrum coefficient extraction are performed on each frame of voice data, comprises:
obtaining the time-frequency parameters F(f, t) of the first framed voice data through a short-time Fourier transform, where F(f, t) represents the relative energy value of the voice signal at time t and frequency f;
performing MFCC feature extraction on each frame of voice data to obtain an MFCC value, a first-order MFCC difference and a second-order MFCC difference of each frame of voice data;
carrying out pre-emphasis processing on the voice signal, windowing the pre-emphasized signal, and converting the windowed signal to the frequency domain to obtain the representation of the voice signal in the frequency domain;
calculating the energy spectrum of each frame of spectral line energy after passing through a Mel filter bank;
taking the logarithm of the energy passed through the Mel filter bank, performing a discrete cosine transform to obtain the MFCC features, and performing first-order difference processing on the MFCC features to obtain the first-order MFCC features;
and performing differential operation on the first-order MFCC features to obtain second-order MFCC features.
The voice noise reduction processing comprises:
for each frame of voice data x_n, performing short-time autocorrelation processing to obtain the autocorrelation value R_n of the current frame;
taking the per-frame autocorrelation values as a new autocorrelation sequence, and performing smooth filtering with a mean filter of set window length and window shift to obtain the filtered autocorrelation sequence R'_n;
taking the mean of the filtered autocorrelation sequence as the threshold η; frame segments whose autocorrelation value is less than or equal to the threshold η are taken as non-voice segments, and frame segments whose value is greater than η are taken as voice segments;
using the determined non-speech segments and speech segments as input, and denoising the original speech data x_n by spectral subtraction to obtain the denoised voice data x'_n.
Using the determined non-speech segments and speech segments as input and denoising the original speech data x_n by spectral subtraction to obtain the denoised voice data x'_n comprises:
performing a fast Fourier transform on each original frame of the speech signal x_n to obtain the transformed speech signal X_n(k), and obtaining from X_n(k) its amplitude |X_n(k)| and phase angle ∠X_n(k);
counting the frame number NIS of the non-voice segments to obtain the average power spectrum value D(k) of the non-voice segments;
calculating the average value Y_n(k) of the fast-Fourier-transformed speech signal X_n(k), and obtaining the spectrally subtracted amplitude |X̂_n(k)| through the spectral subtraction formula;
obtaining the denoised voice data x'_n from the spectrally subtracted amplitude |X̂_n(k)| and the phase angle ∠X_n(k) by an inverse fast Fourier transform.
The method for detecting voice endpoints by combining the short-time autocorrelation method with the energy-entropy ratio method comprises:
calculating the short-time energy to obtain the energy E_n of each frame signal x_n, and calculating the fast-Fourier-transformed value X'_n(k) of each denoised frame signal x'_n;
calculating the short-time energy E'_n of each denoised voice frame in the frequency domain and the energy spectrum S_n(k) of the k-th spectral line;
calculating the normalized spectral probability density function p_n(k) of each frequency component of each frame, the spectral entropy H_n of each frame, and the energy-entropy ratio Ef_n of each frame signal;
calculating the energy-entropy ratio Ef'_n of the non-speech segment signal, with each denoised frame signal x'_n replaced by the denoised non-speech frames, to obtain the energy-entropy ratio Ef'_n of the denoised non-speech frames;
setting decision thresholds T1 and T2, and calculating the intersection points N2, N3 of the energy-entropy ratio Ef_n of the speech signal with the threshold T2, the start and end points of the speech segment lying outside the time interval from N2 to N3;
searching leftward from the point N2 and rightward from the point N3 to find the intersection points N1, N4 of the energy-entropy ratio Ef_n with the threshold T1, where N1 is the starting point of the speech segment and N4 is its end point.
The voice recognition network comprises three convolutional layers, three pooling layers, and three fully connected layers. First convolutional layer: 3×3 convolution kernels, 32 kernels in total, with a kernel stride of 1; during convolution, boundary regions without sufficient input are zero-padded. First pooling layer: 2×2 max pooling, with zero-padding of insufficient boundary regions. Second convolutional layer: 3×3 kernels, 64 in total, with the remaining settings the same as the first convolutional layer. The second pooling layer is configured the same as the first pooling layer. Third convolutional layer: 3×3 kernels, 1024 in total, with the remaining settings the same as the first convolutional layer. The third pooling layer is configured the same as the first pooling layer. The outputs of the first and second fully connected layers are both 1024, and the output of the third fully connected layer is 2, representing the number of required classes. After each convolution, the convolved values are activated with the ReLU activation function. During training, the network parameters are updated with the Adam stochastic gradient descent method.
The first framed voice data comprises voice data processed with a frame length of 1 s and an inter-frame overlap of 0.7 s; the second framed voice data comprises voice data processed with a frame length of 0.025 s and an inter-frame overlap of 0.01 s.
The invention has the following advantages: the comprehensive decision method for the voice and non-voice segments of speech first uses a neural network to identify the speech segments containing voice, then locates the non-voice segments through autocorrelation-based spectral subtraction and performs noise reduction, and finally applies the energy-entropy ratio method to the denoised speech to decide speech segments more accurately, thereby improving the applicability of voice decision.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application provided below in connection with the appended drawings is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application. The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1 and fig. 2, on the basis of the prior art, the present invention provides a comprehensive decision method for the voice and non-voice segments of speech, addressing the situations in which the prior art cannot decide whether the initial segment of speech is voice, the background noise is complex, multiple signal-to-noise ratios coexist, and a neural network and model require a large amount of data for network training and model construction. The method specifically includes the following steps.
Step 1: the speech signal data (sampling rate fs Hz) obtained by AD conversion (analog-to-digital conversion) is used as the input voice data, and the input voice data is subjected to framing processing; two framing modes are provided. The first framing mode processes the voice data with a frame length of 1 s (fs samples) and an inter-frame overlap of 0.7 s (0.7 × fs samples). The second framing mode processes the voice data with a frame length of 0.025 s (0.025 × fs samples) and an inter-frame overlap of 0.01 s (0.01 × fs samples).
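For concreteness, the framing of step 1 can be sketched as follows (a minimal Python sketch; the helper name frame_signal, the 16 kHz sampling rate, and the random stand-in signal are assumptions, not part of the patent):

```python
import numpy as np

def frame_signal(x, fs, frame_s, overlap_s):
    """Split a 1-D signal into overlapping frames of frame_s seconds."""
    frame_len = int(round(frame_s * fs))
    hop = frame_len - int(round(overlap_s * fs))   # hop = frame length - overlap
    n_frames = (len(x) - frame_len) // hop + 1
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

fs = 16000                                         # assumed sampling rate fs
x = np.random.randn(10 * fs).astype(np.float32)    # stand-in for AD-converted speech
frames_a = frame_signal(x, fs, 1.0, 0.7)           # first mode: 1 s frames, 0.7 s overlap
frames_b = frame_signal(x, fs, 0.025, 0.01)        # second mode: 25 ms frames, 10 ms overlap
print(frames_a.shape, frames_b.shape)              # (n_frames, fs), (n_frames, 400)
```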
Step 2: the framed voice data with frame length fs and frame overlap 0.7 × fs is preprocessed, and time-frequency conversion and MFCC (Mel frequency cepstrum coefficient) extraction are performed on each frame of voice data.
Further, as shown in fig. 3, the method specifically includes:
Step 2.1: the time-frequency parameters F(f, t) of the data are obtained from the voice data through the STFT, representing the relative energy of the voice at time t and frequency f, where x_n is the speech signal and ω(n) is the Hamming window function. In formula (2-2), ω(n) is the value of the Hamming window at its n-th point, N is the window length, and the selected length is N = 256.

F(f, t) = STFT(x_n, ω(n))  (2-1)

ω(n) = 0.54 − 0.46 × cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1  (2-2)
Step 2.2: the MFCC features of each frame of voice data are acquired, namely the MFCC value, the first-order MFCC difference, and the second-order MFCC difference of each frame of voice data.
Step 2.2.1: pre-emphasis processing is applied to the speech signal, where x_n(i) is the i-th amplitude of the speech signal.

x'_n(i) = x_n(i) − 0.97 × x_n(i − 1)  (2-3)
Step 2.2.2: the pre-emphasized signal is windowed and the windowed signal is converted to the frequency domain; a Hamming window of length 256 is selected as the window function ω(l), and N is the total frame length. This yields the representation of the speech signal in the frequency domain, where X_n(k) is the amplitude spectrum of the speech signal at the k-th spectral line and P_n(k) is the power spectrum of the k-th spectral line.
Step 2.2.3: the energy spectrum of each frame of spectral line energy after passing through the Mel filter bank is calculated; the number of filters is 22. The Mel filter formula is as follows. In formula (2-6), M_i(k) denotes the i-th filter, k is the k-th spectral line input to the filter, and f(i) is the center frequency of the i-th filter. In formula (2-7), f_l is the lowest frequency of the filter, f_h is the highest frequency of the filter, and f_s is the sampling frequency.

M_i(k) = 0 for k < f(i − 1) or k > f(i + 1); M_i(k) = (k − f(i − 1)) / (f(i) − f(i − 1)) for f(i − 1) ≤ k ≤ f(i); M_i(k) = (f(i + 1) − k) / (f(i + 1) − f(i)) for f(i) < k ≤ f(i + 1)  (2-6)

f(i) = (N / f_s) × F_mel⁻¹(F_mel(f_l) + i × (F_mel(f_h) − F_mel(f_l)) / (M + 1)), where F_mel(f) = 2595 × lg(1 + f / 700) and M is the number of filters  (2-7)
Step 2.2.4: the logarithm of the energy passed through the Mel filter bank is taken, and a discrete cosine transform is then performed to obtain the MFCC features.
Step 2.2.5: first-order difference processing is performed on the MFCC features to obtain the first-order MFCC features. The difference operation formula is as follows, where d(j) denotes the j-th first-order difference, c(j + l) denotes the (j + l)-th cepstrum coefficient, and z denotes the difference frame interval.

d(j) = Σ_{l=1}^{z} l × (c(j + l) − c(j − l)) / (2 × Σ_{l=1}^{z} l²)  (2-9)
Step 2.2.6: the same difference operation as in formula (2-9) is performed on the first-order MFCC features to obtain the second-order MFCC features.
After preprocessing, the time-frequency parameter of the voice signal has size 129 × 64 and the MFCC feature parameter has size 99 × 64; the two parameter blocks are spliced by matrix concatenation as shown in fig. 4, and the final preprocessed output data has size 228 × 64.
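A sketch of this preprocessing, under stated assumptions: librosa is used for the STFT and MFCCs; n_fft = 256 yields the 129 frequency bins, 33 MFCCs with first- and second-order deltas yield the 99 feature rows, and hop_length = 253 is a guessed value that makes one 1 s frame at 16 kHz produce 64 time steps, matching the stated 228 × 64 output. The patent's 22 Mel filters would cap the number of distinct cepstral coefficients below 33, so this sketch keeps librosa's default filter count.

```python
import numpy as np
import librosa

def preprocess_frame(frame, fs=16000, n_fft=256, hop=253):
    # Time-frequency parameters F(f, t): STFT magnitude, shape (129, 64).
    tf = np.abs(librosa.stft(frame, n_fft=n_fft, hop_length=hop))
    # MFCCs plus first- and second-order differences: 33 x 3 = 99 rows.
    mfcc = librosa.feature.mfcc(y=frame, sr=fs, n_mfcc=33,
                                n_fft=n_fft, hop_length=hop)
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    # Matrix splicing along the feature axis, as in fig. 4.
    return np.vstack([tf, mfcc, d1, d2])            # expected shape: (228, 64)

frame = np.random.randn(16000).astype(np.float32)   # one 1 s frame at fs = 16 kHz
print(preprocess_frame(frame).shape)
```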
Step 3: as shown in fig. 5 and fig. 6, each frame of preprocessed voice data is used as the input of the CNN, and the label of each frame of input data is voice or non-voice. The network adopts three convolutional layers, three pooling layers, and three fully connected layers. First convolutional layer: 3×3 convolution kernels, 32 in total, with a stride of 1; during convolution, boundary regions without sufficient input are zero-padded. First pooling layer: 2×2 max pooling, with zero-padding of insufficient boundary regions. Second convolutional layer: 3×3 kernels, 64 in total, with the remaining settings the same as the first convolutional layer. The second pooling layer is configured the same as the first pooling layer. Third convolutional layer: 3×3 kernels, 1024 in total, with the remaining settings the same as the first convolutional layer. The third pooling layer is configured the same as the first pooling layer. The outputs of the first and second fully connected layers are both 1024, and the output of the third fully connected layer is 2, representing the number of required classes. After each convolution, the convolved values are activated with the ReLU activation function. During training, the network parameters are updated with the Adam stochastic gradient descent method.
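A PyTorch sketch of this network as described (three 3×3 convolutions with 32, 64, and 1024 kernels, stride 1, zero padding, each followed by ReLU and 2×2 max pooling, then fully connected layers of 1024, 1024, and 2, trained with Adam). The single input channel, ceil_mode pooling as a stand-in for zero-padding the insufficient boundary, and the learning rate are assumptions:

```python
import torch
import torch.nn as nn

class SpeechCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, ceil_mode=True),
            nn.Conv2d(32, 64, 3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, ceil_mode=True),
            nn.Conv2d(64, 1024, 3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, ceil_mode=True),
        )
        # Infer the flattened feature size from a dummy 228 x 64 input.
        with torch.no_grad():
            n_flat = self.features(torch.zeros(1, 1, 228, 64)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_flat, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 2),                # two classes: voice / non-voice
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SpeechCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed learning rate
logits = model(torch.randn(4, 1, 228, 64))                 # a batch of 4 frames
print(logits.shape)                                        # torch.Size([4, 2])
```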
Step 4: after the voice data is judged by the neural network, the proportion of the speech segments identified as voice within the whole speech segment is counted. The length of the whole speech segment counted each time is 10 s.
Step 5: when the proportion of voice signals in the speech segment is less than 5%, the speech segment is judged to be a non-voice segment and is marked as non-voice.
Step 6: voice noise reduction is performed by combining the short-time autocorrelation method and the spectral subtraction method.
As shown in fig. 7, step 6 further specifically includes the following:
Step 6.1: for each frame of speech data x_n, short-time autocorrelation processing is performed to obtain the autocorrelation value R_n of the current frame.

R_n(k) = Σ_{m=1}^{N−k} x_n(m) × x_n(m + k)  (6-1)

where k is the autocorrelation lag and N is the number of samples of the frame of voice data.
Step 6.2: the obtained per-frame autocorrelation values are taken as a new autocorrelation sequence R_n and smoothed; a mean filter with window length 10 and window shift 1 yields the filtered autocorrelation sequence R'_n.

R'_n = mean(R_n + … + R_{n+9}), 1 ≤ n ≤ K − 9  (6-2)

where n indexes the values of the autocorrelation sequence and K is the total number of values in the sequence.
Step 6.3: the mean of the filtered sequence, η = mean(R'_n), is taken as the threshold; frame segments whose autocorrelation value is less than or equal to the threshold η are regarded as non-speech segments, and frame segments whose value is greater than η are regarded as speech segments.
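Steps 6.1 to 6.3 can be sketched as follows. Two points are assumptions: the per-frame autocorrelation is summarized by its lag-0 value (the lag used in formula (6-1) is not fully specified above), and a centered mean filter stands in for the forward window of formula (6-2):

```python
import numpy as np

def autocorr_vad(frames):
    """frames: (n_frames, frame_len). Returns True for speech frames."""
    # R_n: one short-time autocorrelation value per frame (lag 0 assumed).
    R = np.array([np.dot(f, f) for f in frames])
    # R'_n: mean filter with window length 10 and window shift 1 (formula (6-2)).
    R_smooth = np.convolve(R, np.ones(10) / 10.0, mode="same")
    eta = R_smooth.mean()          # threshold: mean of the filtered sequence
    return R_smooth > eta          # > eta: speech; <= eta: non-speech

frames = np.random.randn(200, 400).astype(np.float32)
speech_mask = autocorr_vad(frames)
print(int(speech_mask.sum()), "speech frames of", len(frames))
```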
The non-speech segments g_n determined by autocorrelation are used as input, and the original speech data x_n are denoised by spectral subtraction to obtain the denoised voice data x'_n.
Step 6.4: a fast Fourier transform (FFT) is applied to each original frame of the speech signal x_n, where N is the total frame length and X_n(k) is the spectral value of the k-th spectral line of the n-th frame of the speech signal. The amplitude of X_n(k) is |X_n(k)| and its phase angle is ∠X_n(k).
Step 6.5: the number of frames in the non-voice section is NIS; FFT processing is applied to the non-voice section g_n to obtain d_n(k), the spectral value of the k-th spectral line of the n-th frame of the non-voice signal. The average power spectrum of the non-voice section is then D(k) = (1/NIS) × Σ_{n=1}^{NIS} |d_n(k)|².
Step 6.6: the average value Y_n(k) of the FFT-transformed X_n(k) is calculated.
Step 6.7: the spectrally subtracted amplitude |X̂_n(k)| is obtained through the spectral subtraction formula, |X̂_n(k)|² = |Y_n(k)|² − a × D(k) when |Y_n(k)|² ≥ a × D(k), and |X̂_n(k)|² = b × D(k) otherwise, where a = 4 is the over-subtraction factor and b = 0.001 is the gain compensation factor.
Step 6.8: from the spectrally subtracted amplitude |X̂_n(k)| and the original phase angle ∠X_n(k), the denoised voice x'_n is obtained by the inverse fast Fourier transform (IFFT).
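Steps 6.4 to 6.8 can be sketched as follows; a = 4 and b = 0.001 follow step 6.7. The power-domain form of the subtraction, the omission of the frame averaging of step 6.6, and the omission of overlap-add resynthesis across frames are simplifying assumptions:

```python
import numpy as np

def spectral_subtract(frames, speech_mask, a=4.0, b=0.001):
    X = np.fft.fft(frames, axis=1)             # X_n(k) for each frame
    mag, phase = np.abs(X), np.angle(X)        # |X_n(k)| and the phase angle
    # D(k): average power spectrum of the NIS non-speech frames (step 6.5).
    D = (mag[~speech_mask] ** 2).mean(axis=0)
    power = mag ** 2 - a * D                   # over-subtraction (step 6.7)
    power = np.where(power > 0, power, b * D)  # floor with gain compensation b
    X_hat = np.sqrt(power) * np.exp(1j * phase)
    return np.fft.ifft(X_hat, axis=1).real     # x'_n via the IFFT (step 6.8)

frames = np.random.randn(200, 400).astype(np.float32)
mask = np.zeros(200, dtype=bool)
mask[50:150] = True                            # pretend these frames are speech
print(spectral_subtract(frames, mask).shape)
```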
Step 7: voice endpoint detection uses the short-time autocorrelation method and the energy-entropy ratio method. The autocorrelation stage is processed in the same way as in the speech noise reduction stage (see formulas (6-1) and (6-2)), and the speech and non-speech segments of the denoised speech are obtained by the autocorrelation method. These speech and non-speech segments are then used as input, and the start and end positions of the speech and non-speech segments are determined by the energy-entropy ratio method.
As shown in fig. 8, further, step 7 specifically includes the following:
Step 7.1: the energy-entropy ratio is the ratio of each frame signal's energy to its spectral entropy. The short-time energy is calculated to obtain the energy E_n = Σ_{i=1}^{N} x_n(i)² of each frame signal x_n, where N is the number of sampling points of each frame signal.
The spectral entropy of the speech signal is obtained as follows.
Step 7.2: for each denoised frame signal x'_n, the FFT-transformed value X'_n(k) is calculated, where k denotes the k-th spectral line.
Step 7.3: the short-time energy E'_n of each denoised voice frame is calculated in the frequency domain, where N is the FFT length, only the positive-frequency part is taken, and X'_n*(k) is the conjugate of X'_n(k).

E'_n = Σ_{k=0}^{N/2} X'_n(k) × X'_n*(k)  (7-2)
Step 7.4: the energy spectrum of the k-th spectral line is calculated as S_n(k) = X'_n(k) × X'_n*(k).
Step 7.5: the normalized spectral probability density function of each frequency component of each frame is calculated as p_n(k) = S_n(k) / Σ_{l=0}^{N/2} S_n(l).
Step 7.6: the spectral entropy of each frame is calculated as H_n = −Σ_{k=0}^{N/2} p_n(k) × log p_n(k).
Step 7.7: the energy-entropy ratio Ef_n of each frame signal is calculated.
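Steps 7.1 to 7.7 can be sketched as follows. The expression sqrt(1 + |E'_n / H_n|) is a common form of the energy-entropy ratio and is assumed here, since the exact formula is not reproduced above:

```python
import numpy as np

def energy_entropy_ratio(frames, n_fft=512):
    X = np.fft.rfft(frames, n=n_fft, axis=1)        # positive-frequency part only
    S = (X * np.conj(X)).real                       # energy spectrum S_n(k)
    E = S.sum(axis=1)                               # short-time energy E'_n (7-2)
    p = S / (S.sum(axis=1, keepdims=True) + 1e-12)  # normalized density p_n(k)
    H = -(p * np.log(p + 1e-12)).sum(axis=1)        # spectral entropy H_n
    return np.sqrt(1.0 + np.abs(E / (H + 1e-12)))   # assumed form of Ef_n

frames = np.random.randn(200, 400).astype(np.float32)
print(energy_entropy_ratio(frames).shape)           # one ratio per frame
```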
Step 7.8: the energy-entropy ratio Ef'_n of the non-speech segment signal is calculated. The calculation process is the same as in formulas (7-2) to (7-7), with each denoised frame signal x'_n replaced by the denoised non-speech frames, yielding the energy-entropy ratio Ef'_n of the denoised non-speech frames.
Step 7.9: the decision thresholds T1 and T2 are set, where Me is the maximum energy-entropy ratio over the frame signals and δ is the adaptive parameter of the decision threshold.

Me = max(Ef_n)  (7-8)

δ = Me − mean(Ef'_n)  (7-9)

T1 = 0.05 × δ + mean(Ef'_n)  (7-10)

T2 = 0.1 × δ + mean(Ef'_n)  (7-11)
Step 7.10: initial decision against the threshold: the intersection points N2 and N3 of the energy-entropy ratio Ef_n of the speech signal with the threshold T2 are calculated; the start and end points of the speech segment lie outside the time interval from N2 to N3.
Step 7.11: searching leftward from the initial decision point N2 and rightward from the end point N3, the intersection points N1 and N4 of the energy-entropy ratio Ef_n with the threshold T1 are found; N1 is the starting point of the speech segment and N4 is its end point.
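Steps 7.9 to 7.11 can be sketched as follows; formulas (7-8) to (7-11) are as given above. Treating the first and last frames above T2 as N2 and N3, and returning a single speech run, are simplifying assumptions (a production version would iterate over all runs):

```python
import numpy as np

def detect_endpoints(Ef, Ef_ns):
    Me = Ef.max()                            # (7-8)
    delta = Me - Ef_ns.mean()                # (7-9)
    T1 = 0.05 * delta + Ef_ns.mean()         # (7-10)
    T2 = 0.10 * delta + Ef_ns.mean()         # (7-11)
    above = np.flatnonzero(Ef > T2)
    if above.size == 0:
        return None                          # no speech found
    N2, N3 = above[0], above[-1]             # crossings with T2 (step 7.10)
    N1, N4 = N2, N3
    while N1 > 0 and Ef[N1 - 1] > T1:        # search left for the T1 crossing
        N1 -= 1
    while N4 < len(Ef) - 1 and Ef[N4 + 1] > T1:  # search right for T1 crossing
        N4 += 1
    return N1, N2, N3, N4                    # N1/N4: speech start and end

Ef = np.concatenate([np.ones(50), 5 * np.ones(100), np.ones(50)])
print(detect_endpoints(Ef, Ef[:50]))         # Ef_ns: ratio over non-speech frames
```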
Step 8: the voice data after voice endpoint detection is marked: the speech segments containing voice are marked as voice, and the remaining segments are marked as non-voice.
Step 9: the voice data is output: the segments marked as voice are spliced in time order into a new whole speech segment and stored at the fs Hz sampling rate in the wav file format.
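Step 9 can be sketched as follows; ignoring the frame overlap when splicing and the 16-bit PCM scaling are simplifying assumptions:

```python
import numpy as np
from scipy.io import wavfile

fs = 16000                                              # assumed sampling rate fs
frames = np.random.randn(200, 400).astype(np.float32)   # stand-in frames
labels = np.zeros(200, dtype=bool)
labels[50:150] = True                                   # True = frame marked as voice

voice = frames[labels].reshape(-1)                      # splice voice frames in time order
pcm = np.clip(voice, -1.0, 1.0)                         # clamp before integer conversion
wavfile.write("voice_only.wav", fs, (pcm * 32767).astype(np.int16))
```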
For actual voice signals, in which non-voice signals occupy most of the time and the type and energy of the non-voice signals are complicated and changeable, the invention designs a convolutional neural network that recognizes the voice signal by combining its time-frequency parameters and cepstrum feature parameters. The input of the neural network is the normalized time-frequency parameter and cepstrum feature parameter information of the voice signal; the current voice signal is judged through three convolutional layers, three pooling layers, and three fully connected layers, so that whether the current speech segment contains voice information can be judged roughly and quickly. If the current speech segment contains a voice signal, subsequent voice endpoint detection is carried out; if it contains no voice signal, no further processing is performed, which increases the speed of voice decision and reduces the decision time.
The recognition result of the network is shown in fig. 9, and the neural network combining the time-frequency parameters and the cepstrum characteristic parameters of the speech signal can more accurately recognize the speech segment and non-speech segment signals in the speech signal.
For the situation in which, in practice, the initial frames of a speech segment requiring noise reduction are not non-voice frames, the invention designs a new voice denoising method combining the advantages of the short-time autocorrelation and spectral subtraction of speech. The method first calculates the short-time autocorrelation of each frame of the voice signal and screens out the non-voice segments with a threshold; the screened non-voice segments are then used as the noise segments in spectral subtraction to denoise the original voice signal. By adaptively determining the non-voice segments of the speech, the method removes the need to determine the non-voice segments manually, as existing methods require, and improves the intelligence and effect of voice denoising.
As shown in fig. 10, when the initial speech is a speech segment, the method adaptively determines the non-voice segments of the voice signal, so that the voice signal can be accurately denoised with an excellent denoising effect.
For the situation in which, in practice, the initial frames of a speech segment requiring endpoint detection are not non-voice frames, the invention designs a new voice endpoint detection method combining the advantages of the short-time autocorrelation and energy-entropy ratio methods. The method first calculates the short-time autocorrelation of each frame of the voice signal and roughly screens out the non-voice segments with a threshold. It then calculates the ratio of the energy of the screened non-voice segments to their spectral entropy and determines the thresholds for subsequent voice detection. Finally, the energy-entropy ratio of the voice signal is calculated frame by frame, and endpoint detection is performed against the determined thresholds. By adaptively determining the non-voice segments of the speech, the method overcomes the defect that existing methods require the non-voice segments to be determined manually, and improves the intelligence and accuracy of voice endpoint detection.
As shown in fig. 11, when the initial speech is a speech segment, the endpoint detection method provided by the present invention adaptively determines the non-speech segments of the speech signal and performs decision and endpoint detection on the speech signal with relatively accurate results.
The foregoing is illustrative of the preferred embodiments of this invention, and it is to be understood that the invention is not limited to the precise form disclosed herein; various other combinations, modifications, and environments may be resorted to within the scope of the concept disclosed herein, whether described above or apparent to those skilled in the relevant art. Modifications and variations effected by those skilled in the art that do not depart from the spirit and scope of the invention fall within the protection of the appended claims.