Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a comprehensive decision method for distinguishing the voice and non-voice segments of speech, thereby solving the problems in the prior art.
The purpose of the invention is realized by the following technical scheme: a comprehensive decision method for the voice and non-voice segments of speech, the method comprising:
performing framing processing on input voice data to obtain first framed voice data and second framed voice data;
the processing method of the first framed voice data comprises the following steps:
preprocessing the first framed voice data by performing time-frequency conversion and cepstrum coefficient extraction on each frame of voice data, inputting the preprocessed data into a voice recognition network, and judging the proportion of the speech segments containing voice within the whole speech segment;
when the voice signal proportion is larger than a preset value, carrying out voice noise reduction processing by combining a short-time autocorrelation method and a spectral subtraction method;
detecting voice endpoints by combining the short-time autocorrelation method and the energy-entropy ratio method, marking the speech segments containing voice in the detected voice data as voice and the other speech segments as non-voice, and finally outputting the voice data;
the processing method of the second framed voice data comprises the following steps:
performing voice noise reduction on the second framed voice data by combining the short-time autocorrelation method and the spectral subtraction method;
and detecting voice endpoints by combining the short-time autocorrelation method and the energy-entropy ratio method, marking the speech segments containing voice in the detected voice data as voice and the other speech segments as non-voice, and finally outputting the voice data.
The preprocessing of the first framed voice data, in which time-frequency conversion and cepstrum coefficient extraction are performed on each frame of voice data, comprises:
obtaining the time-frequency parameters F(f, t) of the first framed voice data through a short-time Fourier transform, where F(f, t) represents the relative energy value of the voice signal at time t and frequency f;
performing MFCC feature extraction on each frame of voice data to obtain an MFCC value, a first-order MFCC difference and a second-order MFCC difference of each frame of voice data;
carrying out pre-emphasis processing on the voice signal, windowing the pre-emphasized signal, and converting the windowed signal to the frequency domain to obtain the representation of the voice signal in the frequency domain;
calculating the energy spectrum of each frame of spectral line energy after passing through a Mel filter bank;
taking the logarithm of the energy passed through the Mel filter bank, performing a discrete cosine transform to obtain the MFCC features, and performing first-order difference processing on the MFCC features to obtain the first-order MFCC features;
and performing differential operation on the first-order MFCC features to obtain second-order MFCC features.
The voice noise reduction processing comprises:
for each frame of voice data x_n, performing short-time autocorrelation processing to obtain the autocorrelation value R_n of the current frame;
taking the per-frame autocorrelation values as a new autocorrelation sequence, and performing smooth filtering with a mean filter of set window length and window shift to obtain the filtered autocorrelation sequence R'_n;
taking the mean of the filtered autocorrelation sequence as the threshold η; frame segments whose autocorrelation value is less than or equal to the threshold η are taken as non-voice segments, and frame segments whose value is greater than η are taken as voice segments;
using the determined non-speech segments and speech segments as input, and denoising the original speech data x_n by spectral subtraction to obtain the denoised voice data x'_n.
Using the determined non-speech segments and speech segments as input and denoising the original speech data x_n by spectral subtraction to obtain the denoised voice data x'_n comprises:
performing a fast Fourier transform on each original frame of the speech signal x_n to obtain the transformed speech signal X_n(k), and obtaining from X_n(k) its amplitude |X_n(k)| and phase angle ∠X_n(k);
counting the frame number NIS of the non-voice segments to obtain the average power spectrum value D(k) of the non-voice segments;
calculating the average value Y_n(k) of the fast-Fourier-transformed speech signal X_n(k), and obtaining the spectrally subtracted amplitude |X̂_n(k)| through the spectral subtraction formula;
obtaining the denoised voice data x'_n from the spectrally subtracted amplitude |X̂_n(k)| and the phase angle ∠X_n(k) by an inverse fast Fourier transform.
The method for detecting voice endpoints by combining the short-time autocorrelation method with the energy-entropy ratio method comprises:
calculating the short-time energy to obtain the energy E_n of each frame signal x_n, and calculating the fast-Fourier-transformed value X'_n(k) of each denoised frame signal x'_n;
calculating the short-time energy E'_n of each denoised voice frame in the frequency domain and the energy spectrum S_n(k) of the k-th spectral line;
calculating the normalized spectral probability density function p_n(k) of each frequency component of each frame, the spectral entropy H_n of each frame, and the energy-entropy ratio Ef_n of each frame signal;
calculating the energy-entropy ratio Ef'_n of the non-speech segment signal, with each denoised frame signal x'_n replaced by the denoised non-speech frames, to obtain the energy-entropy ratio Ef'_n of the denoised non-speech frames;
setting decision thresholds T1 and T2, and calculating the intersection points N2, N3 of the energy-entropy ratio Ef_n of the speech signal with the threshold T2, the start and end points of the speech segment lying outside the time interval from N2 to N3;
searching leftward from the point N2 and rightward from the point N3 to find the intersection points N1, N4 of the energy-entropy ratio Ef_n with the threshold T1, where N1 is the starting point of the speech segment and N4 is its end point.
The voice recognition network comprises three convolutional layers, three pooling layers, and three fully connected layers. First convolutional layer: 3×3 convolution kernels, 32 kernels in total, with a kernel stride of 1; during convolution, boundary regions without sufficient input are zero-padded. First pooling layer: 2×2 max pooling, with zero-padding of insufficient boundary regions. Second convolutional layer: 3×3 kernels, 64 in total, with the remaining settings the same as the first convolutional layer. The second pooling layer is configured the same as the first pooling layer. Third convolutional layer: 3×3 kernels, 1024 in total, with the remaining settings the same as the first convolutional layer. The third pooling layer is configured the same as the first pooling layer. The outputs of the first and second fully connected layers are both 1024, and the output of the third fully connected layer is 2, representing the number of required classes. After each convolution, the convolved values are activated with the ReLU activation function. During training, the network parameters are updated with the Adam stochastic gradient descent method.
The first framed voice data comprises voice data processed with a frame length of 1 s and an inter-frame overlap of 0.7 s; the second framed voice data comprises voice data processed with a frame length of 0.025 s and an inter-frame overlap of 0.01 s.
The invention has the following advantages: the comprehensive decision method for the voice and non-voice segments of speech first uses a neural network to identify the speech segments containing voice, then locates the non-voice segments through autocorrelation-based spectral subtraction and performs noise reduction, and finally applies the energy-entropy ratio method to the denoised speech to decide speech segments more accurately, thereby improving the applicability of voice decision.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application provided below in connection with the appended drawings is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application. The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1 and fig. 2, on the basis of the prior art, the present invention provides a comprehensive decision method for the voice and non-voice segments of speech, addressing the situations in which the prior art cannot decide whether the initial segment of speech is voice, the background noise is complex, multiple signal-to-noise ratios coexist, and a neural network and model require a large amount of data for network training and model construction. The method specifically includes the following steps.
Step 1: the speech signal data (sampling rate fs Hz) obtained by AD conversion (analog-to-digital conversion) is used as the input voice data, and the input voice data is subjected to framing processing; two framing modes are provided. The first framing mode processes the voice data with a frame length of 1 s (fs samples) and an inter-frame overlap of 0.7 s (0.7 × fs samples). The second framing mode processes the voice data with a frame length of 0.025 s (0.025 × fs samples) and an inter-frame overlap of 0.01 s (0.01 × fs samples).
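For concreteness, the framing of step 1 can be sketched as follows (a minimal Python sketch; the helper name frame_signal, the 16 kHz sampling rate, and the random stand-in signal are assumptions, not part of the patent):

```python
import numpy as np

def frame_signal(x, fs, frame_s, overlap_s):
    """Split a 1-D signal into overlapping frames of frame_s seconds."""
    frame_len = int(round(frame_s * fs))
    hop = frame_len - int(round(overlap_s * fs))   # hop = frame length - overlap
    n_frames = (len(x) - frame_len) // hop + 1
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

fs = 16000                                         # assumed sampling rate fs
x = np.random.randn(10 * fs).astype(np.float32)    # stand-in for AD-converted speech
frames_a = frame_signal(x, fs, 1.0, 0.7)           # first mode: 1 s frames, 0.7 s overlap
frames_b = frame_signal(x, fs, 0.025, 0.01)        # second mode: 25 ms frames, 10 ms overlap
print(frames_a.shape, frames_b.shape)              # (n_frames, fs), (n_frames, 400)
```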
Step 2: the framed voice data with frame length fs and frame overlap 0.7 × fs is preprocessed, and time-frequency conversion and MFCC (Mel frequency cepstrum coefficient) extraction are performed on each frame of voice data.
Further, as shown in fig. 3, the method specifically includes:
Step 2.1: the time-frequency parameters F(f, t) of the data are obtained from the voice data through the STFT, representing the relative energy of the voice at time t and frequency f, where x_n is the speech signal and ω(n) is the Hamming window function. In formula (2-2), ω(n) is the value of the Hamming window at its n-th point, N is the window length, and the selected length is N = 256.

F(f, t) = STFT(x_n, ω(n))  (2-1)

ω(n) = 0.54 − 0.46 × cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1  (2-2)
Step 2.2: the MFCC features of each frame of voice data are acquired, namely the MFCC value, the first-order MFCC difference, and the second-order MFCC difference of each frame of voice data.
Step 2.2.1: pre-emphasis processing is applied to the speech signal, where x_n(i) is the i-th amplitude of the speech signal.

x'_n(i) = x_n(i) − 0.97 × x_n(i − 1)  (2-3)
Step 2.2.2: the pre-emphasized signal is windowed and the windowed signal is converted to the frequency domain; a Hamming window of length 256 is selected as the window function ω(l), and N is the total frame length. This yields the representation of the speech signal in the frequency domain, where X_n(k) is the amplitude spectrum of the speech signal at the k-th spectral line and P_n(k) is the power spectrum of the k-th spectral line.
Step 2.2.3: the energy spectrum of each frame of spectral line energy after passing through the Mel filter bank is calculated; the number of filters is 22. The Mel filter formula is as follows. In formula (2-6), M_i(k) denotes the i-th filter, k is the k-th spectral line input to the filter, and f(i) is the center frequency of the i-th filter. In formula (2-7), f_l is the lowest frequency of the filter, f_h is the highest frequency of the filter, and f_s is the sampling frequency.

M_i(k) = 0 for k < f(i − 1) or k > f(i + 1); M_i(k) = (k − f(i − 1)) / (f(i) − f(i − 1)) for f(i − 1) ≤ k ≤ f(i); M_i(k) = (f(i + 1) − k) / (f(i + 1) − f(i)) for f(i) < k ≤ f(i + 1)  (2-6)

f(i) = (N / f_s) × F_mel⁻¹(F_mel(f_l) + i × (F_mel(f_h) − F_mel(f_l)) / (M + 1)), where F_mel(f) = 2595 × lg(1 + f / 700) and M is the number of filters  (2-7)
Step 2.2.4: the logarithm of the energy passed through the Mel filter bank is taken, and a discrete cosine transform is then performed to obtain the MFCC features.
Step 2.2.5: first-order difference processing is performed on the MFCC features to obtain the first-order MFCC features. The difference operation formula is as follows, where d(j) denotes the j-th first-order difference, c(j + l) denotes the (j + l)-th cepstrum coefficient, and z denotes the difference frame interval.

d(j) = Σ_{l=1}^{z} l × (c(j + l) − c(j − l)) / (2 × Σ_{l=1}^{z} l²)  (2-9)
Step 2.2.6: the same difference operation as in formula (2-9) is performed on the first-order MFCC features to obtain the second-order MFCC features.
After preprocessing, the time-frequency parameter of the voice signal has size 129 × 64 and the MFCC feature parameter has size 99 × 64; the two parameter blocks are spliced by matrix concatenation as shown in fig. 4, and the final preprocessed output data has size 228 × 64.
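A sketch of this preprocessing, under stated assumptions: librosa is used for the STFT and MFCCs; n_fft = 256 yields the 129 frequency bins, 33 MFCCs with first- and second-order deltas yield the 99 feature rows, and hop_length = 253 is a guessed value that makes one 1 s frame at 16 kHz produce 64 time steps, matching the stated 228 × 64 output. The patent's 22 Mel filters would cap the number of distinct cepstral coefficients below 33, so this sketch keeps librosa's default filter count.

```python
import numpy as np
import librosa

def preprocess_frame(frame, fs=16000, n_fft=256, hop=253):
    # Time-frequency parameters F(f, t): STFT magnitude, shape (129, 64).
    tf = np.abs(librosa.stft(frame, n_fft=n_fft, hop_length=hop))
    # MFCCs plus first- and second-order differences: 33 x 3 = 99 rows.
    mfcc = librosa.feature.mfcc(y=frame, sr=fs, n_mfcc=33,
                                n_fft=n_fft, hop_length=hop)
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    # Matrix splicing along the feature axis, as in fig. 4.
    return np.vstack([tf, mfcc, d1, d2])            # expected shape: (228, 64)

frame = np.random.randn(16000).astype(np.float32)   # one 1 s frame at fs = 16 kHz
print(preprocess_frame(frame).shape)
```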
Step 3: as shown in fig. 5 and fig. 6, each frame of preprocessed voice data is used as the input of the CNN, and the label of each frame of input data is voice or non-voice. The network adopts three convolutional layers, three pooling layers, and three fully connected layers. First convolutional layer: 3×3 convolution kernels, 32 in total, with a stride of 1; during convolution, boundary regions without sufficient input are zero-padded. First pooling layer: 2×2 max pooling, with zero-padding of insufficient boundary regions. Second convolutional layer: 3×3 kernels, 64 in total, with the remaining settings the same as the first convolutional layer. The second pooling layer is configured the same as the first pooling layer. Third convolutional layer: 3×3 kernels, 1024 in total, with the remaining settings the same as the first convolutional layer. The third pooling layer is configured the same as the first pooling layer. The outputs of the first and second fully connected layers are both 1024, and the output of the third fully connected layer is 2, representing the number of required classes. After each convolution, the convolved values are activated with the ReLU activation function. During training, the network parameters are updated with the Adam stochastic gradient descent method.
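A PyTorch sketch of this network as described (three 3×3 convolutions with 32, 64, and 1024 kernels, stride 1, zero padding, each followed by ReLU and 2×2 max pooling, then fully connected layers of 1024, 1024, and 2, trained with Adam). The single input channel, ceil_mode pooling as a stand-in for zero-padding the insufficient boundary, and the learning rate are assumptions:

```python
import torch
import torch.nn as nn

class SpeechCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, ceil_mode=True),
            nn.Conv2d(32, 64, 3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, ceil_mode=True),
            nn.Conv2d(64, 1024, 3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, ceil_mode=True),
        )
        # Infer the flattened feature size from a dummy 228 x 64 input.
        with torch.no_grad():
            n_flat = self.features(torch.zeros(1, 1, 228, 64)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_flat, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 2),                # two classes: voice / non-voice
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SpeechCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed learning rate
logits = model(torch.randn(4, 1, 228, 64))                 # a batch of 4 frames
print(logits.shape)                                        # torch.Size([4, 2])
```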
Step 4: after the voice data is judged by the neural network, the proportion of the speech segments identified as voice within the whole speech segment is counted. The length of the whole speech segment counted each time is 10 s.
Step 5: when the proportion of voice signals in the speech segment is less than 5%, the speech segment is judged to be a non-voice segment and is marked as non-voice.
Step 6: voice noise reduction is performed by combining the short-time autocorrelation method and the spectral subtraction method.
As shown in fig. 7, step 6 further specifically includes the following:
Step 6.1: for each frame of speech data x_n, short-time autocorrelation processing is performed to obtain the autocorrelation value R_n of the current frame.

R_n(k) = Σ_{m=1}^{N−k} x_n(m) × x_n(m + k)  (6-1)

where k is the autocorrelation lag and N is the number of samples of the frame of voice data.
Step 6.2: the obtained per-frame autocorrelation values are taken as a new autocorrelation sequence R_n and smoothed; a mean filter with window length 10 and window shift 1 yields the filtered autocorrelation sequence R'_n.

R'_n = mean(R_n + … + R_{n+9}), 1 ≤ n ≤ K − 9  (6-2)

where n indexes the values of the autocorrelation sequence and K is the total number of values in the sequence.
Step 6.3: the mean of the filtered sequence, η = mean(R'_n), is taken as the threshold; frame segments whose autocorrelation value is less than or equal to the threshold η are regarded as non-speech segments, and frame segments whose value is greater than η are regarded as speech segments.
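Steps 6.1 to 6.3 can be sketched as follows. Two points are assumptions: the per-frame autocorrelation is summarized by its lag-0 value (the lag used in formula (6-1) is not fully specified above), and a centered mean filter stands in for the forward window of formula (6-2):

```python
import numpy as np

def autocorr_vad(frames):
    """frames: (n_frames, frame_len). Returns True for speech frames."""
    # R_n: one short-time autocorrelation value per frame (lag 0 assumed).
    R = np.array([np.dot(f, f) for f in frames])
    # R'_n: mean filter with window length 10 and window shift 1 (formula (6-2)).
    R_smooth = np.convolve(R, np.ones(10) / 10.0, mode="same")
    eta = R_smooth.mean()          # threshold: mean of the filtered sequence
    return R_smooth > eta          # > eta: speech; <= eta: non-speech

frames = np.random.randn(200, 400).astype(np.float32)
speech_mask = autocorr_vad(frames)
print(int(speech_mask.sum()), "speech frames of", len(frames))
```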
The non-speech segments g_n determined by autocorrelation are used as input, and the original speech data x_n are denoised by spectral subtraction to obtain the denoised voice data x'_n.
Step 6.4: a fast Fourier transform (FFT) is applied to each original frame of the speech signal x_n, where N is the total frame length and X_n(k) is the spectral value of the k-th spectral line of the n-th frame of the speech signal. The amplitude of X_n(k) is |X_n(k)| and its phase angle is ∠X_n(k).
Step 6.5: the number of frames in the non-voice section is NIS; FFT processing is applied to the non-voice section g_n to obtain d_n(k), the spectral value of the k-th spectral line of the n-th frame of the non-voice signal. The average power spectrum of the non-voice section is then D(k) = (1/NIS) × Σ_{n=1}^{NIS} |d_n(k)|².
Step 6.6: the average value Y_n(k) of the FFT-transformed X_n(k) is calculated.
Step 6.7: the spectrally subtracted amplitude |X̂_n(k)| is obtained through the spectral subtraction formula, |X̂_n(k)|² = |Y_n(k)|² − a × D(k) when |Y_n(k)|² ≥ a × D(k), and |X̂_n(k)|² = b × D(k) otherwise, where a = 4 is the over-subtraction factor and b = 0.001 is the gain compensation factor.
Step 6.8: from the spectrally subtracted amplitude |X̂_n(k)| and the original phase angle ∠X_n(k), the denoised voice x'_n is obtained by the inverse fast Fourier transform (IFFT).
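Steps 6.4 to 6.8 can be sketched as follows; a = 4 and b = 0.001 follow step 6.7. The power-domain form of the subtraction, the omission of the frame averaging of step 6.6, and the omission of overlap-add resynthesis across frames are simplifying assumptions:

```python
import numpy as np

def spectral_subtract(frames, speech_mask, a=4.0, b=0.001):
    X = np.fft.fft(frames, axis=1)             # X_n(k) for each frame
    mag, phase = np.abs(X), np.angle(X)        # |X_n(k)| and the phase angle
    # D(k): average power spectrum of the NIS non-speech frames (step 6.5).
    D = (mag[~speech_mask] ** 2).mean(axis=0)
    power = mag ** 2 - a * D                   # over-subtraction (step 6.7)
    power = np.where(power > 0, power, b * D)  # floor with gain compensation b
    X_hat = np.sqrt(power) * np.exp(1j * phase)
    return np.fft.ifft(X_hat, axis=1).real     # x'_n via the IFFT (step 6.8)

frames = np.random.randn(200, 400).astype(np.float32)
mask = np.zeros(200, dtype=bool)
mask[50:150] = True                            # pretend these frames are speech
print(spectral_subtract(frames, mask).shape)
```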
Step 7: voice endpoint detection uses the short-time autocorrelation method and the energy-entropy ratio method. The autocorrelation stage is processed in the same way as in the speech noise reduction stage (see formulas (6-1) and (6-2)), and the speech and non-speech segments of the denoised speech are obtained by the autocorrelation method. These speech and non-speech segments are then used as input, and the start and end positions of the speech and non-speech segments are determined by the energy-entropy ratio method.
As shown in fig. 8, further, step 7 specifically includes the following:
Step 7.1: the energy-entropy ratio is the ratio of each frame signal's energy to its spectral entropy. The short-time energy is calculated to obtain the energy E_n = Σ_{i=1}^{N} x_n(i)² of each frame signal x_n, where N is the number of sampling points of each frame signal.
The spectral entropy of the speech signal is obtained as follows.
Step 7.2: for each denoised frame signal x'_n, the FFT-transformed value X'_n(k) is calculated, where k denotes the k-th spectral line.
Step 7.3: the short-time energy E'_n of each denoised voice frame is calculated in the frequency domain, where N is the FFT length, only the positive-frequency part is taken, and X'_n*(k) is the conjugate of X'_n(k).

E'_n = Σ_{k=0}^{N/2} X'_n(k) × X'_n*(k)  (7-2)
Step 7.4: the energy spectrum of the k-th spectral line is calculated as S_n(k) = X'_n(k) × X'_n*(k).
Step 7.5: the normalized spectral probability density function of each frequency component of each frame is calculated as p_n(k) = S_n(k) / Σ_{l=0}^{N/2} S_n(l).
Step 7.6: the spectral entropy of each frame is calculated as H_n = −Σ_{k=0}^{N/2} p_n(k) × log p_n(k).
Step 7.7: the energy-entropy ratio Ef_n of each frame signal is calculated.
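Steps 7.1 to 7.7 can be sketched as follows. The expression sqrt(1 + |E'_n / H_n|) is a common form of the energy-entropy ratio and is assumed here, since the exact formula is not reproduced above:

```python
import numpy as np

def energy_entropy_ratio(frames, n_fft=512):
    X = np.fft.rfft(frames, n=n_fft, axis=1)        # positive-frequency part only
    S = (X * np.conj(X)).real                       # energy spectrum S_n(k)
    E = S.sum(axis=1)                               # short-time energy E'_n (7-2)
    p = S / (S.sum(axis=1, keepdims=True) + 1e-12)  # normalized density p_n(k)
    H = -(p * np.log(p + 1e-12)).sum(axis=1)        # spectral entropy H_n
    return np.sqrt(1.0 + np.abs(E / (H + 1e-12)))   # assumed form of Ef_n

frames = np.random.randn(200, 400).astype(np.float32)
print(energy_entropy_ratio(frames).shape)           # one ratio per frame
```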
Step 7.8: the energy-entropy ratio Ef'_n of the non-speech segment signal is calculated. The calculation process is the same as in formulas (7-2) to (7-7), with each denoised frame signal x'_n replaced by the denoised non-speech frames, yielding the energy-entropy ratio Ef'_n of the denoised non-speech frames.
Step 7.9: the decision thresholds T1 and T2 are set, where Me is the maximum energy-entropy ratio over the frame signals and δ is the adaptive parameter of the decision threshold.

Me = max(Ef_n)  (7-8)

δ = Me − mean(Ef'_n)  (7-9)

T1 = 0.05 × δ + mean(Ef'_n)  (7-10)

T2 = 0.1 × δ + mean(Ef'_n)  (7-11)
Step 7.10: initial decision against the threshold: the intersection points N2 and N3 of the energy-entropy ratio Ef_n of the speech signal with the threshold T2 are calculated; the start and end points of the speech segment lie outside the time interval from N2 to N3.
Step 7.11: searching leftward from the initial decision point N2 and rightward from the end point N3, the intersection points N1 and N4 of the energy-entropy ratio Ef_n with the threshold T1 are found; N1 is the starting point of the speech segment and N4 is its end point.
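Steps 7.9 to 7.11 can be sketched as follows; formulas (7-8) to (7-11) are as given above. Treating the first and last frames above T2 as N2 and N3, and returning a single speech run, are simplifying assumptions (a production version would iterate over all runs):

```python
import numpy as np

def detect_endpoints(Ef, Ef_ns):
    Me = Ef.max()                            # (7-8)
    delta = Me - Ef_ns.mean()                # (7-9)
    T1 = 0.05 * delta + Ef_ns.mean()         # (7-10)
    T2 = 0.10 * delta + Ef_ns.mean()         # (7-11)
    above = np.flatnonzero(Ef > T2)
    if above.size == 0:
        return None                          # no speech found
    N2, N3 = above[0], above[-1]             # crossings with T2 (step 7.10)
    N1, N4 = N2, N3
    while N1 > 0 and Ef[N1 - 1] > T1:        # search left for the T1 crossing
        N1 -= 1
    while N4 < len(Ef) - 1 and Ef[N4 + 1] > T1:  # search right for T1 crossing
        N4 += 1
    return N1, N2, N3, N4                    # N1/N4: speech start and end

Ef = np.concatenate([np.ones(50), 5 * np.ones(100), np.ones(50)])
print(detect_endpoints(Ef, Ef[:50]))         # Ef_ns: ratio over non-speech frames
```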
Step 8: the voice data after voice endpoint detection is marked: the speech segments containing voice are marked as voice, and the remaining segments are marked as non-voice.
Step 9: the voice data is output: the segments marked as voice are spliced in time order into a new whole speech segment and stored at the fs Hz sampling rate in the wav file format.
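Step 9 can be sketched as follows; ignoring the frame overlap when splicing and the 16-bit PCM scaling are simplifying assumptions:

```python
import numpy as np
from scipy.io import wavfile

fs = 16000                                              # assumed sampling rate fs
frames = np.random.randn(200, 400).astype(np.float32)   # stand-in frames
labels = np.zeros(200, dtype=bool)
labels[50:150] = True                                   # True = frame marked as voice

voice = frames[labels].reshape(-1)                      # splice voice frames in time order
pcm = np.clip(voice, -1.0, 1.0)                         # clamp before integer conversion
wavfile.write("voice_only.wav", fs, (pcm * 32767).astype(np.int16))
```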
For actual voice signals, in which non-voice signals occupy most of the time and the type and energy of the non-voice signals are complicated and changeable, the invention designs a convolutional neural network that recognizes the voice signal by combining its time-frequency parameters and cepstrum feature parameters. The input of the neural network is the normalized time-frequency parameter and cepstrum feature parameter information of the voice signal; the current voice signal is judged through three convolutional layers, three pooling layers, and three fully connected layers, so that whether the current speech segment contains voice information can be judged roughly and quickly. If the current speech segment contains a voice signal, subsequent voice endpoint detection is carried out; if it contains no voice signal, no further processing is performed, which increases the speed of voice decision and reduces the decision time.
The recognition result of the network is shown in fig. 9, and the neural network combining the time-frequency parameters and the cepstrum characteristic parameters of the speech signal can more accurately recognize the speech segment and non-speech segment signals in the speech signal.
For the situation in which, in practice, the initial frames of a speech segment requiring noise reduction are not non-voice frames, the invention designs a new voice denoising method combining the advantages of the short-time autocorrelation and spectral subtraction of speech. The method first calculates the short-time autocorrelation of each frame of the voice signal and screens out the non-voice segments with a threshold; the screened non-voice segments are then used as the noise segments in spectral subtraction to denoise the original voice signal. By adaptively determining the non-voice segments of the speech, the method removes the need to determine the non-voice segments manually, as existing methods require, and improves the intelligence and effect of voice denoising.
As shown in fig. 10, when the initial speech is a speech segment, the method adaptively determines the non-voice segments of the voice signal, so that the voice signal can be accurately denoised with an excellent denoising effect.
For the situation in which, in practice, the initial frames of a speech segment requiring endpoint detection are not non-voice frames, the invention designs a new voice endpoint detection method combining the advantages of the short-time autocorrelation and energy-entropy ratio methods. The method first calculates the short-time autocorrelation of each frame of the voice signal and roughly screens out the non-voice segments with a threshold. It then calculates the ratio of the energy of the screened non-voice segments to their spectral entropy and determines the thresholds for subsequent voice detection. Finally, the energy-entropy ratio of the voice signal is calculated frame by frame, and endpoint detection is performed against the determined thresholds. By adaptively determining the non-voice segments of the speech, the method overcomes the defect that existing methods require the non-voice segments to be determined manually, and improves the intelligence and accuracy of voice endpoint detection.
As shown in fig. 11, when the initial speech is a speech segment, the endpoint detection method provided by the present invention adaptively determines the non-speech segments of the speech signal and performs decision and endpoint detection on the speech signal with relatively accurate results.
The foregoing is illustrative of the preferred embodiments of this invention, and it is to be understood that the invention is not limited to the precise form disclosed herein; various other combinations, modifications, and environments may be resorted to within the scope of the concept disclosed herein, whether described above or apparent to those skilled in the relevant art. Modifications and variations effected by those skilled in the art that do not depart from the spirit and scope of the invention fall within the protection of the appended claims.