Finally, the correction factor $\mathrm{factor}_{x_m'(N)}$ is used to normalize the windowed data of each frame to obtain the preprocessed audio data: $x_m''(N) = \mathrm{factor}_{x_m'(N)} \cdot x_m'(N)$.

Further, the short-time autocorrelation value $R_m(k)$ of each frame of preprocessed audio data and its mean $\mathrm{mean}(R_m(k))$ are calculated as follows:

where $i$ denotes the $i$-th sampling point; $x_m''(i)$ denotes the $i$-th sampling point of the $m$-th frame of preprocessed audio data; and $x_m''(i+k)$ denotes the sampling point of the $m$-th frame of preprocessed audio data after a delay of $k$.
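The formulas themselves are not reproduced in this text. A standard definition that is consistent with the symbols explained above would be the following. The summation limits and the $1/N$ averaging are assumptions made for illustration and are not taken from the source:

```latex
R_m(k) = \sum_{i=0}^{N-1-k} x_m''(i)\, x_m''(i+k), \qquad k = 0, 1, \dots, N-1

\operatorname{mean}\bigl(R_m(k)\bigr) = \frac{1}{N} \sum_{k=0}^{N-1} R_m(k)
```

Under this definition, $R_m(0)$ is the energy of the frame and is the largest value in the sequence.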

Further, the standard deviation $\mathrm{std}(\mathrm{stat})_m$ of each frame of preprocessed audio data is calculated.

Further, zero-crossing rate detection is performed on each frame of audio data processed in step 2, which specifically includes the following:

Since the audio data is a wideband non-stationary signal, the short-time average zero-crossing rate is calculated as follows:

where $|\cdot|$ denotes the absolute value and $\mathrm{sgn}[\cdot]$ is the sign function:

$$
\mathrm{sgn}[R_m(k)] =
\begin{cases}
1, & R_m(k) > 0 \\
0, & R_m(k) = 0 \\
-1, & R_m(k) < 0
\end{cases}
$$

When two adjacent sampling points have the same sign, no zero crossing occurs; when they have opposite signs, $|\mathrm{sgn}[R_m(k)] - \mathrm{sgn}[R_m(k-1)]| = 2$. Therefore, for each frame of data, the sum is divided by $2N$ to obtain the average zero-crossing rate.
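The zero-crossing-rate formula itself did not survive in this text. From the description above (absolute differences of adjacent signs, summed and divided by $2N$), it can plausibly be reconstructed as follows. The symbol $Z_n$ is borrowed from the detailed description, and the summation limits are an assumption made for illustration:

```latex
Z_n = \frac{1}{2N} \sum_{k=1}^{N-1} \bigl|\, \mathrm{sgn}[R_m(k)] - \mathrm{sgn}[R_m(k-1)] \,\bigr|
```

Each sign change contributes 2 to the sum, so dividing by $2N$ yields the number of sign changes per point of the frame.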

Further, if there is no voice input for 3 seconds, the voice output is turned off.

Compared with the prior art, the beneficial effects of the invention are as follows: the invention uses autocorrelation to effectively distinguish voice from background noise; the use of multiple statistics and multiple decision thresholds effectively reduces the false-detection and missed-detection probabilities of the VAD; the algorithm is simple and reliable, with low computational complexity, good real-time performance, and high portability across multiple processing platforms; and zero-crossing rate detection resists single-tone interference, effectively preventing single tones and squeal from interfering with voice detection and improving the reliability of the decision.

The invention performs voice activation detection through short-time autocorrelation, standard deviation, and zero-crossing rate detection. This improves the listening comfort of short-wave voice communication and greatly reduces the probability of erroneous noise squelching and erroneous voice threshold crossing in practical applications. It also improves the applicability of the voice squelch function in short-wave communication.

Drawings

The invention will now be described in further detail with reference to the drawings and to specific examples.

FIG. 1 is a block flow diagram of an implementation of the present invention.

Detailed Description

Embodiments and effects of the present invention are described in further detail below with reference to the accompanying drawings.

Because of the limited computing power and power budget of the target equipment, the VAD algorithm cannot be too complex, and its processing delay (mainly the decision delay at speech onset and at speech end) cannot be too large; that is, the VAD algorithm must be capable of near-real-time processing. In addition, the method should work normally in complex background noise and have a degree of adaptivity. These factors require the VAD algorithm to be simple to implement and reliable in detection.

Based on the above application requirements and referring to FIG. 1, the short-wave communication voice activation detection method based on zero-crossing rate detection provided by the invention specifically comprises the following steps:

The specific framing and windowing processing is as follows: for each frame of band-pass-filtered audio data, a window function is used to extract a segment of sampling points from the middle of the frame as the windowed data, i.e.

$$x_m'(N) = x_m(N) \odot \mathrm{Hamming}(N)$$

where $x_m(N)$ is the $m$-th frame of band-pass-filtered audio data, $m = 1, 2, \dots, N$; $\mathrm{Hamming}(N)$ is a Hamming window function of length $N$; and $\odot$ denotes element-wise multiplication. Specifically, each frame of data has a length of $N = 256$ samples, and the window extracts the middle 200 of the 256 points, from point 28 to point 228, in order to reduce the frequency-domain interference introduced by framing.
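The windowing step can be illustrated with a minimal Python sketch. It assumes that the window is a standard symmetric Hamming window (coefficients 0.54 and 0.46) whose length equals the 200 extracted points, and that "point 28 to point 228" means indices 28 to 227 inclusive. The function names `hamming` and `window_frame` are hypothetical and do not come from the source.

```python
import math

FRAME_LEN = 256               # samples per frame, as stated in the text
WIN_START, WIN_END = 28, 228  # middle 200 points; exclusive end index is an assumption


def hamming(n):
    """Symmetric Hamming window of length n (standard 0.54/0.46 coefficients)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def window_frame(frame):
    """Keep the middle 200 samples of a 256-sample frame and apply the window."""
    if len(frame) != FRAME_LEN:
        raise ValueError("expected a 256-sample frame")
    segment = frame[WIN_START:WIN_END]
    win = hamming(len(segment))
    # Element-wise product, corresponding to the '.*' operator in the formula.
    return [s * w for s, w in zip(segment, win)]
```

Applying the window only to the extracted segment is one reading of the text; applying a 256-point window first and then extracting the middle points would be an equally plausible alternative.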

The specific normalization process is as follows:

Secondly, a preset empirical value $a$ (obtained through debugging in the actual environment) is compared with the mean value to obtain the correction factor: $\mathrm{factor}_{x_m'(N)} = a / \mathrm{mean}_{x_m'(N)}$;

Finally, the correction factor $\mathrm{factor}_{x_m'(N)}$ is used to normalize the windowed data of each frame to obtain the preprocessed audio data: $x_m''(N) = \mathrm{factor}_{x_m'(N)} \cdot x_m'(N)$.
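The normalization can be sketched in Python as follows. The text does not state how the mean is computed; this sketch assumes the mean of the absolute sample values, because the plain mean of band-pass-filtered audio is close to zero. The function name `normalize_frame` and the default value of `a` are hypothetical.

```python
def normalize_frame(windowed, a=1000.0):
    """Scale a windowed frame so that its mean absolute amplitude equals a.

    Assumption: 'mean' is the mean of absolute sample values. The empirical
    value a is tuned in the field according to the text; 1000.0 is a placeholder.
    """
    mean_abs = sum(abs(v) for v in windowed) / len(windowed)
    if mean_abs == 0.0:
        # All-zero frame: nothing to scale (handling chosen for this sketch).
        return list(windowed)
    factor = a / mean_abs  # correction factor = a / mean
    return [factor * v for v in windowed]
```

After this step every frame has the same mean absolute level, so the thresholds used in the later decisions do not depend on the input gain.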

Step 2: calculate the short-time autocorrelation values of each frame of preprocessed audio data and their mean; determine whether the autocorrelation value at each point of the frame is greater than 3 times the mean of that frame's autocorrelation values; if so, set the autocorrelation value at that point to 0; otherwise, go to step 3;

The correlation function measures the similarity of two signals in the time domain. When the correlation function of two signals is large, one signal may be a delayed or advanced copy of the other; when the correlation function is 0, the two signals are completely uncorrelated. Noise is suppressed by exploiting this correlation property of signals.

The autocorrelation function reflects the degree of similarity between a signal and a delayed copy of itself.

The short-time autocorrelation value $R_m(k)$ of each frame of preprocessed audio data and its mean $\mathrm{mean}(R_m(k))$ are calculated as follows:

where $i$ denotes the $i$-th sampling point; $x_m''(i)$ denotes the $i$-th sampling point of the $m$-th frame of preprocessed audio data; and $x_m''(i+k)$ denotes the sampling point of the $m$-th frame of preprocessed audio data after a delay of $k$.
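Step 2 can be illustrated with the following Python sketch. The exact formula is not reproduced in this text, so the sketch assumes the usual unnormalized short-time autocorrelation over lags 0 to N-1 and an arithmetic mean over all lags. The function names `short_time_autocorr` and `suppress_outliers` are hypothetical.

```python
def short_time_autocorr(x):
    """R(k) = sum over i of x[i] * x[i + k], for lags k = 0 .. N-1 (assumed form)."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(n)]


def suppress_outliers(r, ratio=3.0):
    """Step 2: set to 0 every autocorrelation value greater than ratio * mean(r)."""
    mean_r = sum(r) / len(r)
    return [0.0 if v > ratio * mean_r else v for v in r]
```

For a 200-point frame the direct double loop costs about 20,000 multiplications, which fits the low-complexity goal stated earlier. Since R(0) is the frame energy and the largest value in the sequence, it is the first candidate to be zeroed by the 3x rule.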

The standard deviation $\mathrm{std}(\mathrm{stat})_m$ of each frame of preprocessed audio data is calculated.

In practice, the amplitude of $\mathrm{std}(\mathrm{stat})_m$ needs to be adjusted slightly to prevent data overflow in subsequent calculations.
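The standard-deviation formula is not reproduced in this text. The sketch below assumes the population standard deviation of the per-frame statistic, which the later discussion of the "variance of the autocorrelation" suggests is the sequence of autocorrelation values. The function name `frame_std` and the `scale` parameter, which stands in for the amplitude adjustment mentioned above, are hypothetical.

```python
import math


def frame_std(values, scale=1.0):
    """Population standard deviation of one frame's statistic, times scale.

    scale is a placeholder for the small amplitude adjustment that the text
    says is needed to avoid overflow in later fixed-point calculations.
    """
    mean_v = sum(values) / len(values)
    variance = sum((v - mean_v) ** 2 for v in values) / len(values)
    return scale * math.sqrt(variance)
```

A frame dominated by noise is expected to give a small value, while a voiced frame gives a large one, which is what the first-level decision tests.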

The zero-crossing rate $Z_n$ describes how often the signal crosses the horizontal axis. For a continuous signal, it is observed as the speech time-domain waveform crossing the horizontal axis; for a discrete signal, a zero crossing occurs when adjacent sample values have different algebraic signs, so the rate is the number of times the samples change sign.

where $|\cdot|$ denotes the absolute value and $\mathrm{sgn}[\cdot]$ is the sign function:

$$
\mathrm{sgn}[R_m(k)] =
\begin{cases}
1, & R_m(k) > 0 \\
0, & R_m(k) = 0 \\
-1, & R_m(k) < 0
\end{cases}
$$
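The zero-crossing rate of the autocorrelation sequence can be sketched in Python as follows. The division by 2N follows the summary of the invention; the summation range (k from 1 to N-1) is an assumption, and the function names `sgn` and `zero_crossing_rate` are hypothetical.

```python
def sgn(v):
    """Sign function as defined above: 1, 0 or -1."""
    if v > 0:
        return 1
    if v < 0:
        return -1
    return 0


def zero_crossing_rate(r):
    """Average zero-crossing rate of one frame's autocorrelation values.

    Each full sign change contributes 2 to the sum, hence the division by 2N.
    """
    n = len(r)
    total = sum(abs(sgn(r[k]) - sgn(r[k - 1])) for k in range(1, n))
    return total / (2.0 * n)
```

A sequence that never changes sign gives 0, and one that alternates at every point approaches 1.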

Step 5: determine whether the standard deviation $\mathrm{std}(\mathrm{stat})_m$ of the preprocessed audio data is not less than a preset first-level threshold for M consecutive frames; if so, determine that there is voice input and go to step 6; otherwise, determine that there is no voice input; where $M \geq 3$.

$\mathrm{std}(\mathrm{stat})_m$ is compared with the preset first-level threshold. Several consecutive frames are processed (the number of consecutive frames is chosen according to the computational load, preferably between 10 and 30 frames); the $\mathrm{std}(\mathrm{stat})_m$ calculated for each frame is compared with the preset first-level threshold, and the number of frames among the selected consecutive frames whose value is not less than the first-level threshold is counted and recorded as peak_count.

When at least 3 such frames occur within the consecutive frames, the first-level decision is passed; otherwise it is not passed. That is, the first-level decision is passed when peak_count ≥ 3 and is not passed otherwise. If the first-level decision is not passed, the method returns to step 1 and continues voice detection.
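The first-level decision described above can be sketched as a simple count over a block of frames. This follows the peak_count description, in which qualifying frames are counted within the block rather than required to be adjacent. The function name `first_level_decision` is hypothetical, and the threshold value is left to the caller because the source gives none.

```python
def first_level_decision(std_values, threshold, min_count=3):
    """Count frames whose standard deviation reaches the first-level threshold.

    std_values: per-frame standard deviations of one block of consecutive
    frames (the text suggests a block of roughly 10 to 30 frames).
    Returns (passed, peak_count).
    """
    peak_count = sum(1 for s in std_values if s >= threshold)
    return peak_count >= min_count, peak_count
```

For example, `first_level_decision([1, 5, 6, 2, 7], 5)` returns `(True, 3)`, so processing would continue with the second-level decision.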

Step 6: determine whether the average zero-crossing rate of the preprocessed audio data is not less than a preset second-level threshold for S consecutive frames; if so, determine that the input is voice; otherwise, determine that the input is noise. Voice activation detection is thus completed, where $S \geq 5$.

The second-level decision is made only after the first-level decision has been passed. In the second-level decision, if the zero-crossing rate over consecutive frames (5 consecutive frames in the actual use of this scheme) is not less than the preset second-level threshold, voice is present at that moment; otherwise, the input is judged to be noise. When voice is detected, the relevant flag is set, the data is stored in the corresponding buffer, and the voice data is output when the 20 ms timer interrupt occurs.
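The second-level decision can be sketched as a search for a run of consecutive frames whose zero-crossing rate reaches the threshold. The run length of 5 comes from the text; the function name `second_level_decision` is hypothetical, and the threshold is left to the caller.

```python
def second_level_decision(zcr_values, threshold, run_length=5):
    """Return True if run_length consecutive frames have ZCR >= threshold."""
    run = 0
    for z in zcr_values:
        # Extend the current run, or reset it when a frame falls below threshold.
        run = run + 1 if z >= threshold else 0
        if run >= run_length:
            return True
    return False
```

Requiring an unbroken run, rather than a count as in the first-level decision, makes this stage stricter, which matches its role of rejecting noise that slipped through the first stage.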

Taking voice pauses and duration into account, the voice output is turned off when the decision requirement has not been met for about 3 s and the standard deviation has remained below the preset threshold throughout those 3 s.

After voice detection is complete, the voice data is output after adaptive filtering, while non-voice data is not processed; the purpose is to enhance the voice quality and improve listening comfort.
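The 3 s hold-off before the voice output is closed can be sketched as a frame counter. This sketch assumes that one decision is made per 20 ms frame, so that 3 s corresponds to 150 frames. The class name `SquelchGate` and its interface are hypothetical and are not taken from the source.

```python
FRAME_MS = 20    # frame and timer period given in the text
HOLD_MS = 3000   # about 3 s without voice closes the output


class SquelchGate:
    """Keeps the voice output open until no voice has been seen for about 3 s."""

    def __init__(self, hold_frames=HOLD_MS // FRAME_MS):
        self.hold_frames = hold_frames
        self.silent_frames = 0
        self.open = False

    def update(self, is_voice):
        """Feed one per-frame VAD decision; return True while output is open."""
        if is_voice:
            self.open = True
            self.silent_frames = 0
        elif self.open:
            self.silent_frames += 1
            if self.silent_frames >= self.hold_frames:
                self.open = False
        return self.open
```

The hold-off keeps short pauses between words from chopping the output, at the cost of passing up to 3 s of noise after the speaker stops.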

The invention exploits the fact that, because noise is random, the mean of its autocorrelation values is small and their standard deviation is also small. In contrast, the mean autocorrelation value of a speech signal is large, its standard deviation is also large, and the variance of the autocorrelation changes considerably between different frames of the speech signal. Therefore, the variance of the autocorrelation and the corresponding statistics are used to determine whether speech is present, that is, to perform VAD.

Typically, the voice sampling frequency is 9.6 kHz, the data frame length is 20 ms (a voice signal is generally considered to be substantially stationary over 10 ms to 30 ms), and 256 points are processed each time. To prevent noise from being misjudged as speech, a second-level decision is added. The method is extensible: on the basis of this algorithm, double or even multiple thresholds with upper and lower bounds can be used to improve detection accuracy, at the cost of a moderate increase in implementation complexity. The invention mainly concerns the digital processing of speech signals and assumes that the corresponding preprocessing, such as low-pass filtering and gain amplification, has been performed before VAD processing.

The invention uses autocorrelation to effectively distinguish voice from background noise; the use of multiple statistics and multiple decision thresholds effectively reduces the false-detection and missed-detection probabilities of the VAD; the algorithm is simple and reliable, with low computational complexity, good real-time performance, and high portability across multiple processing platforms; and zero-crossing rate detection resists single-tone interference, effectively preventing single tones and squeal from interfering with voice detection and improving the reliability of the decision.

Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions, where the foregoing program may be stored in a computer readable storage medium, and when executed, the program performs steps including the above method embodiments; and the aforementioned storage medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.

The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. The short-wave communication voice activation detection method based on zero-crossing rate detection is characterized by comprising the following steps of:

2. The short-wave communication voice activation detection method based on zero-crossing rate detection according to claim 1, wherein the framing and windowing process is as follows: for each frame of band-pass-filtered audio data, a window function is used to extract a segment of sampling points from the middle of the frame as the windowed data, i.e.

$$x_m'(N) = x_m(N) \odot \mathrm{Hamming}(N)$$

where $x_m(N)$ is the $m$-th frame of band-pass-filtered audio data, $m = 1, 2, \dots, N$; $\mathrm{Hamming}(N)$ is a Hamming window function of length $N$; and $\odot$ denotes element-wise multiplication.

3. The short wave communication voice activation detection method based on zero crossing rate detection according to claim 1, wherein the normalization process specifically comprises:

Secondly, a preset empirical value $a$ is compared with the mean value to obtain the correction factor: $\mathrm{factor}_{x_m'(N)} = a / \mathrm{mean}_{x_m'(N)}$;

4. The short-wave communication voice activation detection method based on zero-crossing rate detection according to claim 1, wherein the short-time autocorrelation value $R_m(k)$ of each frame of preprocessed audio data and its mean $\mathrm{mean}(R_m(k))$ are calculated as follows:

5. The short-wave communication voice activation detection method based on zero-crossing rate detection according to claim 1, wherein the standard deviation of each frame of preprocessed audio data is:

6. The short-wave communication voice activation detection method based on zero-crossing rate detection according to claim 1, wherein the zero-crossing rate detection performed on each frame of audio data processed in step 2 specifically comprises the following:

where $|\cdot|$ denotes the absolute value; $R_m(k)$ is the short-time autocorrelation value of the $m$-th frame of preprocessed audio data; and $\mathrm{sgn}[\cdot]$ is the sign function:

$$
\mathrm{sgn}[R_m(k)] =
\begin{cases}
1, & R_m(k) > 0 \\
0, & R_m(k) = 0 \\
-1, & R_m(k) < 0
\end{cases}
$$

when two adjacent sampling points have the same sign, no zero crossing occurs; when they have opposite signs, $|\mathrm{sgn}[R_m(k)] - \mathrm{sgn}[R_m(k-1)]| = 2$.

7. The method for detecting voice activation of short-wave communication based on zero-crossing rate detection according to claim 1, wherein if there is no voice input for 3 seconds continuously, the voice output is turned off.