Short-wave communication voice activation detection method based on zero-crossing rate detection
Technical Field
The invention belongs to the technical field of short-wave communication, and particularly relates to a short-wave communication voice activation detection method based on zero crossing rate detection.
Background
Voice activation detection (VAD, also known as voice activity detection), also called endpoint detection (EPD), aims to correctly distinguish speech from various kinds of background noise, and it has important applications in the field of voice signal processing, especially in the field of acoustic signal processing. In speech recognition, the speech and non-speech segments of a signal are generally separated by an endpoint detection algorithm, and the speech segments are then recognized from specific features of speech. Studies have shown that, even in a quiet environment, more than half of the recognition errors of a speech recognition system originate in the endpoint detector. As the first step of a speech recognition system, endpoint detection is therefore critical, especially in strong background noise, where its accuracy largely determines whether subsequent processing can be carried out effectively. The diversity of speech and background noise complicates the VAD problem.
Essentially, all VAD techniques rest on finding statistics that effectively distinguish speech segments from a speech-free noise background, and on reducing the decision to a threshold comparison. The conventional statistical features in use today mainly include short-time energy, short-time zero-crossing rate, short-time autocorrelation function, information entropy, cepstrum and Mel coefficients, and most VAD techniques are built on different combinations of these. With the development of digital signal processing and the growth in computing capacity of the processing equipment, newer VAD algorithms have appeared, such as wavelet transform methods, approximate entropy, support vector machines (SVM) and neural networks.
In general, the detection effect of a single statistical judgment is not ideal, and is often suitable for certain specific occasions. Because the background noise in different environments has larger change, and the voice changes along with the changes of the gender, age, language, tone, sound intensity, speech speed and the like of a speaker, the joint judgment criterion based on multiple statistics and multiple judgment thresholds becomes the direction of VAD detection research.
In a short-wave radio station, voice signal detection is the precondition for squelch. Squelch is one of the basic functions of a radio: when a voice signal is present, the receiver's audio output is turned on and normal communication is maintained; when no voice signal is present and only noise exists, the audio output is turned off. The basic process is to detect first whether a speech signal is present and then to control the audio output accordingly. In small portable military radio equipment, where power consumption is limited, VAD is also used to reduce power consumption during voice-free periods and to extend the operating life of the equipment.
Because the computing power and power budget of the equipment are limited, the VAD algorithm cannot be too complex, and the processing delay (mainly the delay in deciding that speech has started or ended) cannot be too large; that is, the VAD algorithm must be capable of near-real-time processing. In addition, it should work normally in complex background noise and be adaptive to some degree. These factors require the VAD algorithm to be simple to implement and reliable in detection. It is therefore necessary to find a voice detection method that is relatively simple to compute and relatively reliable in its results.
The short-wave voice detection methods currently in use include: (1) a method combining short-time energy and short-time average amplitude, which relies on the variation of the speech signal amplitude over time: unvoiced segments have small amplitude with energy concentrated in the high-frequency band, whereas voiced segments have larger amplitude with energy concentrated in the low-frequency band; and (2) a detection method based on pre-emphasis and comparison of the standard deviation with a preset threshold.
In the existing methods, one major problem of the short-time energy function En is that it is too sensitive to the signal level; in practical applications (for example, on fixed-point devices) it overflows easily because the sum of squares of the signal samples must be calculated. En is therefore generally replaced by the average amplitude function Mn. With Mn, however, the difference between unvoiced and voiced sound, and between sound and silence, is less pronounced than with the short-time energy En. As a result, in practice speech often fails to open the squelch and noise often fails to be muted, and the result depends strongly on the environment: the same voice activation detection method needs its parameters readjusted whenever the environment changes. Furthermore, when the signal contains unwanted components such as noise and single tones, whose energy is concentrated in the low-frequency band, an energy-based decision can misjudge them as speech.
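The overflow risk and the reduced contrast of Mn can be illustrated with a small sketch. The function names, the 256-sample frame, the 16-bit sample values and the 32-bit accumulator limit are assumptions chosen for illustration, and windowing is omitted:

```python
def short_time_energy(frame):
    """E_n: sum of squared samples in one frame (window omitted)."""
    return sum(s * s for s in frame)


def short_time_magnitude(frame):
    """M_n: sum of absolute sample values in one frame (window omitted)."""
    return sum(abs(s) for s in frame)


# Largest value a signed 32-bit accumulator can hold (assumed fixed-point width).
INT32_MAX = 2**31 - 1

# Hypothetical frames of 256 16-bit samples: full scale and about 10% of full scale.
loud = [32767] * 256
quiet = [3277] * 256

energy_loud = short_time_energy(loud)
magnitude_loud = short_time_magnitude(loud)

# Contrast between the loud and quiet frame under each measure.
energy_ratio = energy_loud / short_time_energy(quiet)
magnitude_ratio = magnitude_loud / short_time_magnitude(quiet)

print(energy_loud > INT32_MAX)      # sum of squares no longer fits in 32 bits
print(magnitude_loud <= INT32_MAX)  # sum of magnitudes still fits
print(energy_ratio, magnitude_ratio)
```

In this sketch the energy of the full-scale frame exceeds the 32-bit range while the magnitude does not, and the loud-to-quiet contrast is roughly squared for the energy compared with the magnitude, which matches the trade-off described above.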
Disclosure of Invention
In order to solve the problems, the invention aims to provide a short-wave communication voice activation detection method based on zero-crossing rate detection.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
A short wave communication voice activation detection method based on zero crossing rate detection comprises the following steps:
Step 1, acquiring an audio data stream, namely N frames of audio data x(N), each frame having a length of N, and sequentially performing band-pass filtering, framing and windowing, and normalization on the stream to obtain the corresponding N frames of preprocessed audio data x''(N);
Step 2, calculating the short-time autocorrelation value of each frame of preprocessed audio data and its mean, and judging whether the autocorrelation value at each point of the frame is more than 3 times the mean autocorrelation value of that frame; if so, setting the autocorrelation value at that point to 0; otherwise, proceeding to step 3;
Step 3, calculating the standard deviation $\mathrm{std}(\mathrm{stat})_m$ of each frame of preprocessed audio data;
Step 4, performing zero-crossing rate detection on each frame of audio data processed in step 2 to obtain the average zero-crossing rate of each frame;
Step 5, judging whether the standard deviation $\mathrm{std}(\mathrm{stat})_m$ of M consecutive frames of preprocessed audio data is not smaller than a preset first-level threshold; if so, judging that voice data is input and proceeding to step 6; otherwise, judging that no voice is input;
Step 6, judging whether the average zero-crossing rate of S consecutive frames of preprocessed audio data is not smaller than a preset second-level threshold; if so, judging that the input is voice; otherwise, judging that the input is noise; voice activation detection is thereby completed.
Further, the framing and windowing process is as follows: for each frame of band-pass-filtered audio data, a window function is used to intercept a section of sampling points in the middle of the frame as the windowed data, i.e.
$$x_m'(N) = x_m(N) \mathbin{.*} \mathrm{Hamming}(N)$$
where $x_m(N)$ is the m-th frame of band-pass-filtered audio data, m = 1, 2, …, N; $\mathrm{Hamming}(N)$ is a Hamming window function of length N; and .* denotes element-wise multiplication.
Further, the normalization process specifically includes:
first, calculating the average value $\operatorname{mean}(x_m'(N))$ of the N sample points of each frame $x_m'(N)$ of the windowed data $x'(N)$;
secondly, taking the ratio of a preset empirical value a to the average value to obtain a correction factor:
$$\mathrm{factor}_{x_m'(N)} = \frac{a}{\operatorname{mean}(x_m'(N))}$$
finally, normalizing the windowed data of each frame with the correction factor $\mathrm{factor}_{x_m'(N)}$ to obtain the preprocessed audio data:
$$x_m''(N) = \mathrm{factor}_{x_m'(N)} \cdot x_m'(N)$$
Further, the short-time autocorrelation value $R_m(k)$ of each frame of preprocessed audio data and its mean $\operatorname{mean}(R_m(k))$ are calculated, wherein i denotes the i-th sampling point; $x_m''(i)$ denotes the i-th sample point of the m-th frame of preprocessed audio data; and $x_m''(i+k)$ denotes the sample point of the m-th frame of preprocessed audio data delayed by k;
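A conventional form of these two quantities, consistent with the symbols defined above, can be written as follows. The summation limits and the absence of a 1/N normalization in $R_m(k)$ are assumptions, not taken from the text:

```latex
R_m(k) = \sum_{i=0}^{N-1-k} x_m''(i)\, x_m''(i+k), \qquad k = 0, 1, \dots, N-1

\operatorname{mean}\bigl(R_m(k)\bigr) = \frac{1}{N} \sum_{k=0}^{N-1} R_m(k)
```

Under this form, $R_m(0)$ is the sum of squared samples of the frame, so it is never smaller in magnitude than $R_m(k)$ at any other lag.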
Further, the standard deviation $\mathrm{std}(\mathrm{stat})_m$ of each frame of preprocessed audio data is calculated.
Further, zero-crossing rate detection is performed on each frame of audio data processed in step 2, specifically as follows:
since the audio data is a wideband non-stationary signal, a short-time average zero-crossing rate is calculated, wherein $|\cdot|$ denotes the absolute value and sgn is the sign function:
$$\operatorname{sgn}[R_m(k)] = \begin{cases} 1, & R_m(k) > 0 \\ 0, & R_m(k) = 0 \\ -1, & R_m(k) < 0 \end{cases}$$
When the signs of two adjacent sampling points are the same, no zero crossing occurs; when the signs of two adjacent sampling points are opposite, $|\operatorname{sgn}[R_m(k)] - \operatorname{sgn}[R_m(k-1)]| = 2$, so for each frame of data the sum is divided by 2N to give the average zero-crossing rate.
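The description above fixes the summand and the divisor, so the short-time average zero-crossing rate can be written out as below. The symbol $Z_m$ and the summation limits are assumptions introduced here for illustration:

```latex
Z_m = \frac{1}{2N} \sum_{k=1}^{N-1} \Bigl| \operatorname{sgn}[R_m(k)] - \operatorname{sgn}[R_m(k-1)] \Bigr|
```

For example, a sequence whose sign alternates at every point contributes 2 at each of the N-1 adjacent pairs, giving $Z_m = (N-1)/N$, while a sequence of constant sign gives $Z_m = 0$.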
Further, if there is no voice input for 3 seconds, the voice output is turned off.
Compared with the prior art, the invention has the following beneficial effects: autocorrelation is used to distinguish voice from background noise effectively; the use of multiple statistics and multiple decision thresholds effectively reduces the false-detection and missed-detection probabilities of the VAD; the algorithm is simple and reliable, with low computational complexity, good real-time performance and high portability across multiple processing platforms; and zero-crossing rate detection resists single-tone interference, effectively preventing single tones and howling from disturbing voice detection and improving the reliability of the decision.
The invention detects voice activation through short-time autocorrelation, standard deviation and zero-crossing rate detection, improves the listening comfort of short-wave voice communication, and greatly reduces the probability that noise fails to be muted or that speech fails to open the squelch in practical use, which improves the applicability of the short-wave voice squelch function.
Drawings
The invention will now be described in further detail with reference to the drawings and to specific examples.
FIG. 1 is a block flow diagram of an implementation of the present invention.
Detailed Description
Embodiments and effects of the present invention are described in further detail below with reference to the accompanying drawings.
Because the computing power and power budget of the equipment are limited, the VAD algorithm cannot be too complex, and the processing delay (mainly the delay in deciding that speech has started or ended) cannot be too large; that is, the VAD algorithm must be capable of near-real-time processing. In addition, it should work normally in complex background noise and be adaptive to some degree. These factors require the VAD algorithm to be simple to implement and reliable in detection.
Based on the application requirements, referring to fig. 1, the short-wave communication voice activation detection method based on zero-crossing rate detection provided by the invention specifically comprises the following steps:
Step 1, acquiring an audio data stream, namely N frames of audio data x(N), each frame having a length of N, and sequentially performing band-pass filtering, framing and windowing, and normalization on the stream to obtain the corresponding N frames of preprocessed audio data x''(N);
The specific framing and windowing processing is as follows: for each frame of band-pass-filtered audio data, a window function is used to intercept a section of sampling points in the middle of the frame as the windowed data, i.e.
$$x_m'(N) = x_m(N) \mathbin{.*} \mathrm{Hamming}(N)$$
where $x_m(N)$ is the m-th frame of band-pass-filtered audio data, m = 1, 2, …, N; $\mathrm{Hamming}(N)$ is a Hamming window function of length N; and .* denotes element-wise multiplication. Specifically, the length of each frame is N = 256 samples, and the middle 200 of the 256 points, from point 28 to point 228, are intercepted in order to reduce the frequency-domain interference introduced by framing.
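The windowing step can be sketched as below. The function names are hypothetical, the Hamming window uses the standard 0.54/0.46 cosine form, and the order of operations (multiply the whole 256-point frame by the window, then keep points 28 to 227) is an assumption about how the interception is applied:

```python
import math


def hamming(n):
    """Hamming window of length n, standard 0.54 - 0.46*cos form."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def window_frame(frame, start=28, stop=228):
    """Multiply a frame element-wise by a Hamming window, then keep
    the middle section frame[start:stop] (200 points for a 256-point frame)."""
    w = hamming(len(frame))
    windowed = [s * c for s, c in zip(frame, w)]
    return windowed[start:stop]


# Hypothetical constant frame, so the output simply shows the window shape.
frame = [1.0] * 256
kept = window_frame(frame)

print(len(kept))             # number of retained samples
print(kept[0], max(kept))    # edge of the retained section and its peak
```

Dropping the first and last 28 points discards the part of the frame where the window weight is smallest, so the retained section is dominated by the centre of the frame.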
The specific normalization process is as follows:
first, calculating the average value $\operatorname{mean}(x_m'(N))$ of the N sample points of each frame $x_m'(N)$ of the windowed data $x'(N)$;
secondly, taking the ratio of a preset empirical value a (an empirical value obtained while debugging in the actual environment) to the average value to obtain a correction factor:
$$\mathrm{factor}_{x_m'(N)} = \frac{a}{\operatorname{mean}(x_m'(N))}$$
finally, normalizing the windowed data of each frame with the correction factor $\mathrm{factor}_{x_m'(N)}$ to obtain the preprocessed audio data:
$$x_m''(N) = \mathrm{factor}_{x_m'(N)} \cdot x_m'(N)$$
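A minimal sketch of this normalization follows. The text only says "average value"; taking the mean of the absolute sample values is an assumption made here so that a zero-mean audio frame does not produce a near-zero divisor. The function name, the `eps` guard and the sample values are likewise hypothetical:

```python
def normalize_frame(windowed, a, eps=1e-12):
    """Scale a windowed frame by factor = a / mean, where the mean is
    taken over absolute sample values (an assumption, see lead-in)."""
    mean_abs = sum(abs(s) for s in windowed) / len(windowed)
    if mean_abs < eps:
        # Silent frame: no meaningful factor, leave the data unchanged.
        return list(windowed)
    factor = a / mean_abs
    return [factor * s for s in windowed]


# Hypothetical frame whose mean absolute value is 2.0.
frame = [0.5, -1.5, 2.0, -4.0]
out = normalize_frame(frame, a=1.0)

print(out)
print(sum(abs(s) for s in out) / len(out))  # mean level after scaling
```

After scaling, the mean absolute level of every non-silent frame equals the empirical value a, so the later thresholds see frames at a comparable level regardless of input gain.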
Step 2, calculating a short-time autocorrelation value and an average value of the audio data after each frame pretreatment, judging whether the correlation value of each point of the audio data after each frame pretreatment is more than 3 times of the average value of the correlation values of the audio data after the frame pretreatment, if so, setting the correlation value of the point to 0, otherwise, turning to step 3;
The correlation function measures the similarity of two signals in the time domain: when the correlation of two signals is large, one signal may be a delayed or advanced copy of the other; when the correlation is 0, the two signals are uncorrelated. This correlation property is exploited to eliminate noise.
The autocorrelation function reflects the degree of similarity of the signal to the signal itself after a delay.
The short-time autocorrelation value $R_m(k)$ of each frame of preprocessed audio data and its mean $\operatorname{mean}(R_m(k))$ are calculated, wherein i denotes the i-th sampling point; $x_m''(i)$ denotes the i-th sample point of the m-th frame of preprocessed audio data; and $x_m''(i+k)$ denotes the sample point of the m-th frame of preprocessed audio data delayed by k.
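Step 2 can be sketched as follows. The unnormalized sum over the overlapping samples is a conventional short-time autocorrelation and is an assumption here, as are the function names and the toy frames; the factor of 3 is the one stated in step 2:

```python
def short_time_autocorr(frame):
    """R(k) = sum_i frame[i] * frame[i + k] over the overlapping samples,
    for k = 0 .. len(frame) - 1 (assumed unnormalized form)."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k)) for k in range(n)]


def suppress_peaks(r, ratio=3.0):
    """Set to 0 every autocorrelation value larger than ratio * mean(r)."""
    mean_r = sum(r) / len(r)
    return [0.0 if v > ratio * mean_r else v for v in r]


# Hypothetical flat frame: no lag stands out, so nothing is suppressed.
flat = [1.0, 1.0, 1.0, 1.0]
r_flat = short_time_autocorr(flat)
clean_flat = suppress_peaks(r_flat)

# Hypothetical impulse frame: only lag 0 is non-zero, so it exceeds
# 3x the mean and is set to 0.
impulse = [1.0] + [0.0] * 7
r_impulse = short_time_autocorr(impulse)
clean_impulse = suppress_peaks(r_impulse)

print(r_flat, clean_flat)
print(r_impulse, clean_impulse)
```

The direct double loop costs on the order of N squared multiplications per frame, which is modest for 200 to 256 points; the comparison is applied literally as written in step 2 and does not treat a negative mean specially.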
Step 3, calculating the standard deviation $\mathrm{std}(\mathrm{stat})_m$ of each frame of preprocessed audio data.
In practice, the amplitude of $\mathrm{std}(\mathrm{stat})_m$ needs to be scaled slightly to prevent the data from overflowing in subsequent calculations.
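A sketch of the per-frame statistic is given below. It assumes that `stat` is the sequence of values left after step 2 and that a population standard deviation is intended; neither is stated explicitly. The `scale` parameter stands in for the amplitude adjustment mentioned above, and its value is hypothetical:

```python
import statistics


def frame_std(stat, scale=1.0):
    """Population standard deviation of one frame's statistic,
    multiplied by an assumed scaling constant to limit its range."""
    return scale * statistics.pstdev(stat)


# Hypothetical sequences: small random-like values versus large swings.
noise_like = [0.1, -0.1, 0.1, -0.1]
speech_like = [4.0, -3.0, 2.0, -1.0]

std_noise = frame_std(noise_like)
std_speech = frame_std(speech_like)
std_scaled = frame_std(speech_like, scale=0.5)

print(std_noise, std_speech, std_scaled)
```

The larger spread of the second sequence is what the first-level threshold in step 5 is meant to pick up.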
Step 4, detecting zero crossing rate of each frame of audio data processed in the step 2 to obtain average zero crossing rate corresponding to each frame;
The zero-crossing rate $Z_n$ describes how often the signal crosses the horizontal axis. For a continuous signal, a zero crossing occurs where the time-domain waveform crosses the horizontal axis; for a discrete signal, a zero crossing occurs when adjacent sample values have different algebraic signs, so the zero-crossing rate is the number of times the samples change sign.
Since the audio data is a wideband non-stationary signal, a short-time average zero-crossing rate is calculated, wherein $|\cdot|$ denotes the absolute value and sgn is the sign function:
$$\operatorname{sgn}[R_m(k)] = \begin{cases} 1, & R_m(k) > 0 \\ 0, & R_m(k) = 0 \\ -1, & R_m(k) < 0 \end{cases}$$
When the signs of two adjacent sampling points are the same, no zero crossing occurs; when the signs of two adjacent sampling points are opposite, $|\operatorname{sgn}[R_m(k)] - \operatorname{sgn}[R_m(k-1)]| = 2$, so for each frame of data the sum is divided by 2N to give the average zero-crossing rate.
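The zero-crossing computation described above can be sketched directly from the sign function and the division by 2N. The function names and the example sequences are hypothetical:

```python
def sgn(v):
    """Sign function: 1 for v > 0, 0 for v == 0, -1 for v < 0."""
    return (v > 0) - (v < 0)


def average_zcr(r):
    """Sum |sgn(r[k]) - sgn(r[k-1])| over adjacent points, divided by 2N."""
    n = len(r)
    total = sum(abs(sgn(r[k]) - sgn(r[k - 1])) for k in range(1, n))
    return total / (2 * n)


# Hypothetical sequences of length 8.
slow = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]      # 3 sign changes
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # 7 sign changes
constant = [2.0] * 8                                       # no sign change

print(average_zcr(slow))         # 0.375
print(average_zcr(alternating))  # 0.875
print(average_zcr(constant))     # 0.0
```

Each sign change adds 2 to the sum, so dividing by 2N turns the sum into crossings per sample.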
Step 5, judging whether the standard deviation $\mathrm{std}(\mathrm{stat})_m$ of M consecutive frames of preprocessed audio data is not smaller than a preset first-level threshold; if so, judging that voice data is input and proceeding to step 6; otherwise, judging that no voice is input; wherein M is greater than or equal to 3.
$\mathrm{std}(\mathrm{stat})_m$ is compared with the preset first-level threshold. Over a run of consecutive frames (the number of consecutive frames is generally chosen between 10 and 30 according to the computational load), the $\mathrm{std}(\mathrm{stat})_m$ calculated for each frame is compared with the preset first-level threshold, the number of frames in the selected run that are not smaller than the first-level threshold is counted, and this value is recorded as peak_count.
When at least 3 such frames appear in the run, that is, when peak_count is greater than or equal to 3, the first decision is passed; otherwise it is not passed. When the first decision is not passed, the method returns to step 1 and continues voice detection.
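The first-level decision can be sketched as a count over a block of frames. The name `peak_count` and the limit of 3 come from the description; the function name, the threshold value and the sample data are hypothetical:

```python
def first_level_pass(std_values, threshold, min_hits=3):
    """Count the frames whose standard deviation is not smaller than the
    first-level threshold and pass when the count reaches min_hits."""
    peak_count = sum(1 for s in std_values if s >= threshold)
    return peak_count >= min_hits, peak_count


# Hypothetical block of 10 consecutive frames.
block = [0.2, 0.9, 1.1, 0.3, 1.4, 0.1, 0.8, 0.2, 0.1, 0.3]
passed, peak_count = first_level_pass(block, threshold=0.8)

# Hypothetical quiet block: only one frame reaches the threshold.
quiet_passed, quiet_count = first_level_pass([0.1, 0.9, 0.2, 0.1], threshold=0.8)

print(passed, peak_count)
print(quiet_passed, quiet_count)
```

This follows the counting form described here, in which the qualifying frames need not be adjacent within the block.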
Step 6, judging whether the average zero crossing rate of the audio data after the continuous S frame pretreatment is not less than a preset secondary threshold, if yes, judging that the input is voice, otherwise, judging that the input is noise; to this end, voice activation detection is completed. Wherein S is more than or equal to 5.
The second decision is made only after the first decision has passed. In the second decision, if the zero-crossing rate over multiple consecutive frames (5 consecutive frames are used in practice in this scheme) is not smaller than the preset second-level threshold, the input is judged to be voice; otherwise it is judged to be noise. When voice is detected, the relevant flag is set, the data is stored in the corresponding buffer, and the voice data is output when the 20 ms timer interrupt occurs.
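The second-level decision can be sketched as a search for a run of S consecutive frames at or above the threshold. The run length of 5 is the value given above; the function name, the threshold and the sample values are hypothetical:

```python
def second_level_is_voice(zcr_values, threshold, s=5):
    """Return True when s consecutive frames have an average
    zero-crossing rate not smaller than the second-level threshold."""
    run = 0
    for z in zcr_values:
        run = run + 1 if z >= threshold else 0
        if run >= s:
            return True
    return False


# Hypothetical per-frame zero-crossing rates.
steady = [0.3] * 5
broken = [0.3, 0.3, 0.1, 0.3, 0.3, 0.3, 0.3]   # longest run is 4
recovered = broken + [0.3]                      # run reaches 5

print(second_level_is_voice(steady, threshold=0.2))
print(second_level_is_voice(broken, threshold=0.2))
print(second_level_is_voice(recovered, threshold=0.2))
```

A single frame below the threshold resets the run, which is what keeps short bursts from being classified as voice.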
Taking the pauses between utterances and their duration into account, the voice output is closed when the decision requirements have not been met for about 3 s and the standard deviation throughout those 3 s is smaller than the preset threshold.
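The 3 s hold can be sketched as a frame counter. The class name and its interface are hypothetical, and the 20 ms frame period is an assumption taken from the timer interval mentioned in the description; the per-frame `voice_detected` flag stands for the combined result of the two decisions:

```python
class SquelchGate:
    """Keeps the audio output open until no voice has been detected
    for hold_seconds (counted in frames of frame_seconds each)."""

    def __init__(self, hold_seconds=3.0, frame_seconds=0.02):
        self.hold_frames = round(hold_seconds / frame_seconds)
        self.silent_frames = 0
        self.is_open = False

    def update(self, voice_detected):
        """Feed one frame decision and return whether output is open."""
        if voice_detected:
            self.is_open = True
            self.silent_frames = 0
        elif self.is_open:
            self.silent_frames += 1
            if self.silent_frames >= self.hold_frames:
                self.is_open = False
        return self.is_open


gate = SquelchGate()
opened = gate.update(True)                              # voice opens the output
during_pause = [gate.update(False) for _ in range(149)]  # still within 3 s
after_hold = gate.update(False)                          # 150th silent frame

print(gate.hold_frames, opened, all(during_pause), after_hold)
```

Holding the output open across short pauses avoids chopping the gaps between words, at the cost of passing up to 3 s of noise after speech ends.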
After voice detection is finished, the voice data is output after adaptive filtering and the non-voice data is not processed, the purpose being to enhance the voice quality and improve listening comfort.
The invention exploits the randomness of noise: the mean of its autocorrelation values is small, and so is their standard deviation. By contrast, the autocorrelation values of a speech signal are large on average, their standard deviation is also large, and the variance of the autocorrelation changes considerably between frames of the speech signal. The presence or absence of speech is therefore determined from the variance of the autocorrelation and the corresponding statistics, and VAD detection is performed accordingly.
Typically, the voice sampling frequency is 9.6 kHz, the data frame length is 20 ms (a voice signal is generally considered approximately stationary over 10 ms to 30 ms), and 256 points are processed at a time. To prevent noise from being misjudged as speech, the secondary decision is added. The method is extensible: on the basis of this algorithm, double or even multiple thresholds with upper and lower bounds can be adopted to improve detection accuracy, at the cost of a moderate increase in implementation complexity. The invention mainly concerns the digital processing of the speech signal and assumes that the corresponding pre-processing, such as low-pass filtering and gain amplification, has been performed before VAD processing.
The invention uses autocorrelation to distinguish voice from background noise effectively; the use of multiple statistics and multiple decision thresholds effectively reduces the false-detection and missed-detection probabilities of the VAD; the algorithm is simple and reliable, with low computational complexity, good real-time performance and high portability across multiple processing platforms; and zero-crossing rate detection resists single-tone interference, effectively preventing single tones and howling from disturbing voice detection and improving the reliability of the decision.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions, where the foregoing program may be stored in a computer readable storage medium, and when executed, the program performs steps including the above method embodiments; and the aforementioned storage medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.