CN109754823A - Voice activity detection method and mobile terminal - Google Patents


Info

Publication number
CN109754823A
Authority
CN
China
Prior art keywords
voice
current frame
noise
frequency domain
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910143186.9A
Other languages
Chinese (zh)
Inventor
王少华
申厚拯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd
Priority to CN201910143186.9A
Publication of CN109754823A
Legal status: Pending (Current)

Abstract

Translated from Chinese

The invention provides a voice activity detection method and a mobile terminal, relating to the technical field of audio signal processing. The method includes: acquiring feature values of a current frame of target audio data in a plurality of preset frequency-domain subbands; determining a speech posterior probability of the current frame by a preset classification method based on the feature values; and confirming that the current frame is a speech frame when the speech posterior probability is greater than or equal to a preset speech threshold probability. The accuracy and practicability of voice activity detection can thus be improved.

Description

Translated from Chinese
Voice activity detection method and mobile terminal

Technical Field

The present invention relates to the technical field of audio signal processing, and in particular to a voice activity detection method and a mobile terminal.

Background Art

At present, the rapid development of the mobile Internet has driven the widespread adoption of mobile terminals such as mobile phones, tablet computers, and wearable devices. As one of the most convenient and natural ways of human-computer interaction on mobile terminals, voice input is gradually being accepted by users. However, application scenarios such as voice calls, speech recognition, voice wake-up, and voiceprint recognition are currently subject to many limitations in noisy or strongly noisy environments. These application scenarios generally need to be combined with speech enhancement methods to achieve good results, and enhancement of the speech front-end signal must be used together with a reasonably accurate voice activity detection (VAD). An accurate VAD that is robust to low signal-to-noise ratios therefore plays a crucial role in speech enhancement.

However, existing voice activity detection schemes have two defects. On the one hand, under low signal-to-noise ratio conditions the VAD judgment is inaccurate, which can be measured in terms of the speech hit rate and the false alarm rate: a low hit rate means that many speech frames are missed, while a high false alarm rate means that non-speech frames are judged as speech, and both have a considerable impact on speech enhancement. On the other hand, the models involved in existing VAD schemes are highly complex, for example deep-learning-based VAD methods. Deep learning methods perform well, but for embedded devices their complexity is high and they are difficult to deploy in practice; moreover, they require large amounts of speech and noise samples that must be labeled, which is difficult to realize. In summary, existing VAD methods suffer from low accuracy and low practicability.

Summary of the Invention

Embodiments of the present invention provide a voice activity detection method and a mobile terminal, so as to solve the problems of low accuracy and low practicability of existing VAD methods.

In order to solve the above technical problems, the present invention is implemented as follows:

In a first aspect, an embodiment of the present invention provides a voice activity detection method, including:

acquiring feature values of a current frame of target audio data in a plurality of preset frequency-domain subbands;

determining a speech posterior probability of the current frame by a preset classification method based on the feature values;

confirming that the current frame is a speech frame when the speech posterior probability is greater than or equal to a preset speech threshold probability.

In a second aspect, an embodiment of the present invention further provides a mobile terminal, including:

a feature value acquisition module, configured to acquire feature values of a current frame of target audio data in a plurality of preset frequency-domain subbands, the feature values including an amplitude mean and/or an amplitude variance;

a first probability acquisition module, configured to determine a speech posterior probability of the current frame by a preset classification method based on the feature values;

a speech frame confirmation module, configured to confirm that the current frame is a speech frame when the speech posterior probability is greater than or equal to a preset speech threshold probability.

In a third aspect, an embodiment of the present invention further provides a mobile terminal, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the aforementioned voice activity detection method.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the steps of the aforementioned voice activity detection method.

In the embodiments of the present invention, feature values of a current frame of target audio data in a plurality of preset frequency-domain subbands are acquired; a speech posterior probability of the current frame is determined by a preset classification method based on the feature values; and the current frame is confirmed as a speech frame when the speech posterior probability is greater than or equal to a preset speech threshold probability. This can improve the accuracy and practicability of voice activity detection.

The above description is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented according to the content of the specification, and in order to make the above and other objects, features and advantages of the present invention more apparent, specific embodiments of the present invention are set forth below.

Brief Description of the Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a flowchart of the steps of a voice activity detection method in Embodiment 1 of the present invention;

FIG. 2 is a flowchart of the steps of a voice activity detection method in Embodiment 2 of the present invention;

FIG. 3 is a schematic structural diagram of a mobile terminal in Embodiment 3 of the present invention;

FIG. 4 is a schematic structural diagram of a mobile terminal in Embodiment 4 of the present invention;

FIG. 5 is a schematic diagram of the hardware structure of a mobile terminal in Embodiment 5 of the present invention.

Detailed Description of the Embodiments

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

Embodiment 1

A voice activity detection method provided by an embodiment of the present invention is described in detail below.

Referring to FIG. 1, a flowchart of the steps of a voice activity detection method in an embodiment of the present invention is shown.

Step 110: acquire feature values of a current frame of target audio data in a plurality of preset frequency-domain subbands.

In the embodiment of the present invention, in order to improve the robustness of the voice activity detection scheme under noisy conditions while reducing its complexity, each frame of the target audio data can be divided into subbands in the frequency domain, so that the feature dimension of each audio frame is reduced to a relatively low dimension. For the audio frame currently to be detected in the target audio data, that is, the current frame, the feature values of the current frame in a plurality of preset frequency-domain subbands can then be acquired. The specific division of the frequency-domain subbands and the number of subbands can be preset according to requirements, which is not limited in this embodiment of the present invention. The feature values may include, but are not limited to, an amplitude mean and/or an amplitude variance, and so on.

For example, the frequency-domain subbands may be set to include six subbands: [80 Hz, 250 Hz], [250 Hz, 500 Hz], [500 Hz, 1 kHz], [1 kHz, 2 kHz], [2 kHz, 3 kHz] and [3 kHz, 4 kHz]. The feature values of the current frame in these six frequency-domain subbands can then be acquired respectively. The feature values of the current frame in the preset frequency-domain subbands may be acquired in any available manner, which is not limited in this embodiment of the present invention.

For example, the current frame may first be converted to the frequency domain, and the absolute value of the corresponding frequency-domain signal is taken to obtain the amplitude information of the current frame in the frequency domain; the amplitude mean of the current frame in each frequency-domain subband is then obtained from this amplitude information. Moreover, in order to incorporate some pitch period and harmonic information, the amplitude variance of the current frame in each frequency-domain subband can also be calculated from the amplitude values in that subband, so that this variance indirectly represents the pitch period and harmonic information. In addition, since low-frequency audio generally has a greater influence on voice activity detection, in order to reduce the amount of computation and improve detection efficiency, only the first N frequency-domain subbands with the lowest frequency ranges may be selected for computing the amplitude variance of the current frame. Of course, the specific content of the feature values in this embodiment can be preset according to requirements, which is not limited in this embodiment of the present invention.
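As an illustration only, the following is a minimal Python/NumPy sketch of how the per-subband amplitude mean and variance described above might be computed for a single frame; the subband boundaries are taken from the example above, while the 16 kHz sampling rate and the choice of computing the variance only for the first two low-frequency subbands are assumptions.

```python
import numpy as np

# Subband boundaries in Hz, taken from the example above.
SUBBANDS = [(80, 250), (250, 500), (500, 1000),
            (1000, 2000), (2000, 3000), (3000, 4000)]

def subband_features(frame, sample_rate=16000, n_variance_bands=2):
    """Per-subband amplitude means, plus amplitude variances for the
    first n_variance_bands (lowest-frequency) subbands of one frame."""
    spectrum = np.fft.rfft(frame)                 # convert the frame to the frequency domain
    magnitude = np.abs(spectrum)                  # amplitude information of the frame
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    means, variances = [], []
    for i, (lo, hi) in enumerate(SUBBANDS):
        band = magnitude[(freqs >= lo) & (freqs < hi)]
        means.append(band.mean())                 # amplitude mean of this subband
        if i < n_variance_bands:                  # variance only for the low subbands
            variances.append(band.var())          # indirectly carries pitch/harmonic information
    return np.array(means + variances)            # feature vector x of the current frame
```

In practice a window function would typically be applied to the frame before the FFT; that step is omitted here for brevity.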

Step 120: determine a speech posterior probability of the current frame by a preset classification method based on the feature values.

After the feature values of the current frame are obtained, the speech posterior probability of the current frame can be determined by a preset classification method based on those feature values. The classification method can be preset according to requirements, which is not limited in this embodiment of the present invention.

For example, the classification method may include, but is not limited to, a classification method based on a GMM (Gaussian Mixture Model), a classification method based on a neural network, a classification method based on an SVM (Support Vector Machine), and so on. For each classification method, the parameters involved can be pre-trained or set based on corresponding sample audio data, which may include, but is not limited to, sample speech audio data and/or sample noise audio data.

Step 130: confirm that the current frame is a speech frame when the speech posterior probability is greater than or equal to a preset speech threshold probability.

After the speech posterior probability of the current frame is obtained, it can be determined whether this probability is greater than or equal to the preset speech threshold probability. If it is, the current frame can be confirmed as a speech frame; if the speech posterior probability of the current frame is less than the preset speech threshold probability, the current frame can be confirmed as a noise frame.

The specific value of the speech threshold probability can be preset according to requirements, which is not limited in this embodiment of the present invention. For example, the speech threshold probability may be set to 0.9 or 0.95, and so on.

In the embodiment of the present invention, feature values of a current frame of target audio data in a plurality of preset frequency-domain subbands are acquired; a speech posterior probability of the current frame is determined by a preset classification method based on the feature values; and the current frame is confirmed as a speech frame when the speech posterior probability is greater than or equal to a preset speech threshold probability. This can improve the accuracy and practicability of voice activity detection.

Embodiment 2

A voice activity detection method provided by an embodiment of the present invention is described in detail below.

Referring to FIG. 2, a flowchart of the steps of a voice activity detection method in an embodiment of the present invention is shown.

Step 210: acquire feature values of a current frame of target audio data in a plurality of preset frequency-domain subbands.

Optionally, in this embodiment of the present invention, step 210 may further include:

Sub-step A211: converting the current frame into a frequency-domain signal, and acquiring feature values of the frequency-domain signal in a plurality of preset frequency-domain subbands;

or, Sub-step B211: filtering the current frame through a band-pass filter for each frequency subband to obtain the frequency component of the current frame corresponding to that frequency-domain subband, and acquiring the feature values of the frequency component in the corresponding frequency-domain subband.

Generally speaking, the target audio is a time-domain signal, while the frequency-domain subbands are intervals divided by frequency. In order to obtain the feature values of the current frame in each frequency-domain subband, the part of the current frame belonging to each frequency-domain subband therefore needs to be determined.

Accordingly, in this embodiment of the present invention, the current frame may first be converted into a frequency-domain signal, and the feature values in each frequency-domain subband can then be determined from the frequency-domain signal corresponding to the current frame. The current frame may be converted from the time domain to the frequency domain in any available manner, which is not limited in this embodiment of the present invention. For example, the current frame may be converted into a frequency-domain signal based on the Fast Fourier Transform (FFT), and so on.

Alternatively, the current frame may be filtered by the band-pass filter corresponding to each frequency-domain subband, so as to obtain the frequency component of the current frame corresponding to that subband; the feature values of each frequency component in its frequency subband can then be obtained from these frequency components. The band-pass filter corresponding to each frequency-domain subband filters out the frequency components of the current frame that belong to that frequency subband.
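As a sketch of the band-pass filtering alternative (sub-step B211), the snippet below uses a SciPy Butterworth band-pass filter per subband to isolate the subband's frequency component in the time domain before taking its amplitude mean; the filter type, filter order, and sampling rate are assumptions made for illustration, not mandated by this description.

```python
import numpy as np
from scipy.signal import butter, lfilter

def bandpass_amplitude_mean(frame, lo, hi, sample_rate=16000, order=4):
    """Filter the current frame to one frequency subband and return the
    amplitude mean of the resulting frequency component."""
    b, a = butter(order, [lo, hi], btype='bandpass', fs=sample_rate)  # assumed 4th-order Butterworth
    component = lfilter(b, a, frame)          # frequency component of this subband
    return np.mean(np.abs(component))         # amplitude mean in the subband

# Hypothetical usage with the subband list from the earlier sketch:
# means = [bandpass_amplitude_mean(frame, lo, hi) for (lo, hi) in SUBBANDS]
```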

However, in order to reduce the device complexity and the complexity of feature value acquisition, the feature values of the current frame in each frequency subband are preferably acquired in the manner described in sub-step A211.

Step 220: acquire the speech feature values of each speech audio frame of the speech audio data in each of the frequency-domain subbands, and the noise feature values of each noise audio frame of the noise audio data in each of the frequency-domain subbands.

In this embodiment of the present invention, if voice activity detection is performed based on a GMM model, the parameters in the GMM model need to be initialized. The two-component GMM model can be expressed as follows:

p(x) = π1·N(x; μ1, Σ1) + π2·N(x; μ2, Σ2)

where N(x; μk, Σk) denotes the k-th component of the GMM model (k takes the value 1 or 2), μ1 and μ2 denote the speech feature mean parameter and the noise feature mean parameter respectively, Σ1 and Σ2 denote the speech covariance parameter and the noise covariance parameter of the Gaussian mixture model respectively, π1 and π2 denote the speech frame prior probability and the noise frame prior probability respectively, x denotes the feature vector composed of the feature values of the current frame to be detected, and p(x) denotes the probability. The values of π1 and π2 can be preset according to requirements, which is not limited in this embodiment of the present invention. For example, π1 and π2 may both be set to 0.5; or π1 may be set to 0.3 and π2 to 0.7; and so on. In practical applications, an audio frame is generally more likely to be a noise frame than a speech frame, so π2 can be set to be greater than π1.

Then, in order to initialize μ1, μ2, Σ1 and Σ2 in the GMM model while ensuring its accuracy, the speech feature values of each speech audio frame of the preset speech audio in each of the frequency-domain subbands, and the noise feature values of each noise audio frame of the preset noise audio in each of the frequency-domain subbands, can be acquired. The speech feature values and noise feature values may likewise include an amplitude mean and/or an amplitude variance, and they can be acquired in the same way as the feature values of the current frame described above, which will not be repeated here.

Step 230: obtain the speech feature mean parameter and the speech covariance parameter of the Gaussian mixture model according to the speech feature values.

In the embodiment of the present invention, in order to improve the accuracy of the speech feature mean parameter and the speech covariance parameter, at least one piece of speech audio data may be selected in advance to determine these parameters, and each piece of speech audio data may in turn contain at least one speech audio frame. If the preset speech audio data contains multiple speech audio frames, each frequency-domain subband may correspond to multiple speech feature values of the same dimension.

For example, suppose the speech feature values contain the speech amplitude mean in each frequency-domain subband, the preset speech audio data contains three speech audio frames a1, a2 and a3, and the frequency-domain subbands are divided as above into [80 Hz, 250 Hz], [250 Hz, 500 Hz], [500 Hz, 1 kHz], [1 kHz, 2 kHz], [2 kHz, 3 kHz] and [3 kHz, 4 kHz]. Then the speech audio frames a1, a2 and a3 have speech amplitude means AVG11, AVG12 and AVG13 respectively in the frequency-domain subband [80 Hz, 250 Hz], and correspondingly speech amplitude means AVG21, AVG22 and AVG23 in the frequency-domain subband [250 Hz, 500 Hz], and so on. The speech amplitude mean finally corresponding to each frequency-domain subband can then be the average of the speech amplitude means of the preset speech audio frames in that subband; for example, for the above frequency-domain subband [250 Hz, 500 Hz], the final corresponding speech amplitude mean can be (AVG21+AVG22+AVG23)/3.

Correspondingly, if the speech feature values contain the speech amplitude variance in each frequency-domain subband, and the speech amplitude variances of the speech audio frames a1, a2 and a3 in the frequency-domain subband [80 Hz, 250 Hz] are V11, V12 and V13 respectively, then for the frequency-domain subband [80 Hz, 250 Hz] the final corresponding speech amplitude variance can be (V11+V12+V13)/3.

In this embodiment of the present invention, to avoid confusion, each type of feature value of each frequency-domain subband can be defined as one feature dimension, so that the speech amplitude mean corresponding to each frequency-domain subband is one feature dimension and the speech amplitude variance corresponding to each frequency-domain subband is another feature dimension. In order to obtain the averages of the speech feature values, the average of the speech feature values in each dimension can first be obtained and recorded as the first speech feature mean of that feature dimension; the average of the first speech feature means over all feature dimensions is then obtained to give the speech feature mean parameter of the Gaussian mixture model, and the covariance of the first speech feature means over all feature dimensions is obtained to give the speech covariance parameter of the Gaussian mixture model.

For example, for the above frequency-domain subbands, suppose the speech feature values include the speech amplitude mean of each speech audio frame in each frequency-domain subband and the speech amplitude variance in the first two frequency-domain subbands. With the subband division above there are six frequency-domain subbands, so the speech feature values have eight feature dimensions. The average of the speech amplitude means corresponding to the frequency-domain subband [80 Hz, 250 Hz] is (AVG11+AVG12+AVG13)/3, denoted b1; the average corresponding to [250 Hz, 500 Hz] is (AVG21+AVG22+AVG23)/3, denoted b2; and so on up to the average corresponding to [3 kHz, 4 kHz], denoted b6. In addition, the average of the speech amplitude variances corresponding to [80 Hz, 250 Hz] is denoted c1, and the average of the speech amplitude variances corresponding to [250 Hz, 500 Hz] is denoted c2. For the preset speech audio data, the speech feature mean parameter of the GMM model is then obtained from the average of b1, b2, ..., b6, c1 and c2, and the speech covariance parameter of the GMM model is obtained from the covariance of b1, b2, ..., b6, c1 and c2.
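The sketch below shows one way the speech-side parameters of step 230 (and, analogously, the noise-side parameters of step 240) could be initialized from a matrix of per-frame feature vectors. Note that it keeps the mean as a per-dimension vector and uses the sample covariance matrix across training frames, which is the standard multivariate form; the additional averaging of the per-dimension means described above is not performed here, so this is a simplified, hypothetical reading rather than the exact procedure of the description.

```python
import numpy as np

def init_class_params(features):
    """features: array of shape (num_frames, num_dims), one row of subband
    feature values (e.g. b1..b6, c1, c2) per training frame of one class."""
    mean_param = features.mean(axis=0)             # per-dimension feature means
    cov_param = np.cov(features, rowvar=False)     # sample covariance across frames
    return mean_param, cov_param

# Hypothetical usage:
# mu_speech, cov_speech = init_class_params(speech_feature_matrix)   # step 230
# mu_noise, cov_noise = init_class_params(noise_feature_matrix)      # step 240
```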

Step 240: obtain the noise feature mean parameter and the noise covariance parameter of the Gaussian mixture model according to the noise feature values.

For the noise feature values, the processing applied to the speech feature values in the above steps can be followed correspondingly to obtain the noise feature mean parameter and the noise covariance parameter of the Gaussian mixture model. For details, reference may be made to step 230 above, which will not be repeated here.

Step 250: determine the speech posterior probability of the current frame through a preset Gaussian mixture model based on the feature values and preset prior probabilities, where the model parameters in the Gaussian mixture model are values determined based on preset speech audio data and noise audio data.

At this point, the model parameters in the Gaussian mixture model are the speech feature mean parameter, speech covariance parameter, noise feature mean parameter and noise covariance parameter described above. After the parameters of the Gaussian mixture model have been determined, the speech posterior probability of the current frame can be determined through the corresponding Gaussian mixture model based on the feature values of the current frame and the preset prior probabilities.

The specific values of the prior probabilities can be preset according to requirements, which is not limited in this embodiment of the present invention. For example, for convenience the prior probabilities can be regarded as fixed values, such as setting both the speech prior probability and the noise prior probability to 0.5; or the speech prior probability may be set to 0.3 and the noise prior probability to 0.7; and so on.
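As an illustration of steps 250 and 260, the sketch below evaluates the two-component GMM on the current frame's feature vector using the standard Bayes form of the posterior, and compares the result with the speech threshold probability; the SciPy multivariate normal density and the parameter names carried over from the earlier initialization sketch are assumptions.

```python
from scipy.stats import multivariate_normal

def speech_posterior(x, mu_speech, cov_speech, mu_noise, cov_noise,
                     prior_speech=0.5, prior_noise=0.5):
    """Posterior probability that feature vector x belongs to the speech component."""
    p_speech = prior_speech * multivariate_normal.pdf(x, mean=mu_speech, cov=cov_speech)
    p_noise = prior_noise * multivariate_normal.pdf(x, mean=mu_noise, cov=cov_noise)
    return p_speech / (p_speech + p_noise)

# Step 260: compare against the preset speech threshold probability (e.g. 0.9).
# is_speech_frame = speech_posterior(x, mu_speech, cov_speech, mu_noise, cov_noise) >= 0.9
```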

Step 260: confirm that the current frame is a speech frame when the speech posterior probability is greater than or equal to a preset speech threshold probability.

Step 270: if the current frame is confirmed to be a speech frame, optimize the speech feature mean parameter and the speech covariance parameter of the Gaussian mixture model based on the speech feature values of the current frame in each of the frequency-domain subbands.

If the current frame is confirmed to be a speech frame, the current frame can be used as training data to optimize the speech feature mean parameter and the speech covariance parameter of the Gaussian mixture model. The specific optimization method can be preset according to requirements, which is not limited in this embodiment of the present invention. For example, smoothing can be used: the feature mean corresponding to the speech feature values of the current frame is determined, and the speech feature mean parameter in the GMM model is smoothed so as to incorporate the feature mean information of the current frame. Similarly, the feature covariance corresponding to the speech feature values of the current frame is calculated, and the speech covariance parameter in the GMM model is then smoothed according to the feature covariance of the current frame.

For example, the smoothing parameters may be set to 0.1 and 0.9. Suppose the speech feature mean parameter and the speech covariance parameter in the GMM model are s and t respectively, and the feature mean and feature covariance of the current frame confirmed as a speech frame are s1 and t1 respectively. Then the speech feature mean parameter s' of the smoothed GMM model is 0.9*s + 0.1*s1, and the speech covariance parameter t' of the smoothed GMM model is 0.9*t + 0.1*t1. Of course, the specific smoothing method and smoothing parameters can be customized according to requirements, which is not limited in this embodiment of the present invention.
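A minimal sketch of the smoothing update described above, using the example smoothing parameters 0.9 and 0.1. Here the current frame's feature vector is taken directly as its feature mean s1, and its feature covariance t1 is approximated by the outer product of its deviation from the model mean; both choices are assumptions made for illustration, since the description leaves the per-frame statistics unspecified.

```python
import numpy as np

def smooth_update(mean_param, cov_param, x, alpha=0.1):
    """Smooth the class parameters toward the current frame's statistics
    (step 270 for a speech frame, step 280 for a noise frame)."""
    new_mean = (1 - alpha) * mean_param + alpha * x        # s' = 0.9*s + 0.1*s1
    dev = x - mean_param
    frame_cov = np.outer(dev, dev)                         # assumed per-frame feature covariance t1
    new_cov = (1 - alpha) * cov_param + alpha * frame_cov  # t' = 0.9*t + 0.1*t1
    return new_mean, new_cov
```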

Step 280: if the current frame is confirmed to be a noise frame, optimize the noise feature mean parameter and the noise covariance parameter of the Gaussian mixture model based on the noise feature values of the current frame in each of the frequency-domain subbands.

Correspondingly, if the current frame is confirmed to be a noise frame, the noise feature mean parameter and the noise covariance parameter of the Gaussian mixture model can be optimized based on the noise feature values of the current frame in each of the frequency-domain subbands. This can be done in a similar way to the optimization of the speech feature mean parameter and the speech covariance parameter in the GMM model using a speech frame; reference may be made to step 270 above, which will not be repeated here.

Of course, in the embodiment of the present invention, the specific way in which the noise feature mean parameter and the noise covariance parameter of the Gaussian mixture model are optimized using a noise frame may be the same as the way in which the speech feature mean parameter and the speech covariance parameter are optimized using a speech frame, but it need not be identical; it can be customized according to requirements, which is not limited in this embodiment of the present invention.

Step 290: determine the speech posterior probability of the next audio frame after the current frame based on the adjusted Gaussian mixture model.

After the parameters in the Gaussian mixture model have been adjusted, the accuracy of the GMM model can be improved to a certain extent. The speech posterior probability of the next audio frame after the current frame can then be determined based on the adjusted Gaussian mixture model, and it can be confirmed whether the next frame is a speech frame; this is repeated in turn until the last audio frame of the audio data.
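Putting the earlier sketches together, the loop below illustrates the adaptive frame-by-frame processing of step 290: each decision updates the corresponding class parameters, and the next frame is evaluated with the adjusted model. All helper functions are the hypothetical sketches introduced above, not part of the original description.

```python
def detect_voice_activity(frames, mu_s, cov_s, mu_n, cov_n, threshold=0.9):
    """Label each audio frame as speech (True) or noise (False),
    adjusting the GMM parameters after every decision."""
    labels = []
    for frame in frames:
        x = subband_features(frame)                          # steps 110 / 210
        p = speech_posterior(x, mu_s, cov_s, mu_n, cov_n)    # steps 120 / 250
        is_speech = p >= threshold                           # steps 130 / 260
        if is_speech:
            mu_s, cov_s = smooth_update(mu_s, cov_s, x)      # step 270
        else:
            mu_n, cov_n = smooth_update(mu_n, cov_n, x)      # step 280
        labels.append(is_speech)
    return labels                                            # each next frame uses the adjusted model (step 290)
```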

Further, since in practical applications speech frequencies are more likely to lie in the low-frequency range, a more detailed division of the lower frequencies can improve the accuracy of voice activity detection. Therefore, in the embodiment of the present invention, the frequency-domain subbands are set based on the sampling frequency of the target audio data, and a frequency subband with a lower starting frequency corresponds to a smaller frequency range. This likewise improves the accuracy of voice activity detection.

For example, for the frequency subbands [80 Hz, 250 Hz], [250 Hz, 500 Hz], [500 Hz, 1 kHz], [1 kHz, 2 kHz], [2 kHz, 3 kHz] and [3 kHz, 4 kHz] above, the frequency range corresponding to the subband [80 Hz, 250 Hz] starting at 80 Hz is 250-80 = 170 Hz, while the frequency range corresponding to the subband [250 Hz, 500 Hz] starting at 250 Hz is 250 Hz. Clearly, 250 Hz is greater than 170 Hz, so the lower-starting subband is narrower, which makes the extracted feature values more accurate and effective.

In the embodiment of the present invention, feature values of a current frame of target audio data in a plurality of preset frequency-domain subbands are acquired; a speech posterior probability of the current frame is determined by a preset classification method based on the feature values; and the current frame is confirmed as a speech frame when the speech posterior probability is greater than or equal to a preset speech threshold probability. This can improve the accuracy and practicability of voice activity detection.

Moreover, in the embodiment of the present invention, the current frame may be converted into a frequency-domain signal, and the feature values of the frequency-domain signal in the plurality of preset frequency-domain subbands may be acquired; or the current frame may be filtered through a band-pass filter for each frequency subband to obtain the frequency component of the current frame corresponding to that frequency-domain subband, and the feature values of that frequency component in the frequency-domain subband may be acquired. This can improve the accuracy of the feature values and hence the accuracy of voice activity detection.

In addition, in the embodiment of the present invention, the speech posterior probability of the current frame may be determined through a preset Gaussian mixture model based on the feature values and preset prior probabilities, where the model parameters in the Gaussian mixture model are values determined based on preset speech audio data and noise audio data. The speech feature values of each speech audio frame of the speech audio data in each of the frequency-domain subbands, and the noise feature values of each noise audio frame of the noise audio data in each of the frequency-domain subbands, are acquired; the speech feature mean parameter and the speech covariance parameter of the Gaussian mixture model are obtained according to the speech feature values; and the noise feature mean parameter and the noise covariance parameter of the Gaussian mixture model are obtained according to the noise feature values. Furthermore, if the current frame is confirmed to be a speech frame, the speech feature mean parameter and the speech covariance parameter of the Gaussian mixture model are optimized based on the speech feature values of the current frame in each of the frequency-domain subbands; if the current frame is confirmed to be a noise frame, the noise feature mean parameter and the noise covariance parameter of the Gaussian mixture model are optimized based on the noise feature values of the current frame in each of the frequency-domain subbands; and the speech posterior probability of the next audio frame after the current frame is determined based on the adjusted Gaussian mixture model. This reduces the complexity of the voice activity detection model and thereby improves the practicability and accuracy of voice activity detection.

Embodiment 3

A mobile terminal provided by an embodiment of the present invention is described in detail below.

Referring to FIG. 3, a schematic structural diagram of a mobile terminal in an embodiment of the present invention is shown.

The mobile terminal 300 in the embodiment of the present invention includes: a feature value acquisition module 310, a first probability acquisition module 320 and a speech frame confirmation module 330.

The functions of each module and the interactions between the modules are described in detail below.

The feature value acquisition module 310 is configured to acquire feature values of a current frame of target audio data in a plurality of preset frequency-domain subbands, the feature values including an amplitude mean and/or an amplitude variance.

The first probability acquisition module 320 is configured to determine a speech posterior probability of the current frame by a preset classification method based on the feature values.

The speech frame confirmation module 330 is configured to confirm that the current frame is a speech frame when the speech posterior probability is greater than or equal to a preset speech threshold probability.

The mobile terminal provided in this embodiment of the present invention can implement each process implemented by the mobile terminal in the method embodiments of FIG. 1 to FIG. 2, and to avoid repetition, details are not described here.

In the embodiment of the present invention, feature values of a current frame of target audio data in a plurality of preset frequency-domain subbands are acquired; a speech posterior probability of the current frame is determined by a preset classification method based on the feature values; and the current frame is confirmed as a speech frame when the speech posterior probability is greater than or equal to a preset speech threshold probability. This can improve the accuracy and practicability of voice activity detection.

Embodiment 4

A mobile terminal provided by an embodiment of the present invention is described in detail below.

Referring to FIG. 4, a schematic structural diagram of a mobile terminal in an embodiment of the present invention is shown.

The mobile terminal 400 in the embodiment of the present invention includes: a feature value acquisition module 410, a training audio feature acquisition module 420, a speech parameter acquisition module 430, a noise parameter acquisition module 440, a first probability acquisition module 450, a speech parameter optimization module 460, a noise parameter optimization module 470, a second probability acquisition module 480 and a speech frame confirmation module 490.

The functions of each module and the interactions between the modules are described in detail below.

The feature value acquisition module 410 is configured to acquire feature values of a current frame of target audio data in a plurality of preset frequency-domain subbands, the feature values including an amplitude mean and/or an amplitude variance.

Optionally, in this embodiment of the present invention, the feature value acquisition module 410 may further include:

a first feature value acquisition submodule, configured to convert the current frame into a frequency-domain signal and acquire feature values of the frequency-domain signal in a plurality of preset frequency-domain subbands;

or, a second feature value acquisition submodule, configured to filter the current frame through a band-pass filter for each frequency subband to obtain the frequency component of the current frame corresponding to that frequency-domain subband, and acquire the feature values of the frequency component in the corresponding frequency-domain subband.

The training audio feature acquisition module 420 is configured to acquire the speech feature values of each speech audio frame of the speech audio data in each of the frequency-domain subbands, and the noise feature values of each noise audio frame of the noise audio data in each of the frequency-domain subbands.

The speech parameter acquisition module 430 is configured to obtain the speech feature mean parameter and the speech covariance parameter of the Gaussian mixture model according to the speech feature values.

The noise parameter acquisition module 440 is configured to obtain the noise feature mean parameter and the noise covariance parameter of the Gaussian mixture model according to the noise feature values.

The first probability acquisition module 450 is configured to determine a speech posterior probability of the current frame by a preset classification method based on the feature values.

Optionally, in this embodiment of the present invention, the first probability acquisition module 450 may further include:

a probability acquisition submodule 451, configured to determine the speech posterior probability of the current frame through a preset Gaussian mixture model based on the feature values and preset prior probabilities, where the model parameters in the Gaussian mixture model are values determined based on preset speech audio data and noise audio data.

The speech parameter optimization module 460 is configured to, if the current frame is confirmed to be a speech frame, optimize the speech feature mean parameter and the speech covariance parameter of the Gaussian mixture model based on the speech feature values of the current frame in each of the frequency-domain subbands.

The noise parameter optimization module 470 is configured to, if the current frame is confirmed to be a noise frame, optimize the noise feature mean parameter and the noise covariance parameter of the Gaussian mixture model based on the noise feature values of the current frame in each of the frequency-domain subbands.

The second probability acquisition module 480 is configured to determine the speech posterior probability of the next audio frame after the current frame based on the adjusted Gaussian mixture model.

The speech frame confirmation module 490 is configured to confirm that the current frame is a speech frame when the speech posterior probability is greater than or equal to a preset speech threshold probability.

Optionally, in this embodiment of the present invention, the frequency-domain subbands are set based on the sampling frequency of the target audio data, and a frequency subband with a lower starting frequency corresponds to a smaller frequency range.

The mobile terminal provided in this embodiment of the present invention can implement each process implemented by the mobile terminal in the method embodiments of FIG. 1 to FIG. 2, and to avoid repetition, details are not described here.

In the embodiment of the present invention, feature values of a current frame of target audio data in a plurality of preset frequency-domain subbands are acquired; a speech posterior probability of the current frame is determined by a preset classification method based on the feature values; and the current frame is confirmed as a speech frame when the speech posterior probability is greater than or equal to a preset speech threshold probability. This can improve the accuracy and practicability of voice activity detection.

Moreover, in the embodiment of the present invention, the current frame may be converted into a frequency-domain signal, and the feature values of the frequency-domain signal in the plurality of preset frequency-domain subbands may be acquired; or the current frame may be filtered through a band-pass filter for each frequency subband to obtain the frequency component of the current frame corresponding to that frequency-domain subband, and the feature values of that frequency component in the frequency-domain subband may be acquired. This can improve the accuracy of the feature values and hence the accuracy of voice activity detection.

In addition, in the embodiment of the present invention, the speech posterior probability of the current frame may be determined through a preset Gaussian mixture model based on the feature values and preset prior probabilities, where the model parameters in the Gaussian mixture model are values determined based on preset speech audio data and noise audio data. The speech feature values of each speech audio frame of the speech audio data in each of the frequency-domain subbands, and the noise feature values of each noise audio frame of the noise audio data in each of the frequency-domain subbands, are acquired; the speech feature mean parameter and the speech covariance parameter of the Gaussian mixture model are obtained according to the speech feature values; and the noise feature mean parameter and the noise covariance parameter of the Gaussian mixture model are obtained according to the noise feature values. Furthermore, if the current frame is confirmed to be a speech frame, the speech feature mean parameter and the speech covariance parameter of the Gaussian mixture model are optimized based on the speech feature values of the current frame in each of the frequency-domain subbands; if the current frame is confirmed to be a noise frame, the noise feature mean parameter and the noise covariance parameter of the Gaussian mixture model are optimized based on the noise feature values of the current frame in each of the frequency-domain subbands; and the speech posterior probability of the next audio frame after the current frame is determined based on the adjusted Gaussian mixture model. This reduces the complexity of the voice activity detection model and thereby improves the practicability and accuracy of voice activity detection.

Further, in the embodiment of the present invention, the frequency-domain subbands are set based on the sampling frequency of the target audio data, and a frequency subband with a lower starting frequency corresponds to a smaller frequency range. This likewise improves the accuracy of voice activity detection.

实施例五Embodiment 5

图5为实现本发明各个实施例的一种移动终端的硬件结构示意图。FIG. 5 is a schematic diagram of a hardware structure of a mobile terminal implementing various embodiments of the present invention.

该移动终端500包括但不限于:射频单元501、网络模块502、音频输出单元503、输入单元504、传感器505、显示单元506、用户输入单元507、接口单元508、存储器509、处理器510、以及电源511等部件。本领域技术人员可以理解,图5中示出的移动终端结构并不构成对移动终端的限定,移动终端可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。在本发明实施例中,移动终端包括但不限于手机、平板电脑、笔记本电脑、掌上电脑、车载终端、可穿戴设备、以及计步器等。The mobile terminal 500 includes but is not limited to: a radio frequency unit 501, a network module 502, an audio output unit 503, an input unit 504, a sensor 505, a display unit 506, a user input unit 507, an interface unit 508, a memory 509, a processor 510, and Power 511 and other components. Those skilled in the art can understand that the structure of the mobile terminal shown in FIG. 5 does not constitute a limitation on the mobile terminal, and the mobile terminal may include more or less components than the one shown, or combine some components, or different components layout. In this embodiment of the present invention, the mobile terminal includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.

The processor 510 is configured to acquire the feature values of the current frame of the target audio data in a plurality of preset frequency domain subbands; determine the speech posterior probability of the current frame through a preset classification method based on the feature values; and confirm that the current frame is a speech frame when the speech posterior probability is greater than or equal to a preset speech threshold probability.

In the embodiments of the present invention, the feature values of the current frame of the target audio data in a plurality of preset frequency domain subbands are acquired; the speech posterior probability of the current frame is determined through a preset classification method based on the feature values; and the current frame is confirmed to be a speech frame when the speech posterior probability is greater than or equal to the preset speech threshold probability. The accuracy and practicability of voice activity detection can thereby be improved.

Moreover, in the embodiments of the present invention, the current frame may also be converted into a frequency domain signal, and the feature values of the frequency domain signal in the plurality of preset frequency domain subbands may be acquired; alternatively, the current frame may be filtered by a band-pass filter associated with each frequency subband to obtain the frequency component of the current frame corresponding to that frequency domain subband, and the feature value for that frequency domain subband may be acquired from the frequency component. This improves the accuracy of the feature values and, in turn, the accuracy of voice activity detection.
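As a minimal sketch of the frequency-domain branch of this step (log-energy features, a Hann window, and the helper name subband_log_energies are assumptions introduced here, not requirements of the embodiment):

import numpy as np

def subband_log_energies(frame, sample_rate, band_edges_hz):
    # frame: one time-domain audio frame of the target audio data.
    # band_edges_hz: list of (low, high) pairs describing the preset
    # frequency domain subbands (an assumed representation).
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hanning(len(frame))               # reduce spectral leakage
    power = np.abs(np.fft.rfft(windowed)) ** 2              # frequency domain signal (power spectrum)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    features = []
    for low, high in band_edges_hz:
        in_band = (freqs >= low) & (freqs < high)
        features.append(np.log(power[in_band].sum() + 1e-12))  # log energy of this subband
    return np.array(features)

The band-pass filtering branch would yield an equivalent feature vector by summing the energy of each filtered component instead of masking the spectrum.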

In addition, in this embodiment of the present invention, the speech posterior probability of the current frame may also be determined through a preset Gaussian mixture model based on the feature values and a preset prior probability, where the model parameters of the Gaussian mixture model are values determined from preset speech audio data and noise audio data. Specifically, the speech feature value of each speech audio frame of the speech audio data in each frequency domain subband, and the noise feature value of each noise audio frame of the noise audio data in each frequency domain subband, are acquired; the speech feature mean parameter and the speech covariance parameter of the Gaussian mixture model are obtained from the speech feature values, and the noise feature mean parameter and the noise covariance parameter are obtained from the noise feature values. Furthermore, if the current frame is confirmed to be a speech frame, the speech feature mean parameter and the speech covariance parameter of the Gaussian mixture model are optimized based on the speech feature values of the current frame in each frequency domain subband; if the current frame is confirmed to be a noise frame, the noise feature mean parameter and the noise covariance parameter of the Gaussian mixture model are optimized based on the noise feature values of the current frame in each frequency domain subband; and the speech posterior probability of the frame following the current frame is determined based on the adjusted Gaussian mixture model. This reduces the complexity of the voice activity detection model and thereby improves the practicability and accuracy of voice activity detection.
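A simplified sketch of this classification and adaptation step is given below, assuming one Gaussian component per class (speech and noise) with diagonal covariance and an exponential-forgetting update; the class name TwoClassGaussianVad, the adaptation rate alpha, and the default prior are assumptions of this sketch rather than requirements of the embodiment:

import numpy as np

class TwoClassGaussianVad:
    def __init__(self, mu_s, var_s, mu_n, var_n, prior_speech=0.5, alpha=0.05):
        # mu_s/var_s and mu_n/var_n are the speech and noise feature mean and
        # covariance parameters estimated from the preset speech and noise audio data.
        self.mu_s, self.var_s = mu_s, var_s
        self.mu_n, self.var_n = mu_n, var_n
        self.prior_speech = prior_speech       # preset prior probability of speech
        self.alpha = alpha                     # adaptation rate (assumed value)

    @staticmethod
    def _log_gauss(x, mu, var):
        # Log-likelihood of x under a diagonal-covariance Gaussian.
        return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)

    def posterior(self, features):
        # Speech posterior probability of the current frame from its subband features.
        log_s = self._log_gauss(features, self.mu_s, self.var_s) + np.log(self.prior_speech)
        log_n = self._log_gauss(features, self.mu_n, self.var_n) + np.log(1.0 - self.prior_speech)
        m = max(log_s, log_n)                              # shift for numerical stability
        p_s, p_n = np.exp(log_s - m), np.exp(log_n - m)
        return p_s / (p_s + p_n)

    def update(self, features, is_speech):
        # Optimize the parameters of the class the current frame was assigned to,
        # so that the adjusted model is used for the next frame.
        if is_speech:
            self.mu_s += self.alpha * (features - self.mu_s)
            self.var_s += self.alpha * ((features - self.mu_s) ** 2 - self.var_s)
        else:
            self.mu_n += self.alpha * (features - self.mu_n)
            self.var_n += self.alpha * ((features - self.mu_n) ** 2 - self.var_n)

# Hypothetical per-frame loop, where speech_threshold is the preset speech threshold probability:
# p = vad.posterior(features)
# is_speech_frame = p >= speech_threshold
# vad.update(features, is_speech_frame)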

Further, in the embodiments of the present invention, the frequency domain subbands are set based on the sampling frequency of the target audio data, and a frequency subband with a lower starting frequency corresponds to a smaller frequency range. This likewise improves the accuracy of voice activity detection.
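One way to realize such a subband layout is sketched below; Mel-style spacing and the helper name mel_style_band_edges are assumptions of this sketch, since the embodiment only requires that lower-frequency subbands be narrower:

import numpy as np

def mel_style_band_edges(sample_rate, num_bands):
    # Subband edges from 0 Hz to the Nyquist frequency that widen with
    # frequency, so a subband with a lower starting frequency spans a
    # smaller frequency range.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), num_bands + 1)
    edges_hz = mel_to_hz(mel_points)
    return list(zip(edges_hz[:-1], edges_hz[1:]))

# Example: mel_style_band_edges(16000, 12) yields 12 subbands whose widths grow
# from under two hundred hertz near 0 Hz to over a kilohertz near 8 kHz.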

It should be understood that, in the embodiments of the present invention, the radio frequency unit 501 may be used for receiving and sending signals while sending and receiving information or during a call. Specifically, downlink data from a base station is received and then delivered to the processor 510 for processing, and uplink data is sent to the base station. Generally, the radio frequency unit 501 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 501 may also communicate with the network and other devices through a wireless communication system.

The mobile terminal provides the user with wireless broadband Internet access through the network module 502, for example helping the user send and receive e-mails, browse web pages, and access streaming media.

The audio output unit 503 may convert audio data received by the radio frequency unit 501 or the network module 502, or stored in the memory 509, into an audio signal and output it as sound. Moreover, the audio output unit 503 may also provide audio output related to a specific function performed by the mobile terminal 500 (for example, a call signal reception sound or a message reception sound). The audio output unit 503 includes a speaker, a buzzer, a receiver, and the like.

The input unit 504 is used to receive audio or video signals. The input unit 504 may include a graphics processing unit (GPU) 5041 and a microphone 5042. The graphics processor 5041 processes image data of still pictures or video obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode. The processed image frames may be displayed on the display unit 506. The image frames processed by the graphics processor 5041 may be stored in the memory 509 (or another storage medium) or transmitted via the radio frequency unit 501 or the network module 502. The microphone 5042 can receive sound and process it into audio data. In a telephone call mode, the processed audio data may be converted into a format that can be sent to a mobile communication base station via the radio frequency unit 501 and output.

The mobile terminal 500 also includes at least one sensor 505, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor and a proximity sensor; the ambient light sensor can adjust the brightness of the display panel 5061 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 5061 and/or the backlight when the mobile terminal 500 is moved to the ear. As a kind of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used to identify the posture of the mobile terminal (such as switching between landscape and portrait modes, related games, and magnetometer posture calibration) and for vibration-recognition functions (such as a pedometer or tapping). The sensor 505 may also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, and the like, which are not described in detail here.

The display unit 506 is used to display information input by the user or information provided to the user. The display unit 506 may include a display panel 5061, which may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like.

The user input unit 507 may be used to receive input numeric or character information and to generate key signal input related to user settings and function control of the mobile terminal. Specifically, the user input unit 507 includes a touch panel 5071 and other input devices 5072. The touch panel 5071, also referred to as a touch screen, can collect the user's touch operations on or near it (such as operations performed by the user on or near the touch panel 5071 with a finger, a stylus, or any other suitable object or accessory). The touch panel 5071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 510, and receives and executes commands sent by the processor 510. In addition, the touch panel 5071 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 5071, the user input unit 507 may also include other input devices 5072. Specifically, the other input devices 5072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick, which are not described in detail here.

Further, the touch panel 5071 may cover the display panel 5061. When the touch panel 5071 detects a touch operation on or near it, it transmits the operation to the processor 510 to determine the type of the touch event, and the processor 510 then provides a corresponding visual output on the display panel 5061 according to the type of the touch event. Although in FIG. 5 the touch panel 5071 and the display panel 5061 are shown as two independent components implementing the input and output functions of the mobile terminal, in some embodiments the touch panel 5071 and the display panel 5061 may be integrated to implement the input and output functions of the mobile terminal, which is not specifically limited here.

The interface unit 508 is an interface for connecting an external device to the mobile terminal 500. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 508 may be used to receive input (for example, data information or power) from an external device and transmit the received input to one or more elements within the mobile terminal 500, or may be used to transfer data between the mobile terminal 500 and the external device.

The memory 509 may be used to store software programs and various data. The memory 509 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required for at least one function (such as a sound playback function or an image playback function), and the like, and the data storage area may store data created according to the use of the mobile phone (such as audio data or a phone book). In addition, the memory 509 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.

The processor 510 is the control center of the mobile terminal. It connects the various parts of the entire mobile terminal through various interfaces and lines, and performs the various functions of the mobile terminal and processes data by running or executing the software programs and/or modules stored in the memory 509 and calling the data stored in the memory 509, thereby monitoring the mobile terminal as a whole. The processor 510 may include one or more processing units; preferably, the processor 510 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 510.

The mobile terminal 500 may also include a power supply 511 (such as a battery) for supplying power to the various components. Preferably, the power supply 511 may be logically connected to the processor 510 through a power management system, so as to implement functions such as charging management, discharging management, and power consumption management through the power management system.

In addition, the mobile terminal 500 includes some functional modules that are not shown, which are not described in detail here.

Preferably, an embodiment of the present invention further provides a mobile terminal, including a processor 510, a memory 509, and a computer program stored in the memory 509 and executable on the processor 510. When the computer program is executed by the processor 510, the various processes of the above voice activity detection method embodiments are implemented and the same technical effects can be achieved; to avoid repetition, they are not described again here.

An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the various processes of the above voice activity detection method embodiments are implemented and the same technical effects can be achieved; to avoid repetition, they are not described again here. The computer-readable storage medium is, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

It should be noted that, herein, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes the element.

From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the various embodiments of the present invention.

The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above specific implementations, which are merely illustrative rather than restrictive. Inspired by the present invention, those of ordinary skill in the art can devise many other forms without departing from the spirit of the present invention and the scope protected by the claims, all of which fall within the protection of the present invention.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered to be beyond the scope of the present invention.

Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.

In the embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For instance, the division of the units is only a division by logical function; in actual implementation there may be other ways of division, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, the part contributing to the prior art, or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.

The above are only specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed by the present invention, and these should all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

CN201910143186.9A · 2019-02-26 (priority date) · 2019-02-26 (filing date) · A kind of voice activity detection method, mobile terminal · Pending · CN109754823A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910143186.9A (published as CN109754823A) | 2019-02-26 | 2019-02-26 | A kind of voice activity detection method, mobile terminal

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201910143186.9A (published as CN109754823A) | 2019-02-26 | 2019-02-26 | A kind of voice activity detection method, mobile terminal

Publications (1)

Publication Number | Publication Date
CN109754823A | 2019-05-14

Family

ID=66406725

Family Applications (1)

Application Number | Status | Publication
CN201910143186.9A | Pending | CN109754823A (en)

Country Status (1)

Country | Link
CN (1) | CN109754823A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101080765A (en)* | 2005-05-09 | 2007-11-28 | 株式会社东芝 | Voice activity detection apparatus and method
CN101548313A (en)* | 2006-11-16 | 2009-09-30 | 国际商业机器公司 | Voice activity detection system and method
KR20120014755A (en)* | 2010-08-10 | 2012-02-20 | 연세대학교 산학협력단 | Apparatus and method for detecting audio target signal
US20160260426A1 (en)* | 2015-03-02 | 2016-09-08 | Electronics And Telecommunications Research Institute | Speech recognition apparatus and method
CN108648769A (en)* | 2018-04-20 | 2018-10-12 | 百度在线网络技术(北京)有限公司 | Voice activity detection method, apparatus and equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
周学广 et al.: 《信息内容安全》 (Information Content Security), 30 November 2012 *
宋知用: 《MATLAB语音信号分析与合成 第2版》 (MATLAB Speech Signal Analysis and Synthesis, 2nd Edition), 30 January 2018 *
陈奇川 et al.: "基于GMM的声音活动检测方法" (A GMM-based voice activity detection method), 《计算机应用与软件》 (Computer Applications and Software) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110189747A (en)* | 2019-05-29 | 2019-08-30 | 大众问问(北京)信息科技有限公司 | Voice signal recognition methods, device and equipment
CN112116927A (en)* | 2019-06-21 | 2020-12-22 | 罗伯特·博世有限公司 | Real-time detection of speech activity in an audio signal
CN111477243A (en)* | 2020-04-16 | 2020-07-31 | 维沃移动通信有限公司 | Audio signal processing method and electronic device
CN112562735A (en)* | 2020-11-27 | 2021-03-26 | 锐迪科微电子(上海)有限公司 | Voice detection method, device, equipment and storage medium
CN112562735B (en)* | 2020-11-27 | 2023-03-24 | 锐迪科微电子(上海)有限公司 | Voice detection method, device, equipment and storage medium
CN114582331A (en)* | 2020-12-02 | 2022-06-03 | 北京猎户星空科技有限公司 | Voice processing method, model training method and device for voice processing
TWI820529B (en)* | 2020-12-08 | 2023-11-01 | 聯發科技股份有限公司 | A signal processing method and a speaker thereof
US11811686B2 (en) | 2020-12-08 | 2023-11-07 | Mediatek Inc. | Packet reordering method of sound bar
CN115831132A (en)* | 2021-09-17 | 2023-03-21 | 腾讯科技(深圳)有限公司 | Audio encoding and decoding method, device, medium and electronic equipment
CN115273913A (en)* | 2022-07-27 | 2022-11-01 | 歌尔科技有限公司 | Voice endpoint detection method, device, equipment and computer readable storage medium

Similar Documents

Publication | Publication Date | Title
CN109754823A (en) | A kind of voice activity detection method, mobile terminal
CN111477243B (en) | Audio signal processing method and electronic equipment
CN107799125A (en) | A kind of audio recognition method, mobile terminal and computer-readable recording medium
CN108511002B (en) | Method for recognizing sound signal of dangerous event, terminal and computer readable storage medium
CN109982228B (en) | Microphone fault detection method and mobile terminal
CN109065060B (en) | Voice awakening method and terminal
CN109951602B (en) | Vibration control method and mobile terminal
CN108848267B (en) | Audio playback method and mobile terminal
CN108196815B (en) | Method for adjusting call sound and mobile terminal
CN110012143B (en) | A receiver control method and terminal
CN108521501B (en) | Voice input method, mobile terminal and computer readable storage medium
CN109814799A (en) | Screen response control method and terminal device
CN108089801A (en) | A kind of method for information display and mobile terminal
CN109994111A (en) | An interaction method, device and mobile terminal
CN110995921A (en) | Call processing method, electronic device and computer readable storage medium
CN107705804A (en) | A kind of audible device condition detection method and mobile terminal
CN108174012A (en) | A permission control method and mobile terminal
CN108594952B (en) | Method and terminal device for anti-charging interference
CN110764650A (en) | Button trigger detection method and electronic device
CN108305638B (en) | A signal processing method, signal processing device and terminal equipment
WO2019101127A1 (en) | Biological identification module processing method and apparatus, and mobile terminal
CN109639738A (en) | Voice data transmission method and terminal device
CN108650392A (en) | A kind of call recording method and mobile terminal
CN110472520B (en) | A kind of identification method and mobile terminal
CN108108608B (en) | Control method of mobile terminal and mobile terminal

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication | Application publication date: 2019-05-14
