CN111508524A - Method and system for identifying voice source device - Google Patents

Method and system for identifying voice source device

Info

Publication number
CN111508524A
Authority
CN
China
Prior art keywords
tcn
lmfb
feature
speech
res
Prior art date
Legal status
Granted
Application number
CN202010148882.1A
Other languages
Chinese (zh)
Other versions
CN111508524B (en)
Inventor
苏兆品
吴张倩
张国富
岳峰
武钦芳
沈朝勇
肖锐
Current Assignee
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN202010148882.1A
Publication of CN111508524A
Application granted
Publication of CN111508524B
Status: Active
Anticipated expiration


Abstract

The invention provides a method and system for identifying a voice source device, relating to the technical field of voice information processing. The method obtains a voice database containing natural noise; extracts the LMFB features of the voice samples in the database; learns deep voice features based on an improved TCN network and optimizes them with LDA; and finally trains and tests an SVM classifier on the deep voice feature LMFB-TCN-LDA to obtain a voice source device recognition model. Because the SVM classifier is trained and tested with the deep voice features LMFB-TCN-LDA of voice samples containing natural noise, the resulting model can accurately identify the source device of voice containing natural noise. At the same time, deep voice feature learning is performed on the LMFB features based on the improved TCN network and LDA, so that the extracted LMFB-TCN-LDA features better reflect the characteristics of the device itself, further improving the recognition accuracy of the voice source device recognition model.

Description

Method and system for identifying a voice source device

Technical Field

The present invention relates to the technical field of voice information processing, and in particular to a method and system for identifying a voice source device.

Background

With the development of network technology, smart devices have gained more functions and practicality and have become an indispensable part of people's daily lives. More and more people like to use smartphones and social networking software to record the scenes and sounds of daily activities. Voice is one of the most common communication methods in social networking software such as WeChat, and mobile phone source identification based on voice signals has become a hot topic in the field of multimedia forensics. It has important practical significance for verifying the authenticity and originality of audio sources and has in recent years received great attention from public security and judicial departments.

The framework of existing methods for identifying voice source devices usually consists of two steps: training and recognition. First, traditional key speech features (such as MFCC) of different mobile phone models are extracted from the training set; these key features are then used for training and classification to create source templates for the different phones; finally, the key speech features extracted from the test set are matched against the phone source template library to identify the specific phone model.

However, the inventors of the present application have found that existing identification methods achieve good results on ideal databases, but when the audio contains natural noise the identification results are significantly affected and their accuracy is low. That is, existing voice source device identification methods have low accuracy when identifying audio containing natural noise.

Summary of the Invention

(1) Technical Problem Solved

In view of the deficiencies of the prior art, the present invention provides a method and system for identifying a voice source device, which solve the technical problem that existing identification methods have low accuracy when identifying audio containing natural noise.

(2) Technical Solution

To achieve the above purpose, the present invention adopts the following technical solutions:

The present invention provides a method for identifying a voice source device. The method is executed by a computer and includes:

obtaining a voice database containing natural noise;

extracting the LMFB features of the voice samples in the voice database;

learning LMFB-TCN features based on an improved TCN network and the LMFB features of the voice samples;

optimizing the LMFB-TCN features based on the LDA technique to obtain the deep voice feature LMFB-TCN-LDA;

training and testing an SVM classifier based on the deep voice feature LMFB-TCN-LDA to obtain a voice source device recognition model, which is used to identify the brand and model of the voice source device.
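
Taken together, the five steps describe a straightforward train/test pipeline. The following Python sketch shows one plausible wiring of the steps; extract_lmfb, build_improved_tcn, extract_tcn_features and lda_project are hypothetical helpers (sketched later in this description), and the optimizer, epochs and batch size are assumptions, not details given in the patent.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def run_pipeline(train_wavs, y_train, test_wavs, y_test):
    # S2: LMFB features of each 3 s voice sample
    X_train = np.stack([extract_lmfb(p) for p in train_wavs])
    X_test = np.stack([extract_lmfb(p) for p in test_wavs])
    # S3: deep LMFB-TCN features from the improved TCN network
    tcn = build_improved_tcn(X_train.shape[1:], n_classes=47)
    tcn.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    tcn.fit(X_train, y_train, epochs=30, batch_size=64)
    F_train = extract_tcn_features(tcn, X_train)   # flattened to 6016 dims
    F_test = extract_tcn_features(tcn, X_test)
    # S4: LDA maps the high-dimensional features to n = 46 dimensions
    Z_train, W = lda_project(F_train, y_train, n_components=46)
    Z_test = F_test @ W
    # S5: train and test the SVM classifier
    clf = SVC().fit(Z_train, y_train)
    return accuracy_score(y_test, clf.predict(Z_test))
```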

Preferably, obtaining the voice database containing natural noise includes:

S101. Acquire voice data containing natural noise;

S102. Cut the voice data into voice samples;

S103. Divide the voice samples into a training set and a test set; the training set and the test set constitute the voice database.

Preferably, obtaining the LMFB-TCN features based on the improved TCN network and the LMFB features of the voice samples includes:

S301. The LMFB features are used as the input of the TCN network. For T frames of LMFB features, x_t ∈ R^D is the feature extracted from the t-th frame of speech, where D is the feature dimension of each frame; the input X is the concatenation of all frame features, i.e. X ∈ R^{T×D}. The input features are filtered by a one-dimensional convolution, calculated as follows:

Y_1 = σ_1(W_1 * X_0)   (1)

In formula (1):

X_0 is the initial input feature of the network;

W_1 is the parameter learned by the first network layer;

σ_1 is the nonlinear activation function Tanh;

S302. The output of step S301 passes through the residual module of the TCN network. The deep network of the residual module is decomposed into several residual learning units (Res_unit), each containing 128 convolution kernels. The residual module uses dilated convolutions throughout, with the dilation rate d increasing exponentially over successive Res_units, i.e. d = 2^n, n = 0, 1, 2, 3, 4. In the TCN, the output of each Res_unit is merged by adding it to the input of the next Res_unit. Let Y_l denote the output of the l-th Res_unit; then:

Y_l = Y_{l-1} + F(W_l, Y_{l-1})   (2)

In formula (2):

W_l is the parameter learned by the l-th Res_unit, and F is the nonlinear transformation applied within the Res_unit;

Within each Res_unit, the input signal is convolved and then transformed by the Sigmoid and Tanh activation functions respectively; the results are multiplied, passed through another one-dimensional convolution and Tanh activation, and output. The calculation is expressed as follows:

Y_l = Y_{l-1} + σ_1(W_l^{(2)} * (σ_1(W_l^{(1)} * Y_{l-1}) ⊙ σ_2(W_l^{(1)} * Y_{l-1})))   (3)

In formula (3):

σ_1 is the nonlinear activation function Tanh;

σ_2 is the nonlinear activation function Sigmoid;

W_l^{(1)} and W_l^{(2)} are the parameters of the first and second convolution layers in the l-th Res_unit respectively, with W_l = {W_l^{(1)}, W_l^{(2)}}.

S303. After learning through N Res_units, the different outputs are accumulated; after the residual module and a Relu nonlinear transformation, Y_N is obtained, calculated as follows:

Y_N = σ_3(Σ_{l=2}^{N} Y_l)   (4)

In formula (4):

σ_3 is the nonlinear activation function Relu;

the output of the first Res_unit is Y_2, and the outputs of all subsequent Res_units are accumulated in the TCN;

Two more convolutional layers are added after the residual module; the specific calculations are given in formulas (5) and (6):

Y_{N+1} = σ_3(W_{N+1} * Y_N)   (5)

Y_{N+2} = W_{N+2} * Y_{N+1}   (6)

In formulas (5) and (6):

W_{N+1} is the parameter learned by the (N+1)-th layer;

W_{N+2} is the parameter learned by the (N+2)-th layer;

S304. The output Y_{N+2} of step S303 is globally pooled and then passed through the softmax layer of the TCN network, calculated as follows:

softmax(Y_{N+3})_i = e^{(Y_{N+3})_i} / Σ_j e^{(Y_{N+3})_j}   (7)

In formula (7):

Y_{N+3} = GlobalMaxPooling1d(Y_{N+2})   (8)

Through the learning of the improved TCN network and the processing of the data by the different network layers, Y_{N+2} is finally taken as the LMFB-TCN feature, where Y_{N+2} ∈ R^{128×147}. In order to map the high-dimensional redundant features to low-dimensional effective features while removing redundant information, the LMFB-TCN feature is reshaped into a one-dimensional Y_{N+2} ∈ R^{6016}.

Preferably, the process of extracting the deep voice feature LMFB-TCN-LDA based on the LDA technique and the LMFB-TCN features mainly includes:

S401. Compute the mean vectors μ_i of the 6016-dimensional LMFB-TCN features of each class, and the mean vector μ of all samples;

S402. Construct the between-class scatter matrix S_B and the within-class scatter matrix S_W:

S_B = Σ_{i=1}^{N} m_i (μ_i − μ)(μ_i − μ)^T   (9)

S_W = Σ_{i=1}^{N} Σ_{x∈C_i} (x − μ_i)(x − μ_i)^T   (10)

In formulas (9) and (10): m_i is the number of samples of the i-th class; the full feature sample set is X = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, where y_i ∈ {C_1, C_2, ..., C_N}, C_i is a category, N is the number of categories, and any sample x_i ∈ R^{6016};

S403. Compute the matrix S_W^{-1} S_B;

S404. Perform singular value decomposition on S_W^{-1} S_B to obtain the singular values λ_i and their corresponding eigenvectors w_i, i = 1, 2, ..., N;

S405. Take the eigenvectors corresponding to the k largest singular values to form the projection matrix W, where k is the dimension of the output features, at most the number of feature categories minus 1; set k to n;

S406. Compute the projection z_i = W^T x_i of each sample x_i onto the new low-dimensional space;

S407. Obtain the output sample set of the deep voice feature LMFB-TCN-LDA, X' = {(z_1, y_1), (z_2, y_2), ..., (z_m, y_m)}, where any sample z_i ∈ R^n is an n-dimensional deep voice feature LMFB-TCN-LDA.

Preferably, training and testing the SVM classifier based on the deep voice feature LMFB-TCN-LDA to obtain the voice source device recognition model includes:

training the SVM classifier with the deep voice features LMFB-TCN-LDA extracted from the training set of the voice database, and testing it with the deep voice features LMFB-TCN-LDA extracted from the test set, to obtain the voice source device recognition model.

An embodiment of the present invention further provides a system for identifying a voice source device. The system includes a computer, and the computer includes:

at least one storage unit;

at least one processing unit;

wherein at least one instruction is stored in the at least one storage unit, and the at least one instruction is loaded and executed by the at least one processing unit to implement the following steps:

obtaining a voice database containing natural noise;

extracting the LMFB features of the voice samples in the voice database;

learning LMFB-TCN features based on an improved TCN network and the LMFB features of the voice samples;

optimizing the LMFB-TCN features based on the LDA technique to obtain the deep voice feature LMFB-TCN-LDA;

training and testing an SVM classifier based on the deep voice feature LMFB-TCN-LDA to obtain a voice source device recognition model, which is used to identify the brand and model of the voice source device.

Preferably, obtaining the voice database containing natural noise includes:

S101. Acquire voice data containing natural noise;

S102. Cut the voice data into voice samples;

S103. Divide the voice samples into a training set and a test set; the training set and the test set constitute the voice database.

Preferably, obtaining the LMFB-TCN features based on the improved TCN network and the LMFB features of the voice samples includes:

S301. The LMFB features are used as the input of the TCN network. For T frames of LMFB features, x_t ∈ R^D is the feature extracted from the t-th frame of speech, where D is the feature dimension of each frame; the input X is the concatenation of all frame features, i.e. X ∈ R^{T×D}. The input features are filtered by a one-dimensional convolution, calculated as follows:

Y_1 = σ_1(W_1 * X_0)   (1)

In formula (1):

X_0 is the initial input feature of the network;

W_1 is the parameter learned by the first network layer;

σ_1 is the nonlinear activation function Tanh;

S302. The output of step S301 passes through the residual module of the TCN network. The deep network of the residual module is decomposed into several residual learning units (Res_unit), each containing 128 convolution kernels. The residual module uses dilated convolutions throughout, with the dilation rate d increasing exponentially over successive Res_units, i.e. d = 2^n, n = 0, 1, 2, 3, 4. In the TCN, the output of each Res_unit is merged by adding it to the input of the next Res_unit. Let Y_l denote the output of the l-th Res_unit; then:

Y_l = Y_{l-1} + F(W_l, Y_{l-1})   (2)

In formula (2):

W_l is the parameter learned by the l-th Res_unit, and F is the nonlinear transformation applied within the Res_unit;

Within each Res_unit, the input signal is convolved and then transformed by the Sigmoid and Tanh activation functions respectively; the results are multiplied, passed through another one-dimensional convolution and Tanh activation, and output. The calculation is expressed as follows:

Y_l = Y_{l-1} + σ_1(W_l^{(2)} * (σ_1(W_l^{(1)} * Y_{l-1}) ⊙ σ_2(W_l^{(1)} * Y_{l-1})))   (3)

In formula (3):

σ_1 is the nonlinear activation function Tanh;

σ_2 is the nonlinear activation function Sigmoid;

W_l^{(1)} and W_l^{(2)} are the parameters of the first and second convolution layers in the l-th Res_unit respectively, with W_l = {W_l^{(1)}, W_l^{(2)}}.

S303. After learning through N Res_units, the different outputs are accumulated; after the residual module and a Relu nonlinear transformation, Y_N is obtained, calculated as follows:

Y_N = σ_3(Σ_{l=2}^{N} Y_l)   (4)

In formula (4):

σ_3 is the nonlinear activation function Relu;

the output of the first Res_unit is Y_2, and the outputs of all subsequent Res_units are accumulated in the TCN;

Two more convolutional layers are added after the residual module; the specific calculations are given in formulas (5) and (6):

Y_{N+1} = σ_3(W_{N+1} * Y_N)   (5)

Y_{N+2} = W_{N+2} * Y_{N+1}   (6)

In formulas (5) and (6):

W_{N+1} is the parameter learned by the (N+1)-th layer;

W_{N+2} is the parameter learned by the (N+2)-th layer;

S304. The output Y_{N+2} of step S303 is globally pooled and then passed through the softmax layer of the TCN network, calculated as follows:

softmax(Y_{N+3})_i = e^{(Y_{N+3})_i} / Σ_j e^{(Y_{N+3})_j}   (7)

In formula (7):

Y_{N+3} = GlobalMaxPooling1d(Y_{N+2})   (8)

Through the learning of the improved TCN network and the processing of the data by the different network layers, Y_{N+2} is finally taken as the LMFB-TCN feature, where Y_{N+2} ∈ R^{128×147}. In order to map the high-dimensional redundant features to low-dimensional effective features and remove redundant information, the LMFB-TCN feature is reshaped into a one-dimensional Y_{N+2} ∈ R^{6016}.

(3) Beneficial Effects

The present invention provides a method and system for identifying a voice source device. Compared with the prior art, it has the following beneficial effects:

The present invention obtains a voice database containing natural noise; extracts the LMFB features of the voice samples in the database; obtains the deep voice feature LMFB-TCN based on an improved TCN network and the LMFB features of the voice samples; maps the high-dimensional LMFB-TCN features to low-dimensional effective features based on the LDA technique to obtain LMFB-TCN-LDA; and trains and tests an SVM classifier based on the deep voice feature LMFB-TCN-LDA to obtain a voice source device recognition model for identifying the brand and model of the voice source device. Because the SVM classifier is trained and tested with the deep voice features LMFB-TCN-LDA of voice samples containing natural noise, the resulting model can accurately identify the source device of voice containing natural noise. At the same time, the present invention performs deep voice feature learning and optimization on the LMFB features based on the improved TCN network and LDA, so that the extracted LMFB-TCN-LDA features better reflect the characteristics of the device itself, further improving the recognition accuracy of the voice source device recognition model.

Brief Description of the Drawings

In order to explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the accompanying drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.

Fig. 1 is a block diagram of a method for identifying a voice source device according to an embodiment of the present invention;

Fig. 2 is a framework diagram of the improved TCN network in an embodiment of the present invention, comprising Fig. 2(a), Fig. 2(b) and Fig. 2(c);

Fig. 3 shows the average recognition rates of the different features in the verification experiment;

Fig. 4 shows the recall of the different features for the different device IDs in the verification experiment;

Fig. 5 shows the precision of the different features for the different device IDs in the verification experiment;

Fig. 6 shows the f1-score of the different features for the different device IDs in the verification experiment;

Fig. 7 shows the results of training the model with the different features on datasets of different sizes in the verification experiment.

Detailed Description of the Embodiments

To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on these embodiments, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

The embodiments of the present application provide a method for identifying a voice source device, which solves the technical problem that existing identification methods have low accuracy when identifying audio containing natural noise, and improves the recognition accuracy of the voice source device recognition model.

The general idea of the technical solutions in the embodiments of the present application for solving the above technical problem is as follows:

In the embodiments of the present invention, an SVM classifier is trained and tested with the deep voice features LMFB-TCN-LDA of voice samples containing natural noise, so that the resulting voice source device recognition model can accurately identify the source device of voice containing natural noise.

For a better understanding of the above technical solutions, they are described in detail below with reference to the accompanying drawings and specific embodiments.

An embodiment of the present invention provides a method for identifying a voice source device. As shown in Fig. 1, the method is executed by a computer and includes steps S1 to S5:

S1. Obtain a voice database containing natural noise;

S2. Extract the LMFB features of the voice samples in the voice database;

S3. Obtain the deep voice feature LMFB-TCN based on the improved TCN network and the LMFB features of the voice samples;

S4. Optimize the LMFB-TCN features based on the LDA technique to obtain the deep voice feature LMFB-TCN-LDA;

S5. Train and test the SVM classifier based on the deep voice feature LMFB-TCN-LDA to obtain a voice source device recognition model, which is used to identify the brand and model of the voice source device.

In the embodiments of the present invention, the SVM classifier is trained and tested with the deep voice features LMFB-TCN-LDA of voice samples containing natural noise, so that the resulting voice source device recognition model can accurately identify the source device of voice containing natural noise. At the same time, deep voice feature learning is performed on the LMFB features based on the improved TCN network, and low-dimensional effective features are extracted with the LDA technique, so that the extracted LMFB-TCN-LDA features better reflect the characteristics of the device itself, further improving the recognition accuracy of the model and providing important data support for subsequently verifying the authenticity and originality of audio sources.

Each step is described in detail below.

In step S1, a voice database containing natural noise is obtained. Specifically:

S101. Acquire voice data containing natural noise. In this embodiment, voice signals from 47 models of mobile phones covering ten common brands are acquired. The scenes with natural noise mainly include everyday conversations, film dialogue, broadcast dialogue and the like. The voice format is MP3. The brands and models of the phones are shown in Table 1.

Table 1. Brands and models of the mobile phones

[Table 1 is rendered as an image in the original; its 47 brand/model entries are not recoverable from the text.]

S102. Cut the voice data into voice samples. In this embodiment, each phone voice signal is cut into 3 s segments, i.e. into 3 s voice samples.

S103. Divide the voice samples into a training set and a test set; together they constitute the voice database. In this embodiment, each phone model ends up with 700 voice samples, of which 600 are used for training and 100 for testing. The voice samples of the 47 phone models constitute the training set, the test set and the voice database, which contains 32,900 voice samples in total.

In step S2, the LMFB features of the voice samples in the voice database are extracted. Specifically:

S201. Framing: every N sampling points of a voice sample are grouped into one observation unit, called a frame. In this embodiment, N is set to 2048. To avoid excessive change between two adjacent frames, an overlap region of M sampling points is kept between them; in this embodiment, M is set to 512.

S202. Windowing: each frame is multiplied by a Hamming window to increase the continuity between the left and right ends of the frame.

S203. Fourier transform: a fast Fourier transform is applied to each framed and windowed signal to obtain the spectrum of each frame, and the squared magnitude of the spectrum gives the energy spectrum of the voice signal.

S204. Mel filtering: the energy spectrum is passed through a set of Mel-scale triangular filter banks to obtain the Mel subband spectrum of each frame.

S205. Logarithm: a logarithmic function is applied to the Mel subband spectrum as a nonlinear transformation, yielding the log spectrum of the voice sample, i.e. its LMFB feature.

Compared with the MFCC feature, the LMFB extraction pipeline omits the discrete cosine transform and therefore retains more useful voice information, providing a better starting point for the further processing by the TCN network, as illustrated in the sketch below.
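
A minimal Python sketch of steps S201-S205 follows, assuming librosa is available; the sample rate is an assumption (the source audio is MP3 and the text does not state it), and n_mels = 44 matches the per-frame dimension D = 44 used in step S301 below.

```python
import numpy as np
import librosa

def extract_lmfb(path, sr=44100, n_fft=2048, overlap=512, n_mels=44):
    """LMFB features: framing, Hamming window, FFT energy spectrum,
    Mel filter bank, log compression (no DCT, unlike MFCC)."""
    y, _ = librosa.load(path, sr=sr)
    hop = n_fft - overlap  # S201: frames of N=2048 samples, M=512 overlap
    # S202/S203: Hamming window, FFT, squared magnitude -> energy spectrum
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop,
                            window="hamming")) ** 2
    # S204: Mel-scale triangular filter bank
    mel = librosa.feature.melspectrogram(S=S, sr=sr, n_mels=n_mels)
    # S205: log nonlinearity gives the LMFB feature, shape (T frames, D)
    return np.log(mel + 1e-10).T
```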

In step S3, LMFB-TCN features are obtained based on the improved TCN network and the LMFB features of the voice samples. Specifically:

In this embodiment, the framework of the improved TCN network is shown in Fig. 2.

S301. The overall structure of the TCN network is shown in Fig. 2(a). The LMFB features are used as the input of the TCN network. For T frames of LMFB features, x_t ∈ R^D is the feature extracted from the t-th frame of speech, where D is the feature dimension of each frame (D = 44 in this embodiment); the input X is the concatenation of all frame features, i.e. X ∈ R^{T×D}. The input features are filtered by a one-dimensional convolution, calculated as follows:

Y_1 = σ_1(W_1 * X_0)   (1)

In formula (1):

X_0 is the initial input feature of the network;

W_1 is the parameter learned by the first network layer;

σ_1 is the nonlinear activation function Tanh;

S302. The output of step S301 passes through the residual module of the TCN network; the structure of the residual module is shown in Fig. 2(b). The deep network of the residual module is decomposed into several residual learning units (Res_unit), each containing 128 convolution kernels. The residual module uses dilated convolutions throughout; the most critical parameter, the dilation rate d, increases exponentially over successive Res_units, i.e. d = 2^n, n = 0, 1, 2, 3, 4, which greatly enlarges the receptive field without significantly increasing the number of parameters. In the TCN, the output of each Res_unit is simply merged by adding it to the input of the next Res_unit. Let Y_l denote the output of the l-th Res_unit; then:

Y_l = Y_{l-1} + F(W_l, Y_{l-1})   (2)

In formula (2):

W_l is the parameter learned by the l-th Res_unit, and F is the nonlinear transformation applied within the Res_unit;

The specific structure of each Res_unit is shown in Fig. 2(c). Unlike an ordinary network connection, in each residual learning unit Res_unit the input signal is convolved and then transformed by the Sigmoid and Tanh activation functions respectively; the results are multiplied, passed through another one-dimensional convolution and Tanh activation, and output. The calculation is expressed as follows:

Y_l = Y_{l-1} + σ_1(W_l^{(2)} * (σ_1(W_l^{(1)} * Y_{l-1}) ⊙ σ_2(W_l^{(1)} * Y_{l-1})))   (3)

In formula (3):

σ_1 is the nonlinear activation function Tanh;

σ_2 is the nonlinear activation function Sigmoid;

W_l^{(1)} and W_l^{(2)} are the parameters of the first and second convolution layers in the l-th Res_unit respectively, with W_l = {W_l^{(1)}, W_l^{(2)}}.
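
A minimal Keras sketch of one such gated residual learning unit is given below, reading Eqs. (2)-(3) as one shared dilated convolution followed by multiplied Tanh/Sigmoid gates, a second convolution with Tanh, and a skip connection; the kernel size and causal padding are assumptions not fixed by the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

def res_unit(x, dilation, filters=128, kernel_size=3):
    """One gated Res_unit, Eqs. (2)-(3): conv W_l^(1) -> Tanh and Sigmoid
    gates multiplied -> conv W_l^(2) -> Tanh, merged with the skip path."""
    h = layers.Conv1D(filters, kernel_size, padding="causal",
                      dilation_rate=dilation)(x)               # W_l^(1)
    gated = layers.multiply([layers.Activation("tanh")(h),
                             layers.Activation("sigmoid")(h)])
    out = layers.Conv1D(filters, kernel_size, padding="causal")(gated)  # W_l^(2)
    out = layers.Activation("tanh")(out)
    # Y_l = Y_{l-1} + F(W_l, Y_{l-1}); also return the branch output,
    # which Eq. (4) accumulates across units
    return layers.add([x, out]), out
```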

S303. After learning through N Res_units, the different outputs are accumulated; after the residual module and a Relu nonlinear transformation, Y_N is obtained, calculated as follows:

Y_N = σ_3(Σ_{l=2}^{N} Y_l)   (4)

In formula (4):

σ_3 is the nonlinear activation function Relu;

the output of the first Res_unit is Y_2, and the outputs of all subsequent Res_units are accumulated in the TCN. The network thereby learns the distinguishing voice features of different voice signals. As shown in Fig. 2(a), two more convolutional layers are added after the residual module; the specific calculations are given in formulas (5) and (6):

Y_{N+1} = σ_3(W_{N+1} * Y_N)   (5)

Y_{N+2} = W_{N+2} * Y_{N+1}   (6)

In formulas (5) and (6):

W_{N+1} is the parameter learned by the (N+1)-th layer;

W_{N+2} is the parameter learned by the (N+2)-th layer;

S304. The output Y_{N+2} of step S303 is globally pooled and then passed through the softmax layer of the TCN network, calculated as follows:

softmax(Y_{N+3})_i = e^{(Y_{N+3})_i} / Σ_j e^{(Y_{N+3})_j}   (7)

In formula (7):

Y_{N+3} = GlobalMaxPooling1d(Y_{N+2})   (8)

In addition, in this embodiment the BatchNorm algorithm, which accelerates neural network training, is used repeatedly throughout the network to improve convergence speed and stability. After the learning of the TCN network and the processing of the data by the different network layers, Y_{N+2} is finally taken as the LMFB-TCN feature of this embodiment, where Y_{N+2} ∈ R^{128×147}; for low-dimensional effective feature extraction, the feature is reshaped into a one-dimensional Y_{N+2} ∈ R^{6016}.
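
Putting the pieces together, the following sketch (reusing the imports and res_unit above) assembles one reading of the improved TCN of Fig. 2(a). The five dilation stages, 128 kernels, Relu and softmax heads and BatchNorm follow the text; the kernel sizes, the exact placement of BatchNorm, and the dense softmax output layer are assumptions.

```python
def build_improved_tcn(input_shape, n_classes=47, n_units=5, filters=128):
    """Improved TCN, Fig. 2(a): input conv with Tanh (Eq. 1), five
    Res_units with d = 2^n, accumulated outputs + Relu (Eq. 4), two extra
    convs (Eqs. 5-6), global max pooling (Eq. 8) and softmax (Eq. 7)."""
    inp = layers.Input(shape=input_shape)                    # (T, D=44)
    x = layers.Conv1D(filters, 3, padding="causal", activation="tanh")(inp)
    x = layers.BatchNormalization()(x)                       # training stability
    branches = []
    for n in range(n_units):                                 # d = 1,2,4,8,16
        x, branch = res_unit(x, dilation=2 ** n, filters=filters)
        branches.append(branch)
    y = layers.Activation("relu")(layers.add(branches))      # Eq. (4)
    y = layers.Conv1D(filters, 3, padding="causal", activation="relu")(y)   # Eq. (5)
    feat = layers.Conv1D(filters, 3, padding="causal", name="lmfb_tcn")(y)  # Eq. (6): Y_{N+2}
    pooled = layers.GlobalMaxPooling1D()(feat)               # Eq. (8)
    out = layers.Dense(n_classes, activation="softmax")(pooled)  # Eq. (7)
    return tf.keras.Model(inp, out)

def extract_tcn_features(model, X):
    # take Y_{N+2} from the "lmfb_tcn" layer and flatten it to one dimension
    sub = tf.keras.Model(model.input, model.get_layer("lmfb_tcn").output)
    return sub.predict(X).reshape(len(X), -1)
```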

In step S4, the LMFB-TCN features are optimized based on the LDA technique to obtain the deep voice feature LMFB-TCN-LDA. Specifically:

S401. Compute the mean vectors μ_i of the 6016-dimensional LMFB-TCN features of each class, and the mean vector μ of all samples;

S402. Construct the between-class scatter matrix S_B and the within-class scatter matrix S_W:

S_B = Σ_{i=1}^{N} m_i (μ_i − μ)(μ_i − μ)^T   (9)

S_W = Σ_{i=1}^{N} Σ_{x∈C_i} (x − μ_i)(x − μ_i)^T   (10)

In formulas (9) and (10): m_i is the number of samples of the i-th class; the full feature sample set is X = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, where y_i ∈ {C_1, C_2, ..., C_N}, C_i denotes a phone category, N is the number of categories, and any sample x_i ∈ R^{6016} is a 6016-dimensional LMFB-TCN feature vector;

S403. Compute the matrix S_W^{-1} S_B;

S404. Perform singular value decomposition on S_W^{-1} S_B to obtain the singular values λ_i and their corresponding eigenvectors w_i, i = 1, 2, ..., N;

S405. Take the eigenvectors corresponding to the k largest singular values to form the projection matrix W, where k is the dimension of the output features, at most the number of feature categories minus 1; k is set to n, and in this embodiment n is 46;

S406. Compute the projection z_i = W^T x_i of each sample x_i onto the new low-dimensional space;

S407. Obtain the output sample set of the deep voice feature LMFB-TCN-LDA, X' = {(z_1, y_1), (z_2, y_2), ..., (z_m, y_m)}, where any sample z_i ∈ R^46.
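
Steps S401-S407 amount to standard linear discriminant analysis. The NumPy sketch below follows the steps literally; using the pseudo-inverse for numerical safety is an assumption the text does not address.

```python
import numpy as np

def lda_project(X, y, n_components=46):
    """S401-S407: scatter matrices (Eqs. 9-10), SVD of S_W^{-1} S_B,
    projection z_i = W^T x_i onto the top-k discriminant directions."""
    dim = X.shape[1]
    mu = X.mean(axis=0)                                   # S401: global mean
    S_B = np.zeros((dim, dim))
    S_W = np.zeros((dim, dim))
    for c in np.unique(y):                                # S402
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)                            # S401: class means
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # Eq. (9)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)                # Eq. (10)
    M = np.linalg.pinv(S_W) @ S_B                         # S403
    U, s, Vt = np.linalg.svd(M)                           # S404
    W = U[:, :n_components]                               # S405: top-k directions
    return X @ W, W                                       # S406/S407: z_i = W^T x_i
```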

In step S5, the SVM classifier is trained and tested based on the deep voice feature LMFB-TCN-LDA to obtain the voice source device recognition model, which is used to identify the brand and model of the voice source device. Specifically:

The SVM classifier is trained with the deep voice features LMFB-TCN-LDA extracted from the training set of the voice database and tested with those extracted from the test set, yielding a voice source device recognition model that identifies the brand and model of the voice source device.
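
A sketch of this last step, reusing lda_project above, might look as follows; fitting the LDA projection on the training set only and the RBF kernel are assumptions, since the text does not specify the SVM configuration.

```python
from sklearn.svm import SVC

def train_source_model(F_train, y_train, F_test):
    # fit the LDA projection on the training features, apply it to both sets
    Z_train, W = lda_project(F_train, y_train, n_components=46)
    Z_test = F_test @ W
    clf = SVC(kernel="rbf")       # kernel choice is an assumption
    clf.fit(Z_train, y_train)     # 600 samples per phone model for training
    return clf, clf.predict(Z_test)  # 100 samples per phone model for testing
```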

To verify the effectiveness of the method provided by this embodiment, the following four common evaluation criteria are used to evaluate its performance: Accuracy, Precision, Recall and F1-score, where TP denotes true positives, FP false positives, TN true negatives and FN false negatives. These criteria are defined in formulas (a), (b), (c) and (d). In general, the higher the values of these four criteria, the better the performance.

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (a)

Precision = TP / (TP + FP)   (b)

Recall = TP / (TP + FN)   (c)

F1-score = 2 × Precision × Recall / (Precision + Recall)   (d)
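
These four criteria can be computed directly with scikit-learn; in the sketch below, macro averaging over the 47 device classes is an assumption about how the per-class scores are aggregated.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def evaluate(y_true, y_pred):
    """Formulas (a)-(d), with per-class scores macro-averaged."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),                     # (a)
        "precision": precision_score(y_true, y_pred, average="macro"),  # (b)
        "recall": recall_score(y_true, y_pred, average="macro"),        # (c)
        "f1-score": f1_score(y_true, y_pred, average="macro"),          # (d)
    }
```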

Experimental results:

For the different evaluation criteria, the prior-art features BED and CQT, the common speech feature MFCC, and the deep voice feature LMFB-TCN-LDA proposed in this embodiment were each fed into an SVM classifier for comparison. The experimental results are shown in Figs. 3 to 6; in the figures, the proposed deep voice feature LMFB-TCN-LDA is labeled "The proposed feature".

Fig. 3 shows the average recognition rates of the different features; the proposed deep voice feature LMFB-TCN-LDA achieves the highest average recognition rate, reaching 99.98%.

Figs. 4, 5 and 6 show the recall, precision and f1-score, respectively, of the different features across the different device models. The red line represents the proposed feature, and it is evident from the figures that the proposed deep voice feature LMFB-TCN-LDA performs better in every respect.

To test the performance of the proposed deep voice feature LMFB-TCN-LDA on training sets of different sizes and compare it with the other features, the model was trained with 100, 200, 400 and 600 samples per device model and then tested; the comparison results are shown in Fig. 7.

As Fig. 7 shows, the performance of BED, CQT and MFCC drops noticeably as the training data shrinks, whereas the performance of the proposed deep voice feature LMFB-TCN-LDA degrades only slightly, which further demonstrates its effectiveness.

A confusion matrix gives a more comprehensive view of a model; the confusion matrix of the proposed deep voice feature LMFB-TCN-LDA is shown in Table 2.

Table 2. Confusion matrix of the proposed deep voice feature LMFB-TCN-LDA

[Table 2 is rendered as an image in the original; its contents are not recoverable from the text.]

As can be seen from Table 2, all devices except ID15 are predicted to their correct IDs with high accuracy.

An embodiment of the present invention further provides a system for identifying a voice source device. The system includes a computer, and the computer includes:

at least one storage unit;

at least one processing unit;

wherein at least one instruction is stored in the at least one storage unit, and the at least one instruction is loaded and executed by the at least one processing unit to implement the following steps:

S1. Obtain a voice database containing natural noise;

S2. Extract the LMFB features of the voice samples in the voice database;

S3. Obtain LMFB-TCN features based on the improved TCN network and the LMFB features of the voice samples;

S4. Optimize the LMFB-TCN features based on the LDA technique to obtain the low-dimensional effective voice feature LMFB-TCN-LDA;

S5. Train and test the SVM classifier based on the deep voice feature LMFB-TCN-LDA to obtain a voice source device recognition model, which is used to identify the brand and model of the voice source device.

It can be understood that the voice source device identification system provided by this embodiment corresponds to the voice source device identification method described above; for explanations, examples and beneficial effects of its content, reference may be made to the corresponding parts of the identification method, which are not repeated here.

In summary, compared with the prior art, the invention has the following beneficial effects:

1. By training and testing the SVM classifier with the deep voice features LMFB-TCN-LDA of voice samples containing natural noise, the resulting voice source device recognition model can accurately identify the source device of voice containing natural noise.

2. The embodiments of the present invention perform deep voice feature learning on the LMFB features based on the improved TCN network and LDA, so that the extracted LMFB-TCN-LDA features better reflect the characteristics of the device itself, further improving the recognition accuracy of the voice source device recognition model and providing important data support for subsequently verifying the authenticity and originality of audio sources.

It should be noted that, from the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the above technical solutions, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the various embodiments or in certain parts of the embodiments.

In this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article or device that includes the element.

The above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method for identifying a voice source device, characterized in that the method is executed by a computer and comprises:
obtaining a voice database containing natural noise;
extracting the LMFB features of the voice samples in the voice database;
learning LMFB-TCN features based on an improved TCN network and the LMFB features of the voice samples;
optimizing the LMFB-TCN features based on the LDA technique to obtain the deep voice feature LMFB-TCN-LDA;
training and testing an SVM classifier based on the deep voice feature LMFB-TCN-LDA to obtain a voice source device recognition model, which is used to identify the brand and model of the voice source device.

2. The method for identifying a voice source device according to claim 1, characterized in that obtaining the voice database containing natural noise comprises:
S101. acquiring voice data containing natural noise;
S102. cutting the voice data into voice samples;
S103. dividing the voice samples into a training set and a test set, the training set and the test set constituting the voice database.

3. The method for identifying a voice source device according to claim 1, characterized in that obtaining the LMFB-TCN features based on the improved TCN network and the LMFB features of the voice samples comprises:
S301. using the LMFB features as the input of the TCN network, where for T frames of LMFB features, x_t ∈ R^D is the feature extracted from the t-th frame of speech, D is the feature dimension of each frame, and the input X is the concatenation of all frame features, i.e. X ∈ R^{T×D}; the input features are filtered by a one-dimensional convolution, calculated as follows:

Y_1 = σ_1(W_1 * X_0)   (1)

In formula (1): X_0 is the initial input feature of the network; W_1 is the parameter learned by the first network layer; σ_1 is the nonlinear activation function Tanh;

S302. passing the output of step S301 through the residual module of the TCN network, wherein the deep network of the residual module is decomposed into several residual learning units (Res_unit), each containing 128 convolution kernels; the residual module uses dilated convolutions throughout, with the dilation rate d increasing exponentially over successive Res_units, i.e. d = 2^n, n = 0, 1, 2, 3, 4; in the TCN, the output of each Res_unit is merged by adding it to the input of the next Res_unit; letting Y_l denote the output of the l-th Res_unit:

Y_l = Y_{l-1} + F(W_l, Y_{l-1})   (2)

In formula (2): W_l is the parameter learned by the l-th Res_unit, and F is the nonlinear transformation applied within the Res_unit; within each residual learning unit Res_unit, the input signal is convolved and then transformed by the Sigmoid and Tanh activation functions respectively, the results are multiplied, passed through another one-dimensional convolution and Tanh activation, and output, calculated as follows:
Y_l = Y_{l-1} + σ_1(W_l^{(2)} * (σ_1(W_l^{(1)} * Y_{l-1}) ⊙ σ_2(W_l^{(1)} * Y_{l-1})))   (3)

In formula (3): σ_1 is the nonlinear activation function Tanh; σ_2 is the nonlinear activation function Sigmoid;
W_l^{(1)} and W_l^{(2)} are the parameters of the first and second convolution layers in the l-th Res_unit respectively, with W_l = {W_l^{(1)}, W_l^{(2)}};
S303. after learning through N Res_units, accumulating the different outputs; after the residual module and a Relu nonlinear transformation, Y_N is obtained:

Y_N = σ_3(Σ_{l=2}^{N} Y_l)   (4)

In formula (4): σ_3 is the nonlinear activation function Relu; the output of the first Res_unit is Y_2, and the outputs of all subsequent Res_units are accumulated in the TCN;

two more convolutional layers are added after the residual module, calculated as in formulas (5) and (6):

Y_{N+1} = σ_3(W_{N+1} * Y_N)   (5)

Y_{N+2} = W_{N+2} * Y_{N+1}   (6)

In formulas (5) and (6): W_{N+1} is the parameter learned by the (N+1)-th layer; W_{N+2} is the parameter learned by the (N+2)-th layer;

S304. globally pooling the output Y_{N+2} of step S303 and then passing it through the softmax layer of the TCN network:

softmax(Y_{N+3})_i = e^{(Y_{N+3})_i} / Σ_j e^{(Y_{N+3})_j}   (7)
式(7)中:In formula (7):YN+3=GlobalMaxPooling1d(YN+2) (8)YN+3 =GlobalMaxPooling1d(YN+2 ) (8)经过改进的TCN网络的学习,以及不同网络层对数据的处理,最终取YN+2为的LMFB-TCN特征,其中YN+2∈R128×147,为了将高维冗余特征映射到低维有效特征同时去除冗余信息,将LMFB-TCN特征重塑成一维YN+2∈R6016After the learning of the improved TCN network and the processing of data by different network layers, the LMFB-TCN feature of YN+2 is finally taken, where YN+2 ∈ R128×147 , in order to map the high-dimensional redundant features to The low-dimensional effective features simultaneously remove redundant information and reshape the LMFB-TCN features into one-dimensional YN+2 ∈ R6016 .4.如权利要求3所述的语音来源设备的识别方法,其特征在于,所述基于LDA技术和所述LMFB-TCN特征获取深度语音特征LMFB-TCN-LDA,包括:4. the identification method of the voice source equipment as claimed in claim 3, is characterized in that, described based on LDA technology and described LMFB-TCN characteristic obtains deep speech feature LMFB-TCN-LDA, comprises:S401、计算6016维LMFB-TCN特征的均值向量得到μi,计算所有样本的均值向量μ;S401. Calculate the mean vector of the 6016-dimensional LMFB-TCN feature to obtain μi , and calculate the mean vector μ of all samples;S402、构造类间散布矩阵SB以及类内散布矩阵SW:S402. Construct the inter-class scatter matrixSB and the intra-class scatter matrixSW :
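The layer stack of steps S301–S304 can be summarized in executable form. Below is a minimal sketch written against TensorFlow/Keras; the claims fix the kernel count (128), the dilation schedule d = 2^n for n = 0…4, and the pooling/softmax head, but not the kernel width, input length, or training regime, so kernel_size=3 and all function and variable names here are illustrative assumptions rather than the patented implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def res_unit(x, d):
    # One gated residual learning unit Res_unit, Eqs. (2)-(3):
    # Y_l = Y_{l-1} + Tanh(W_l^2 * (Tanh(W_l^1 * Y_{l-1}) ⊙ Sigmoid(W_l^1 * Y_{l-1})))
    h = layers.Conv1D(128, kernel_size=3, dilation_rate=d, padding="causal")(x)
    gated = layers.multiply([layers.Activation("tanh")(h),
                             layers.Activation("sigmoid")(h)])
    f = layers.Activation("tanh")(layers.Conv1D(128, 1)(gated))
    return layers.add([x, f])

def build_lmfb_tcn(T, D, n_classes):
    inp = tf.keras.Input(shape=(T, D))                             # X ∈ R^{T×D}
    y = layers.Activation("tanh")(layers.Conv1D(128, 1)(inp))      # Eq. (1): Y_1
    unit_outputs = []
    for n in range(5):                                             # d = 2^n, n = 0..4
        y = res_unit(y, 2 ** n)
        unit_outputs.append(y)                                     # Y_2 .. Y_N
    y = layers.Activation("relu")(layers.add(unit_outputs))        # Eq. (4): Y_N
    y = layers.Activation("relu")(layers.Conv1D(128, 1)(y))        # Eq. (5): Y_{N+1}
    feat = layers.Conv1D(128, 1)(y)                                # Eq. (6): Y_{N+2}
    pooled = layers.GlobalMaxPooling1D()(feat)                     # Eq. (8): Y_{N+3}
    probs = layers.Dense(n_classes, activation="softmax")(pooled)  # Eq. (7)
    return tf.keras.Model(inp, probs), tf.keras.Model(inp, feat)
```

The second returned model exposes Y_(N+2), which, flattened, gives the 6016-dimensional LMFB-TCN feature that claim 4 consumes.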
4. The method for identifying a voice source device according to claim 3, wherein obtaining the deep speech features LMFB-TCN-LDA based on LDA and the LMFB-TCN features comprises:

S401, calculating the per-class mean vectors μ_i of the 6016-dimensional LMFB-TCN features, and calculating the mean vector μ of all samples;

S402, constructing the between-class scatter matrix S_B and the within-class scatter matrix S_W:

S_B = Σ_(i=1..N) m_i (μ_i − μ)(μ_i − μ)^T    (9)

S_W = Σ_(i=1..N) Σ_(x∈X_i) (x − μ_i)(x − μ_i)^T    (10)

in formulas (9) and (10): m_i is the number of samples of the i-th class, and X_i is the set of samples of the i-th class; the feature sample set is X = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, with y_i ∈ {C_1, C_2, ..., C_N}, where C_i is a class and N is the number of classes; any sample x_i ∈ R^6016, and X is the entire feature sample set;

S403, calculating the matrix S_W^(-1) S_B;

S404, performing singular value decomposition on S_W^(-1) S_B to obtain the singular values λ_i and their corresponding eigenvectors w_i, i = 1, 2, ..., N;

S405, taking the eigenvectors corresponding to the k largest singular values to form the projection matrix W, where k is the dimension of the output features and is at most the number of feature classes minus 1; k is set to n;

S406, calculating the projection z_i = W^T x_i of each sample x_i of the sample set onto the new low-dimensional space;

S407, obtaining the output sample set of deep speech features LMFB-TCN-LDA, Z = {(z_1, y_1), (z_2, y_2), ..., (z_m, y_m)}, where any sample z_i ∈ R^n is an n-dimensional deep speech feature LMFB-TCN-LDA.
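Steps S401–S407 are standard linear discriminant analysis applied to the flattened LMFB-TCN features. A minimal NumPy sketch follows; the function name is illustrative, and at 6016 dimensions the scatter matrices are large, so sklearn.discriminant_analysis.LinearDiscriminantAnalysis would be the more economical route in practice — this sketch simply mirrors the claimed steps.

```python
import numpy as np

def fit_lda_projection(X, y, k):
    """S401-S405: fit the projection matrix W.
    X: (m, 6016) LMFB-TCN features; y: (m,) class labels; k <= N - 1."""
    classes = np.unique(y)
    mu = X.mean(axis=0)                        # mean vector of all samples (S401)
    d = X.shape[1]
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_i = Xc.mean(axis=0)                 # per-class mean vector μ_i (S401)
        diff = (mu_i - mu)[:, None]
        S_B += len(Xc) * (diff @ diff.T)       # Eq. (9)
        S_W += (Xc - mu_i).T @ (Xc - mu_i)     # Eq. (10)
    M = np.linalg.pinv(S_W) @ S_B              # S403 (pseudo-inverse for stability)
    U, s, Vt = np.linalg.svd(M)                # S404: singular value decomposition
    return U[:, :k]                            # S405: top-k singular vectors form W

# S406-S407: z_i = W^T x_i, applied with the W fitted on the training set, e.g.
# W = fit_lda_projection(X_train, y_train, k); Z_train = X_train @ W; Z_test = X_test @ W
```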
5. The method for identifying a voice source device according to claim 1, wherein training and testing the SVM classifier based on the deep speech features LMFB-TCN-LDA to obtain the voice source device recognition model comprises:

training the SVM classifier with the deep speech features LMFB-TCN-LDA extracted from the training set of the speech database, and testing the SVM classifier with the deep speech features LMFB-TCN-LDA extracted from the test set of the speech database, to obtain the voice source device recognition model.

6. A system for identifying a voice source device, wherein the system comprises a computer, the computer comprising:

at least one storage unit;

at least one processing unit;

wherein the at least one storage unit stores at least one instruction, and the at least one instruction is loaded and executed by the at least one processing unit to implement the following steps:

obtaining a speech database containing natural noise;

extracting the LMFB features of the speech samples in the speech database;

learning LMFB-TCN features based on an improved TCN network and the LMFB features of the speech samples;

optimizing the LMFB-TCN features based on LDA to obtain the deep speech features LMFB-TCN-LDA;

training and testing an SVM classifier based on the deep speech features LMFB-TCN-LDA to obtain a voice source device recognition model, the voice source device recognition model being used to identify the brand and model of the voice source device.

7. The system for identifying a voice source device according to claim 6, wherein obtaining the speech database containing natural noise comprises:

S101, acquiring speech data containing natural noise;

S102, cutting the speech data into speech samples;

S103, dividing the speech samples into a training set and a test set, the training set and the test set constituting the speech database.
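The train/test procedure of claim 5 (and the corresponding step of the system claims) is compact enough to sketch directly. A minimal version assuming scikit-learn; the kernel and hyperparameters are not fixed by the claim and are illustrative:

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def train_device_recognizer(Z_train, y_train, Z_test, y_test):
    """Z_*: LMFB-TCN-LDA features from the training and test sets of the database."""
    clf = SVC(kernel="rbf", C=1.0)                      # SVM classifier
    clf.fit(Z_train, y_train)                           # train on training-set features
    acc = accuracy_score(y_test, clf.predict(Z_test))   # test on test-set features
    return clf, acc   # clf is the voice source device recognition model
```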
8. The system for identifying a voice source device according to claim 6, wherein learning the LMFB-TCN features based on the improved TCN network and the LMFB features of the speech samples comprises:

S301, taking the LMFB features as the input of the TCN network; for the LMFB features of T frames, x_t is the feature extracted from the t-th frame of speech, x_t ∈ R^D, where D is the dimension of each frame's features, and the input X is the concatenation of all frame features, i.e. X ∈ R^(T×D); the input features are filtered by a one-dimensional convolution, calculated as:

Y_1 = σ_1(W_1 * X_0)    (1)

in formula (1): X_0 is the initial input feature of the network; W_1 is the parameter to be learned by the first network layer; σ_1 is the nonlinear activation function Tanh;

S302, the output of step S301 is fed into the residual module of the TCN network; the deep network of the residual module is decomposed into several residual learning units Res_unit, and the number of convolution kernels in each Res_unit is 128; the residual module uses dilated convolutions throughout, where the dilation rate d grows exponentially with base 2 over consecutive Res_units, i.e. d = 2^n, n = 0, 1, 2, 3, 4; in the TCN, the output of each Res_unit is merged by adding it to the input of the next Res_unit; letting Y_l denote the output of the l-th Res_unit:

Y_l = Y_(l-1) + F(W_l, Y_(l-1))    (2)

in formula (2): W_l is the parameter to be learned by the l-th Res_unit, and F is the nonlinear transformation applied inside the Res_unit;

within each Res_unit, the input signal is convolved, the result is transformed by the Sigmoid activation function and by the Tanh activation function respectively, the two results are multiplied element-wise, and the product passes through another one-dimensional convolution and a Tanh activation function before being output, calculated as:

F(W_l, Y_(l-1)) = σ_1(W_l^2 * (σ_1(W_l^1 * Y_(l-1)) ⊙ σ_2(W_l^1 * Y_(l-1))))    (3)

in formula (3): σ_1 is the nonlinear activation function Tanh; σ_2 is the nonlinear activation function Sigmoid; W_l^1 and W_l^2 denote the parameters of the first and second convolution layers in the l-th Res_unit, respectively; ⊙ denotes element-wise multiplication;

S303, after learning through N Res_units, the different outputs are accumulated; after the residual module and a nonlinear transformation by the Relu function, Y_N is obtained, calculated as:

Y_N = σ_3( Σ_(l=2..N) Y_l )    (4)

in formula (4): σ_3 is the nonlinear activation function Relu; the output of the first Res_unit is Y_2, and the TCN accumulates all subsequent Res_units;

after the residual module, two further convolution layers are added, calculated as in formulas (5) and (6):

Y_(N+1) = σ_3(W_(N+1) * Y_N)    (5)

Y_(N+2) = W_(N+2) * Y_(N+1)    (6)

in formulas (5) and (6): W_(N+1) is the parameter to be learned by the (N+1)-th layer; W_(N+2) is the parameter to be learned by the (N+2)-th layer;

S304, the output Y_(N+2) of step S303 passes through global pooling and then through the softmax layer of the TCN network, calculated as:

ŷ = Softmax(Y_(N+3))    (7)

in formula (7):

Y_(N+3) = GlobalMaxPooling1d(Y_(N+2))    (8)

after the learning of the improved TCN network and the processing of the data by the different network layers, Y_(N+2) is finally taken as the LMFB-TCN feature, where Y_(N+2) ∈ R^(128×47); in order to map the high-dimensional redundant features to low-dimensional effective features while removing redundant information, the LMFB-TCN features are reshaped into a one-dimensional Y_(N+2) ∈ R^6016.
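All of the claims start from frame-level LMFB (log Mel filter bank) features, whose extraction parameters the claims leave open. A minimal sketch assuming librosa, with the frame length, hop, and filter count (25 ms, 10 ms, 64 Mel filters) as illustrative assumptions:

```python
import librosa
import numpy as np

def extract_lmfb(wav_path, sr=16000, n_mels=64):
    """Return LMFB features of shape (T, D): T frames, D = n_mels per frame."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr,
        n_fft=int(0.025 * sr),        # 25 ms frames
        hop_length=int(0.010 * sr),   # 10 ms hop
        n_mels=n_mels)
    return np.log(mel + 1e-8).T       # log of the Mel filter-bank energies
```

The resulting (T, D) matrix is the input X ∈ R^(T×D) of step S301.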
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010148882.1A | 2020-03-05 | 2020-03-05 | Method and system for identifying voice source equipment

Status: Active (granted as CN111508524B)

Publications (2)

Publication Number | Publication Date
CN111508524A | 2020-08-07
CN111508524B | 2023-02-21

Family ID: 71863930

Country Status (1)

Country | Link
CN | CN111508524B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
US20190066713A1 (en) * | 2016-06-14 | 2019-02-28 | The Trustees Of Columbia University In The City Of New York | Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
CN109285538A (en) * | 2018-09-19 | 2019-01-29 | 宁波大学 | A mobile phone source identification method based on constant-Q transform domain in additive noise environment
CN109378014A (en) * | 2018-10-22 | 2019-02-22 | 华中师范大学 | A method and system for source identification of mobile devices based on convolutional neural network
CN110277099A (en) * | 2019-06-13 | 2019-09-24 | 北京百度网讯科技有限公司 | Speech-based mouth shape generation method and device

Cited By (9)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
WO2022053900A1 (en) * | 2020-09-09 | 2022-03-17 | International Business Machines Corporation | Speech recognition using data analysis and dilation of interlaced audio input
US11495216B2 (en) | 2020-09-09 | 2022-11-08 | International Business Machines Corporation | Speech recognition using data analysis and dilation of interlaced audio input
US11538464B2 (en) | 2020-09-09 | 2022-12-27 | International Business Machines Corporation | Speech recognition using data analysis and dilation of speech content from separated audio input
GB2615421A (en) * | 2020-09-09 | 2023-08-09 | IBM | Speech recognition using data analysis and dilation of interlaced audio input
GB2615421B (en) * | 2020-09-09 | 2025-05-07 | IBM | Speech recognition using data analysis and dilation of interlaced audio input
WO2022066328A1 (en) * | 2020-09-25 | 2022-03-31 | Intel Corporation | Real-time dynamic noise reduction using convolutional networks
US12062369B2 (en) | 2020-09-25 | 2024-08-13 | Intel Corporation | Real-time dynamic noise reduction using convolutional networks
CN113096672A (en) * | 2021-03-24 | 2021-07-09 | 武汉大学 | Multi-audio object coding and decoding method applied to low code rate
CN113096672B (en) * | 2021-03-24 | 2022-06-14 | 武汉大学 | Multi-audio object coding and decoding method applied to low code rate

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
