CN112346013B - A binaural sound source localization method based on deep learning - Google Patents

A binaural sound source localization method based on deep learning

Info

Publication number
CN112346013B
CN112346013B (application CN202011173630.0A)
Authority
CN
China
Prior art keywords
binaural
sound source
neural network
network
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202011173630.0A
Other languages
Chinese (zh)
Other versions
CN112346013A (en)
Inventor
张雯
郗经纬
杨懿晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202011173630.0A
Publication of CN112346013A
Application granted
Publication of CN112346013B
Legal status: Expired - Fee Related (current)
Anticipated expiration


Abstract

The invention relates to a binaural sound source localization method based on deep learning, which uses a convolutional neural network to process the binaural received signals and a multitask neural network (MNN) to estimate the sound source azimuth and elevation simultaneously. A neural network learns the parameters that determine the target's three-dimensional DOA in different environments (noise and reverberation), and these parameters are used for three-dimensional DOA estimation. The invention can estimate the sound source direction with the same trained model in different environments, avoiding special treatment for specific environments and accurately estimating the target azimuth and elevation under a variety of environmental conditions. The algorithm also has high localization accuracy, exceeding existing binaural sound source localization algorithms in a variety of complex environments. The method effectively mitigates the influence of environmental interference on localization found in traditional methods, has broad application prospects, and can be put into use directly.

Description

A binaural sound source localization method based on deep learning

Technical Field

The invention belongs to the technical fields of human-computer interaction, deep learning, and binaural localization, and relates to a binaural sound source localization method based on deep learning; specifically, a sound source localization method for binaural hearing based on deep neural networks.

Background Art

Binaural sound source localization aims to achieve the same ability as human auditory localization: by simulating the principles of binaural hearing, two acoustic sensors are used to identify the spatial position of a sound source. Compared with many localization systems deployed in audio, radar, and sonar applications, the main advantages of a two-sensor array are its small size, fast response time, and ease of calibration.

Localization cues can be divided into binaural cues and monaural cues. Binaural cues refer to the interaural phase and level differences between the left- and right-ear signals, and are commonly used to determine the lateral direction (left, front, right, i.e., the front half of the horizontal plane). Monaural cues refer to the spectral cues caused by the scattering and diffraction of sound waves around the pinna and the body, and are mainly used for elevation localization and front-back discrimination. The head-related transfer function (HRTF) is defined as the frequency-domain transfer function of the entire process by which a sound signal travels from the source to the two ears in a free-field environment. Binaural localization cues can be extracted from the HRTF.

Traditional binaural sound source localization algorithms include cross-correlation-based methods, which estimate binaural cues from the two microphone signals and estimate the source direction by comparison with a binaural-cue dataset, see: M. Raspaud, H. Viste, and G. Evangelista, "Binaural source localization by joint estimation of ILD and ITD," IEEE Trans. Audio, Speech, Language Process., vol. 18, pp. 68-77, 2010, and R. Parisi, F. Camoes, and A. Uncini, "Cepstrum prefiltering for binaural source localization in reverberant environments," IEEE Signal Processing Letters, vol. 19, pp. 99-102, 2012; model-based algorithms, which apply maximum likelihood estimation to the statistics of a probabilistic model, see: J. Woodruff and D. Wang, "Binaural localization of multiple sources in reverberant and noisy environments," IEEE Trans. Audio, Speech, Language Process., vol. 20, pp. 1503-1512, 2012; and spectral-difference methods, which compare the spectral difference between the received binaural signals and HRTF data to estimate elevation, see: B. R. Hammond and P. J. Jackson, "Robust full-sphere binaural sound source localization using interaural and spectral cues," in ICASSP 2019, pp. 421-425, Brighton, United Kingdom, May 2019.

With the rise of machine learning, neural-network-based methods have been widely used for binaural auditory localization. A convolutional neural network (CNN) can convert the localization problem into a classification problem. Experimental results show that, for simple sound source localization tasks, classification performance comparable to that of humans can be achieved by training a CNN, see: N. Ma, T. May, and G. J. Brown, "Exploiting deep neural networks and head movements for robust binaural localization of multiple sources in reverberant environments," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, no. 12, pp. 2444-2453, 2017, and F. Vesperini, P. Vecchiotti, E. Principi, S. Squartini, and F. Piazza, "Localizing speakers in multiple rooms by using deep neural networks," Computer Speech & Language, vol. 49, pp. 83-106, 2018. End-to-end systems have also been used for binaural SSL, see: P. Vecchiotti, N. Ma, S. Squartini, and G. J. Brown, "End-to-end binaural sound localisation from the raw waveform," in ICASSP 2019, pp. 451-455, Brighton, United Kingdom, May 2019. However, binaural sound source localization accuracy still faces the challenges of noisy and reverberant environments, as well as simultaneous estimation of azimuth and elevation.

Summary of the Invention

Technical Problem to Be Solved

To overcome the shortcomings of the prior art, the present invention proposes a binaural sound source localization method based on deep learning, which simultaneously estimates the azimuth and elevation of a target speech signal on a binaural platform.

Technical Solution

A binaural sound source localization method based on deep learning, characterized in that the receiving platform is a binaural platform composed of two array elements, with the following steps:

Step 1: Divide the azimuth of the received signal into Nθ values {θ1, θ2, ..., θNθ} and the elevation into Nφ values {φ1, φ2, ..., φNφ}. Use both ears to receive target speech from different azimuth and elevation angles:

Yl(t,f) = S(t,f) × Bl(f,Θ) + Nl(t,f)
Yr(t,f) = S(t,f) × Br(f,Θ) + Nr(t,f)

where t and f denote the time and frequency indices; Y(t,f), S(t,f), and N(t,f) denote the signal received in each time-frequency frame, the signal emitted by the sound source, and the superimposed noise, respectively; l and r denote the left and right ears; and Bl(f,Θ) and Br(f,Θ) denote the generated binaural room transfer functions.

Preprocess the received signals: extract the magnitude spectra El(t,f), Er(t,f) of the binaural signals and the interaural phase difference IPD(t,f).

Step 2: Build a convolutional neural network for extracting binaural localization features.

The magnitude-spectrum input branch begins with 16 convolution kernels of size 3×2, extracting binaural and monaural features simultaneously; the IPD input branch begins with 16 convolution kernels of size 3×1, extracting binaural features.

In each branch, the first convolutional layer is followed by a max-pooling layer of size 2×1, after which four convolutional layers search for features suited to localization: the first two use 64 kernels of size 3×1 followed by a max-pooling layer of size 2×1, and the last two use 128 kernels of size 3×1. All convolutional layers are activated by the rectified linear unit (ReLU) and processed with batch normalization.

The outputs of the two branches are flattened and concatenated, merging the magnitude features and the IPD features, and the merged features are passed through two fully connected layers of sizes 8192 and 4096; the latter fully connected layer is the shared feature (Shared Feature), in preparation for the subsequent sound source localization.

Step 3: The output of the convolutional neural network is the shared feature layer, which is connected to the multitask neural network as its input. The multitask neural network contains two branches, representing the estimation of azimuth and elevation, respectively. Each branch has five fully connected layers, and the two branches end in two parallel output layers with softmax activation.

Step 4: Perform multi-environment training on the networks of Steps 2 and 3. Divide the data extracted in Step 1 into training data and validation data; train the networks of Steps 2 and 3 on the training data in multiple environments, and verify the trained network on the validation data to obtain a multi-environment trained network.

Step 5: Preprocess the speech signal received by the receiving platform to obtain the magnitude spectra El(t,f), Er(t,f) of the binaural signals and the interaural phase difference IPD(t,f); use these as input to the multi-environment trained network, whose output is the angle information of the binaural sound source localization.

Beneficial Effects

The binaural sound source localization method based on deep learning proposed by the present invention uses a convolutional neural network to process the binaural received signals and a multitask neural network (MNN) to estimate the sound source azimuth and elevation simultaneously. A neural network learns the parameters that determine the target's three-dimensional DOA in different environments (noise and reverberation), and these parameters are used for three-dimensional DOA estimation. The method achieves good azimuth and elevation estimates of the target in different environments with higher localization accuracy, and its operation is greatly simplified compared with existing algorithms, making up for their shortcomings.

The beneficial effects of the present invention are:

The same trained model can be used to estimate the sound source direction in different environments, avoiding special treatment for specific environments and accurately estimating the target azimuth and elevation under a variety of environmental conditions. At the same time, the algorithm has high localization accuracy, exceeding existing binaural sound source localization algorithms in a variety of complex environments. It effectively solves the problem of environmental interference affecting localization in traditional methods, has broad application prospects, and can be put into use directly.

Brief Description of the Drawings

Figure 1: Deep neural network structure of the binaural sound source localization system, consisting of three parts: a preprocessing stage, a convolutional neural network stage, and a multitask neural network stage. The preprocessing stage extracts the magnitude spectra El(t,f), Er(t,f) and the phase difference IPD(t,f) from the binaural signals Yl(t,f), Yr(t,f) and sends them to the CNN. The CNN extracts features from these two inputs and outputs them to the shared feature layer, through which it is connected to the multitask neural network stage, which performs the two subtasks of azimuth localization and elevation localization.

Detailed Description of the Embodiments

The present invention is further described below with reference to the embodiments and the accompanying drawing.

The technical solution adopted by the present invention is an algorithm that uses a convolutional neural network to process the binaural received signals and a multitask neural network (MNN) to estimate the sound source azimuth and elevation simultaneously, mainly comprising the following steps:

1) Construct a dataset for training the neural networks;

Assume the receiving platform is a binaural platform, i.e., the array consists of two elements. Divide the azimuth into Nθ values {θ1, θ2, ..., θNθ} and the elevation into Nφ values {φ1, φ2, ..., φNφ}. Use both ears to receive target speech from different azimuth and elevation angles, preprocess the received signals to extract the binaural magnitudes and phase difference, and construct the dataset required for network training;

2) Build a convolutional neural network for extracting binaural localization features;

3) Build a multitask neural network for simultaneous azimuth and elevation localization;

4) Perform multi-environment training on the networks of steps 2) and 3);

5) Use the trained convolutional neural network and multitask neural network to estimate the direction of the target speech.

The basic idea of the present invention is to train a convolutional neural network for extracting three-dimensional localization features and a multitask neural network for simultaneous azimuth and elevation estimation, and then to estimate the source azimuth and elevation from the received speech signal through the trained networks.

Simulation environment parameter settings:

- Room size: 5 m long, 5 m wide, 3 m high.
- Head position coordinates: 2.5 m (length), 2.5 m (width), 1.5 m (height).
- Distance between the sound source and the center of the head: 1 m.
- Angle-estimation classes: the azimuth is divided into 25 classes, [-80°, -65°, -55°, -45°:5°:45°, 55°, 65°, 80°]; the elevation is divided into 50 classes, uniformly distributed from -45° to 230.625° with a step of 5.625°. The 25 azimuth positions and 50 elevation positions together form 1250 spatial positions (see the grid-construction sketch after this list).
- Reverberation conditions: the reflection coefficients of the room walls are adjusted via the image-source method, and binaural room transfer functions (BRTFs) are generated from the head-related impulse responses (HRIRs) provided by the CIPIC database. There are 8 reverberation levels, with reverberation times uniformly distributed from 150 ms to 500 ms in steps of 50 ms.
- Noise conditions: 7 noise levels, with signal-to-noise ratios uniformly distributed from 5 dB to 35 dB in steps of 5 dB.
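For concreteness, the following NumPy sketch (illustrative code, not part of the patent) builds the two angle grids defined above and checks the class counts:

```python
import numpy as np

# Azimuth grid: [-80, -65, -55, -45:5:45, 55, 65, 80] degrees -> 25 classes.
azimuths = np.concatenate(([-80, -65, -55],
                           np.arange(-45, 46, 5),   # -45 to 45 in 5-degree steps
                           [55, 65, 80]))
assert azimuths.size == 25

# Elevation grid: -45 to 230.625 degrees in 5.625-degree steps -> 50 classes.
elevations = -45.0 + 5.625 * np.arange(50)
assert elevations[-1] == 230.625

print(azimuths.size * elevations.size)  # 1250 spatial positions
```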

Step 1: Construct the dataset for training the neural networks.

This patent assumes that, in a noisy and reverberant environment, a single-source signal is captured by the left- and right-ear microphones of a binaural system. The signal captured in each time-frequency unit of the short-time Fourier transform (STFT) domain is written as

Yl(t,f) = S(t,f) × Bl(f,Θ) + Nl(t,f)
Yr(t,f) = S(t,f) × Br(f,Θ) + Nr(t,f)

where t and f denote the time and frequency indices; Y(t,f), S(t,f), and N(t,f) denote the signal received in each time-frequency frame, the signal emitted by the sound source, and the superimposed noise, respectively; l and r denote the left and right ears; and Bl(f,Θ) and Br(f,Θ) denote the generated binaural room transfer functions.
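For illustration, the following Python sketch implements this STFT-domain model. The BRTFs Bl(f,Θ) and Br(f,Θ) are assumed to be given (in the patent they come from CIPIC HRIRs combined with an image-source room model), and the complex-Gaussian noise scaled to a target SNR is an illustrative assumption:

```python
import numpy as np

def simulate_binaural_stft(S, B_l, B_r, snr_db, rng=None):
    """Sketch of the signal model Y = S x B + N in the STFT domain.

    S        : (F, T) complex STFT of the dry source signal
    B_l, B_r : (F,) complex BRTFs for source direction Theta (assumed given)
    snr_db   : target signal-to-noise ratio in dB (illustrative noise model)
    """
    rng = rng or np.random.default_rng(0)
    Y_l = S * B_l[:, None]                       # S(t,f) x B_l(f, Theta)
    Y_r = S * B_r[:, None]                       # S(t,f) x B_r(f, Theta)
    sig_pow = 0.5 * (np.mean(np.abs(Y_l) ** 2) + np.mean(np.abs(Y_r) ** 2))
    noise_pow = sig_pow / 10 ** (snr_db / 10)

    def noise():                                 # circular complex Gaussian
        return np.sqrt(noise_pow / 2) * (rng.standard_normal(S.shape)
                                         + 1j * rng.standard_normal(S.shape))

    return Y_l + noise(), Y_r + noise()          # adds N_l(t,f), N_r(t,f)
```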

The magnitude spectra El(t,f), Er(t,f) of the binaural signals are extracted as follows:

El(t,f) = 20 log10 |Yl(t,f)|

Er(t,f) = 20 log10 |Yr(t,f)|

Next, the interaural phase difference IPD(t,f) of the binaural signals is extracted as follows:

IPD(t,f) = ∠Yl(t,f) - ∠Yr(t,f)

where ∠ denotes the phase of a complex STFT coefficient.
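A minimal preprocessing sketch along these lines, using SciPy's STFT (the sampling rate and frame/FFT sizes below are assumptions, not values given in the patent):

```python
import numpy as np
from scipy.signal import stft

def binaural_features(y_l, y_r, fs=44100, nfft=512):
    """Magnitude spectra (dB) and interaural phase difference of a binaural pair."""
    _, _, Y_l = stft(y_l, fs=fs, nperseg=nfft)   # Y_l(t,f), shape (freq, time)
    _, _, Y_r = stft(y_r, fs=fs, nperseg=nfft)
    eps = 1e-12                                  # avoid log(0)
    E_l = 20 * np.log10(np.abs(Y_l) + eps)       # E_l(t,f) = 20 log10 |Y_l(t,f)|
    E_r = 20 * np.log10(np.abs(Y_r) + eps)       # E_r(t,f) = 20 log10 |Y_r(t,f)|
    ipd = np.angle(Y_l) - np.angle(Y_r)          # IPD(t,f)
    ipd = np.angle(np.exp(1j * ipd))             # wrap to (-pi, pi]
    return E_l, E_r, ipd
```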

In this method, the binaural magnitude spectra and the interaural phase difference serve as the network input. Next, the output data of the network, i.e., the labels, are constructed.

Since the network must output azimuth and elevation simultaneously, its output is set to two one-hot labels: for azimuth, the output is a 25-dimensional vector whose elements are all 0 except for a 1 at the position corresponding to the source azimuth; for elevation, the output is a 50-dimensional vector whose elements are all 0 except for a 1 at the position corresponding to the source elevation. The dimensionality of each vector corresponds to the number of spatial classes.
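A minimal sketch of the label construction, using the class counts from the simulation setup:

```python
import numpy as np

N_AZIMUTH, N_ELEVATION = 25, 50   # spatial class counts from the setup above

def make_labels(azimuth_idx, elevation_idx):
    """One-hot targets for the two output branches of the network."""
    az = np.zeros(N_AZIMUTH)
    el = np.zeros(N_ELEVATION)
    az[azimuth_idx] = 1.0
    el[elevation_idx] = 1.0
    return az, el
```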

Step 2: Build the convolutional neural network for extracting binaural localization features.

A schematic diagram of the convolutional neural network is shown in Figure 1.

Two independent CNN branches learn localization features from the magnitude spectra and the IPD, respectively, for binaural sound source localization.

First, the magnitude-spectrum input branch begins with 32 convolution kernels of size 3×2, which extract binaural and monaural features simultaneously. The IPD input branch begins with 32 convolution kernels of size 3×1, which extract binaural features.

Second, in each branch, the first convolutional layer is followed by a max-pooling layer of size 2×1, after which four convolutional layers search for features suited to localization: the first two use 64 kernels of size 3×1 followed by a max-pooling layer of size 2×1, and the last two use 128 kernels of size 3×1. All convolutional layers are activated by rectified linear units (ReLU) and processed with batch normalization.

Finally, the outputs of the two branches are flattened and concatenated, merging the magnitude features and the IPD features, and the merged features are passed through two fully connected layers of sizes 8192 and 4096 to form the shared feature (Shared Feature), in preparation for the subsequent sound source localization.
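A minimal PyTorch sketch of this two-branch network follows. The input layouts (magnitude input stacked as frequency × 2 ears, IPD as frequency × 1) and the flattened feature sizes are assumptions; the first-layer kernel count follows this section (32 kernels, whereas the claims recite 16):

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, kernel):
    """Conv + batch normalization + ReLU, as described above."""
    return nn.Sequential(nn.Conv2d(cin, cout, kernel),
                         nn.BatchNorm2d(cout), nn.ReLU())

class Branch(nn.Module):
    """One branch: first conv layer (kernel k0), pooling, then 4 conv layers."""
    def __init__(self, k0):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(1, 32, k0), nn.MaxPool2d((2, 1)),
            conv_block(32, 64, (3, 1)), conv_block(64, 64, (3, 1)),
            nn.MaxPool2d((2, 1)),
            conv_block(64, 128, (3, 1)), conv_block(128, 128, (3, 1)),
            nn.Flatten())

    def forward(self, x):
        return self.net(x)

class FeatureCNN(nn.Module):
    """Magnitude branch (3x2 kernels) + IPD branch (3x1), merged into
    fully connected layers of sizes 8192 and 4096 (the shared feature)."""
    def __init__(self, mag_dim, ipd_dim):    # flattened sizes depend on input F
        super().__init__()
        self.mag_branch = Branch((3, 2))
        self.ipd_branch = Branch((3, 1))
        self.fc = nn.Sequential(nn.Linear(mag_dim + ipd_dim, 8192), nn.ReLU(),
                                nn.Linear(8192, 4096), nn.ReLU())

    def forward(self, mag, ipd):              # mag: (B,1,F,2), ipd: (B,1,F,1)
        z = torch.cat([self.mag_branch(mag), self.ipd_branch(ipd)], dim=1)
        return self.fc(z)                     # 4096-d shared feature
```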

Step 3: Build the multitask neural network for simultaneous azimuth and elevation localization.

In neural-network-based learning, the typical approach is to build a single model for a specific task and optimize its parameters according to a specific criterion. However, a network optimized for only a single task cannot be optimal when multiple related tasks must be completed simultaneously. A suitable solution is to share features among several related tasks so that they can be trained together, providing the best performance for each task; this approach is called multitask learning.

As shown in Figure 1, the shared feature layer is followed by the multitask neural network, which contains two branches representing the estimation of azimuth and elevation, respectively. Each branch has five fully connected layers, and the two branches end in two parallel output layers with softmax activation.
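A minimal PyTorch sketch of these two heads, built on the 4096-dimensional shared feature from Step 2; the widths of the five fully connected layers are not specified in the patent, so the hidden size below is a placeholder:

```python
import torch.nn as nn

class MultitaskHead(nn.Module):
    """Azimuth and elevation branches: five FC layers each, then a parallel
    output layer per branch (softmax is applied inside the loss)."""
    def __init__(self, shared_dim=4096, hidden=1024, n_az=25, n_el=50):
        super().__init__()

        def branch(n_out):
            layers = [nn.Linear(shared_dim, hidden), nn.ReLU()]
            for _ in range(4):                        # five FC layers in total
                layers += [nn.Linear(hidden, hidden), nn.ReLU()]
            layers.append(nn.Linear(hidden, n_out))   # output logits
            return nn.Sequential(*layers)

        self.azimuth = branch(n_az)
        self.elevation = branch(n_el)

    def forward(self, shared):
        return self.azimuth(shared), self.elevation(shared)
```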

Step 4: Multi-environment training.

To make the algorithm robust across a variety of noise and reverberation environments, training data are constructed under different signal-to-noise ratios and reverberation times, and the networks are trained in multiple environments to improve their generalization ability.
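A sketch of such training: environment conditions are drawn from the 7 SNR levels and 8 reverberation levels of the simulation setup, and the two cross-entropy losses are summed so the shared features serve both subtasks (equal task weighting is an assumption; the patent does not specify loss weights):

```python
import itertools
import random
import torch.nn.functional as F

SNRS_DB = list(range(5, 36, 5))       # 7 noise levels, 5-35 dB
T60S_MS = list(range(150, 501, 50))   # 8 reverberation levels, 150-500 ms
_rng = random.Random(0)

def sample_condition():
    """Draw an (SNR, T60) pair so every environment is seen during training."""
    return _rng.choice(list(itertools.product(SNRS_DB, T60S_MS)))

def train_step(cnn, head, optimizer, mag, ipd, az_idx, el_idx):
    """One joint step: summed cross-entropy over azimuth and elevation."""
    optimizer.zero_grad()
    az_logits, el_logits = head(cnn(mag, ipd))
    loss = (F.cross_entropy(az_logits, az_idx)
            + F.cross_entropy(el_logits, el_idx))
    loss.backward()
    optimizer.step()
    return loss.item()
```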

Step 5: Use the trained convolutional neural network and multitask neural network to estimate the direction of the target speech.
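A minimal inference sketch, mapping the two softmax branches back to angles by taking the argmax over each class grid (the grids `azimuths` and `elevations` are those built in the simulation-setup sketch):

```python
import torch

@torch.no_grad()
def localize(cnn, head, mag, ipd, azimuths, elevations):
    """Estimate (azimuth, elevation) for a batch of preprocessed inputs."""
    az_logits, el_logits = head(cnn(mag, ipd))
    az = azimuths[az_logits.argmax(dim=1).cpu().numpy()]
    el = elevations[el_logits.argmax(dim=1).cpu().numpy()]
    return az, el
```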

The patented scheme is compared with two existing methods under different noise and reverberation conditions. Baseline 1 performs localization using a composite feature vector of binaural cues selected by mutual information analysis, see: X. Wu, D. S. Talagala, W. Zhang, and T. D. Abhayapala, "Individualized interaural feature learning and personalized binaural localization model," Applied Sciences, vol. 9, no. 13, p. 2682, 2019. Baseline 2 feeds the interaural phase difference and interaural amplitude difference directly into a convolutional neural network for localization, see: C. Pang, H. Liu, and X. Li, "Multitask learning of time-frequency CNN for sound source localization," IEEE Access, vol. 7, pp. 40725-40737, 2019.

Compared with these two schemes, the proposed scheme achieves the best performance in most cases, especially at low signal-to-noise ratio (SNR ≤ 25 dB) and under strong reverberation (T60 ≥ 200 ms). Since monaural magnitude information is the key cue for human elevation localization, the proposed scheme's retention of monaural spectral cues makes its elevation results improve even more markedly over the baselines. ("-" indicates that the reference does not provide results for this condition.)

Localization results under different SNRs:

Table 1. Azimuth localization accuracy (%) under different SNRs.

SNR          25 dB   20 dB   15 dB   10 dB   5 dB
Baseline 1   -       97.20   -       94.40   -
Baseline 2   96.88   95.57   93.05   88.48   79.87
Proposed     98.10   98.09   98.07   97.94   96.95

Table 2. Elevation localization accuracy (%) under different SNRs.

SNR          25 dB   20 dB   15 dB   10 dB   5 dB
Baseline 1   -       72.64   -       37.04   -
Baseline 2   92.42   86.93   78.37   65.77   48.47
Proposed     98.28   97.59   96.17   93.06   85.25

Localization results under different reverberation conditions:

Table 3. Azimuth localization accuracy (%) under different reverberation times.

Reverberation time   300 ms   350 ms   400 ms   450 ms   500 ms
Baseline 1           91.44    -        89.44    -        78.88
Baseline 2           91.60    92.12    87.44    90.40    83.64
Proposed             94.23    95.77    92.57    94.98    90.02

Table 4. Elevation localization accuracy (%) under different reverberation times.

Reverberation time   300 ms   350 ms   400 ms   450 ms   500 ms
Baseline 1           68.48    -        55.52    -        42.64
Baseline 2           91.76    91.73    86.93    89.57    81.70
Proposed             93.09    95.08    91.13    94.43    87.47

The innovation of this scheme lies in feeding the binaural magnitude spectra (which contain both binaural and monaural cues) together with the IPD into the CNN for feature selection. This makes sound source localization more accurate, especially under noisy and reverberant conditions, whereas the best existing methods perform well only at high SNR and low reverberation. These experimental results confirm that, in complex environments, using the binaural magnitude information preserves localization cues more accurately than using the interaural amplitude difference.

Claims (1)

1. A binaural sound source localization method based on deep learning, characterized in that the receiving platform is a binaural platform consisting of 2 array elements, with the following steps:
Step 1: divide the azimuth of the received signal into Nθ values {θ1, θ2, ..., θNθ} and the pitch angle into Nφ values {φ1, φ2, ..., φNφ}; target speech from different directions and pitch angles is received with both ears:
Yl(t,f) = S(t,f) × Bl(f,Θ) + Nl(t,f)
Yr(t,f) = S(t,f) × Br(f,Θ) + Nr(t,f)
where t, f respectively denote the time and frequency indices; Y(t,f), S(t,f), N(t,f) respectively denote the signal received in each time-frequency frame, the signal emitted by the sound source, and the superimposed noise; l, r respectively denote the left and right ears; and Bl(f,Θ), Br(f,Θ) denote the generated binaural room transfer functions;
preprocessing the received signal: extracting the magnitude spectra El(t,f), Er(t,f) of the binaural signal and the interaural phase difference IPD(t,f) of the binaural signal;
Step 2: build a convolutional neural network for extracting binaural localization features:
following the magnitude-spectrum input branch are 16 convolution kernels of size 3×2, extracting binaural and monaural features simultaneously; following the IPD input branch are 16 convolution kernels of size 3×1, extracting binaural features;
in each branch, the first convolutional layer is followed by a max-pooling layer of size 2×1, after which 4 convolutional layers are used to search for features suitable for localization; the first 2 use 64 convolution kernels of size 3×1 followed by a max-pooling layer of size 2×1, and the last 2 use 128 convolution kernels of size 3×1; all convolutional layers are activated through the rectified linear unit ReLU and processed with batch normalization;
the outputs of the two branches are flattened and concatenated, i.e., the magnitude features and the IPD features are combined; the combined features are then passed through two fully connected layers of sizes 8192 and 4096 respectively, the latter being the shared feature (Shared Feature), in preparation for the subsequent sound source localization;
Step 3: the output of the convolutional neural network is the shared feature layer, which is connected to the multitask neural network as its input; the multitask neural network comprises two branches representing the estimation of azimuth and pitch angle respectively, each branch having five fully connected layers and two parallel output layers with softmax activation;
Step 4: perform multi-environment training on the networks of Step 2 and Step 3: divide the data extracted in Step 1 into training data and verification data, perform multi-environment training on the networks of Step 2 and Step 3 with the training data, and verify the trained network with the verification data to obtain the multi-environment trained network;
Step 5: preprocess the speech signal received by the receiving platform to obtain the magnitude spectra El(t,f), Er(t,f) of the binaural signal and the interaural phase difference IPD(t,f) of the binaural signal, and take these as input to the multi-environment trained network; the network output signal is the angle information of the binaural sound source localization.
CN202011173630.0A, filed 2020-10-28: A binaural sound source localization method based on deep learning. Granted as CN112346013B (en); status: Expired - Fee Related.

Priority Applications (1)

Application Number: CN202011173630.0A; Priority Date: 2020-10-28; Filing Date: 2020-10-28; Title: A binaural sound source localization method based on deep learning

Publications (2)

Publication Number | Publication Date
CN112346013A (en) | 2021-02-09
CN112346013B (granted) | 2023-06-30

Family

ID=74358963

Family Applications (1)

Application Number: CN202011173630.0A; Title: A binaural sound source localization method based on deep learning; Status: Expired - Fee Related; Filing Date: 2020-10-28

Country Status (1)

CN: CN112346013B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication | Priority Date | Publication Date | Assignee | Title
CN116559852B (en)* | 2023-03-20 | 2025-08-22 | Xidian University | A method for maneuvering a protected object to avoid
CN116872195A (en) | 2023-05-24 | 2023-10-13 | Zhejiang University of Technology | A robotic arm calibration method based on deep learning and ultrasonic microarray

Patent Citations (4)

* Cited by examiner, † Cited by third party

Publication | Priority Date | Publication Date | Title
CN107942290A (en)* | 2017-11-16 | 2018-04-20 | Binaural sound source localization method based on BP neural network
CN109164415A (en)* | 2018-09-07 | 2019-01-08 | A binaural sound source localization method based on convolutional neural networks
CN110501673A (en)* | 2019-08-29 | 2019-11-26 | A method and system for spatial direction estimation of binaural auditory sound sources based on a multi-task time-frequency convolutional neural network
CN110517705A (en)* | 2019-08-29 | 2019-11-29 | A binaural sound source localization method and system based on deep neural networks and convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party

US11617050B2 (en)* | 2018-04-04 | 2023-03-28 | Bose Corporation | Systems and methods for sound source virtualization

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party

Cheng Pang et al., "Multitask Learning of Time-Frequency CNN for Sound Source Localization," IEEE Access, 2019-03-31, full text.*
Tan Yawen et al., "Binaural sound source localization algorithm based on BP neural network," Audio Engineering (电声技术), vol. 42, no. 5, 2018-05-31, full text.*

Also Published As

Publication Number | Publication Date
CN112346013A (en) | 2021-02-09

Similar Documents

Publication | Title
Vecchiotti et al., End-to-end binaural sound localisation from the raw waveform
Desai et al., A review on sound source localization systems
CN111239687B, Sound source positioning method and system based on deep neural network
Pang et al., Multitask learning of time-frequency CNN for sound source localization
Lee et al., Sound source localization based on GCC-PHAT with diffuseness mask in noisy and reverberant environments
CN110728989B, A binaural speech separation method based on the long short-term memory network LSTM
Wang et al., Robust TDOA estimation based on time-frequency masking and deep neural networks
CN110610718B, Method and device for extracting expected sound source voice signal
CN106373589B, An iterative-structure-based binaural hybrid speech separation method
Di Carlo et al., Mirage: 2D source localization using microphone pair augmentation with echoes
CN113870893B, Multichannel double-speaker separation method and system
Khan et al., Video-aided model-based source separation in real reverberant rooms
Ahuja et al., Direction-of-voice (DoV) estimation for intuitive speech interaction with smart devices ecosystems
Gelderblom et al., Synthetic data for DNN-based DOA estimation of indoor speech
CN112346013B, A binaural sound source localization method based on deep learning
Jiang et al., Deep and CNN fusion method for binaural sound source localisation
Goli et al., Deep learning-based speech-specific source localization by using binaural and monaural microphone arrays in hearing aids
Traa et al., Robust source localization and enhancement with a probabilistic steered response power model
Yang et al., Full-sphere binaural sound source localization using multi-task neural network
CN117437930A, Processing method, device, equipment and storage medium for multichannel voice signals
CN111948609B, Binaural sound source localization method based on Soft-argmax regressor
Liu et al., Head-related transfer function-reserved time-frequency masking for robust binaural sound source localization
CN112216301A, A deep clustering speech separation method based on logarithmic magnitude spectrum and interaural phase difference
Zhou et al., Binaural sound source localization based on convolutional neural network
Nguyen et al., Selection of the closest sound source for robot auditory attention in multi-source scenarios

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
CF01 | Termination of patent right due to non-payment of annual fee (granted publication date: 2023-06-30)
