Technical Field
The present invention relates to the field of audio signal processing, and in particular to a sound source localization method and device.
Background Art
Sound source localization is used ever more widely, with important applications in products such as video conferencing systems, smart home appliances, and robots. The most common approach today is microphone-array-based localization: an audio signal is received by an array of several microphones and processed by signal processing methods to determine the azimuth angle of the sound source direction, thereby completing sound source localization. The most commonly used microphone array is the linear array shown in FIG. 1, formed by arranging a plurality of microphones along a straight line.

In existing microphone-array-based methods, after the array receives the audio signal, a time delay estimation method is used to localize the source. Specifically, the audio signal is received by the microphone array, the delay of the signal received by each microphone relative to the signal received at a reference point is calculated, and the azimuth angle of the sound source is then estimated from a fixed mapping between delay and azimuth, completing the localization.

In practice, environmental factors such as noise often make the calculated delays inaccurate, so localization based on a fixed mapping has low accuracy, especially near the endfire positions of a linear array (i.e., positions where θ in FIG. 1 is near 0° or 180°), and may even fail altogether.
Summary of the Invention
Embodiments of the present invention disclose a sound source localization method and device for precisely localizing a sound source. The technical solution is as follows:
In a first aspect, an embodiment of the present invention provides a sound source localization method, the method comprising:

obtaining a target audio signal collected by each microphone in a microphone array;

performing framing processing on the target audio signals collected by the respective microphones, and determining, according to the framing results, the target audio frames corresponding to the respective microphones;

calculating a target delay vector corresponding to the target audio frames, wherein the target delay vector is a vector formed from the time differences at which the respective microphones receive the corresponding target audio frame;

inputting the target delay vector into a pre-trained target machine learning model to obtain a target azimuth identification value, wherein the target machine learning model is a machine learning model trained with delay vector samples corresponding to audio frame samples as input and azimuth identification values corresponding to audio signal samples as output, the audio frame samples being audio frames obtained by framing the audio signal samples; and

obtaining, based on the target azimuth identification value, the target azimuth angle corresponding to the sound source of the target audio signal.
Optionally, the step of calculating the target delay vector corresponding to the target audio frames includes:

performing pairwise cross-correlation processing on the target audio frames to obtain the target delay vector.
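As a minimal sketch of this pairwise cross-correlation step (assuming NumPy; the function and variable names are illustrative, not from the specification), the delay of each microphone pair can be read off the peak of their cross-correlation:

```python
import numpy as np

def estimate_delay(x1, x2, fs):
    """Estimate the delay (in seconds) of x2 relative to x1
    from the peak of their cross-correlation."""
    corr = np.correlate(x2, x1, mode="full")
    lag = np.argmax(corr) - (len(x1) - 1)  # lag in samples
    return lag / fs

def delay_vector(frames, fs):
    """Pairwise delays tau_ij for all microphone pairs i < j."""
    m = len(frames)
    return np.array([estimate_delay(frames[i], frames[j], fs)
                     for i in range(m) for j in range(i + 1, m)])

# Toy check: a pure 5-sample delay between two microphones.
fs = 16000
s = np.random.randn(320)                  # one 20 ms frame at 16 kHz
x1 = np.concatenate([s, np.zeros(5)])
x2 = np.concatenate([np.zeros(5), s])     # same signal, 5 samples later
tau = delay_vector([x1, x2], fs)
print(tau[0] * fs)  # ≈ 5 samples
```

For M microphones the resulting vector has M(M−1)/2 entries, one per microphone pair, matching the Γ vector defined later in the specification.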
Optionally, the step of calculating the target delay vector corresponding to the target audio frames includes:

upsampling the target audio frames and converting the upsampled audio frames into frequency-domain signal frames; and

performing pairwise cross-correlation processing on the frequency-domain signal frames to obtain the target delay vector.
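A sketch of this variant (assuming NumPy; the upsampling factor and names are illustrative): each frame is upsampled by zero-padding its spectrum, then the pair is cross-correlated in the frequency domain, which gives sub-sample delay resolution:

```python
import numpy as np

def upsample(x, factor):
    """Upsample by zero-padding the spectrum (FFT-based interpolation)."""
    n = len(x)
    X = np.fft.rfft(x)
    X_up = np.zeros(factor * n // 2 + 1, dtype=complex)
    X_up[:len(X)] = X
    return np.fft.irfft(X_up, n=factor * n) * factor

def freq_domain_delay(x1, x2, fs, factor=4):
    """Delay (seconds) of x2 relative to x1, via frequency-domain
    cross-correlation of the upsampled frames."""
    a, b = upsample(x1, factor), upsample(x2, factor)
    n = 2 * len(a)                                   # zero-pad: no wraparound
    A, B = np.fft.rfft(a, n=n), np.fft.rfft(b, n=n)
    cc = np.fft.irfft(B * np.conj(A), n=n)           # cross-correlation
    cc = np.concatenate([cc[n // 2:], cc[:n // 2]])  # center zero lag
    lag = np.argmax(cc) - n // 2
    return lag / (fs * factor)

fs = 16000
s = np.random.randn(320)
x1 = np.concatenate([s, np.zeros(8)])
x2 = np.concatenate([np.zeros(8), s])     # x1 delayed by 8 samples
print(freq_domain_delay(x1, x2, fs) * fs)  # ≈ 8 samples
```

The upsampling makes the correlation peak resolvable to a fraction of an original sample period, which is why this option can give finer delay estimates than plain time-domain correlation.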
Optionally, the training of the target machine learning model includes:

constructing an initial machine learning model;

determining azimuth identification values for a plurality of preset azimuth angles used for model training;

obtaining a plurality of audio signal samples collected by each microphone in the microphone array, and framing each audio signal sample to obtain a plurality of audio frame samples, wherein an audio signal sample is an audio signal whose corresponding sound source lies at one of the preset azimuth angles;

calculating the delay vector sample corresponding to each audio frame sample;

inputting each delay vector sample into the initial machine learning model, and training on the delay vector samples using the azimuth identification values corresponding to the preset azimuth angles; and

completing the training to obtain the target machine learning model when the mean square error between each delay vector sample and the delay vector produced by training is smaller than a preset value.
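The training procedure above can be sketched as follows. This is an illustrative stand-in, not the specification's exact procedure: it uses scikit-learn's SVC (the specification elsewhere names support vector machines as one option), synthetic delay-vector samples, and a made-up toy array geometry:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: delay-vector samples for N preset azimuths.
rng = np.random.default_rng(0)
azimuths = [0, 30, 60, 90, 120, 150, 180]     # preset azimuths (degrees)
X, y = [], []
for az in azimuths:
    # Ideal pairwise delays for a toy linear array (illustrative geometry).
    center = np.cos(np.radians(az)) * np.array([1.0, 2.0, 3.0])
    for _ in range(50):                        # 50 noisy frames per azimuth
        X.append(center + rng.normal(0, 0.05, 3))
        y.append(az)

# Here each azimuth is its own identification value (the first optional
# scheme); training maps delay vectors directly to azimuth labels.
model = SVC(kernel="rbf").fit(np.array(X), np.array(y))

test = np.cos(np.radians(60)) * np.array([1.0, 2.0, 3.0])
print(model.predict([test])[0])  # expected: 60
```

Because the samples are collected at known preset azimuths, the label for every delay-vector sample is fixed in advance, and training reduces to fitting a multi-class classifier over the delay-vector space.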
Optionally, the step of determining azimuth identification values for the plurality of preset azimuth angles used for model training includes:

taking each preset azimuth angle used for training as its own azimuth identification value;

and the step of obtaining, based on the target azimuth identification value, the target azimuth angle corresponding to the sound source of the target audio signal includes:

taking the target azimuth identification value as the target azimuth angle corresponding to the sound source of the target audio signal.
Optionally, the step of determining the azimuth identification values of the preset azimuth angles used for model training includes:

encoding the plurality of preset azimuth angles used for model training according to a preset encoding rule to obtain a binary array corresponding to each preset azimuth angle;

and the step of obtaining, based on the target azimuth identification value, the target azimuth angle corresponding to the sound source of the target audio signal includes:

decoding the target azimuth identification value according to a decoding rule corresponding to the encoding rule to obtain a decoding result; and

taking the decoding result as the target azimuth angle corresponding to the sound source of the target audio signal.
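A minimal sketch of one such encoding rule (illustrative; the specification does not fix a particular rule): assign each preset azimuth an index and use its n-bit binary representation, with n = ⌈log₂N⌉, decoding by the inverse lookup:

```python
import math

def build_codebook(azimuths):
    """Assign each preset azimuth an n-bit binary array, n = ceil(log2 N)."""
    n = max(1, math.ceil(math.log2(len(azimuths))))
    encode = {az: [int(b) for b in format(i, f"0{n}b")]
              for i, az in enumerate(azimuths)}
    decode = {tuple(bits): az for az, bits in encode.items()}
    return encode, decode

azimuths = [0, 30, 60, 90, 120, 150, 180, 210]
enc, dec = build_codebook(azimuths)
print(enc[90])               # [0, 1, 1] — 4th azimuth, index 3
print(dec[tuple(enc[90])])   # 90
```

Encoding N azimuths into ⌈log₂N⌉ bits lets the model output a short binary array per frame rather than a raw angle, which matches the per-bit decision formula given below.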
Optionally, the feature matching formula of the target machine learning model is:
y(p) = sgn( Σ_{k=1}^{N} a_k(p)·K(Γ_k, Γ) + b(p) )

wherein y(p) is the p-th binary digit in the target binary array, p = 1, 2, …, n, n = ⌈log₂N⌉, and N is the number of preset azimuth angles, k = 1, 2, …, N; a_k(p) and b(p) are the pre-trained parameters corresponding to the p-th binary digit for the k-th preset azimuth angle; K(x, y) is the kernel function of the target machine learning model; Γ_k is the pre-trained delay vector corresponding to the k-th preset azimuth angle; M is the number of microphones in the microphone array; and Γ = [τ12, …, τ1M, τ23, …, τ2M, …, τ(M-1)M]^T is the target delay vector.
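Assuming each bit is produced by a sign-thresholded kernel expansion of this form, the per-bit decision can be sketched as follows (all parameter values here are hand-made toy numbers, not trained values):

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """K(x, y): an RBF kernel, one common choice (an assumption here)."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def predict_bits(gamma_vec, support_vecs, a, b):
    """y(p) = 1 if sum_k a[k][p] * K(Gamma_k, Gamma) + b[p] > 0 else 0,
    for each bit p of the azimuth's binary identifier."""
    bits = []
    for p in range(len(b)):
        score = sum(a[k][p] * rbf_kernel(support_vecs[k], gamma_vec)
                    for k in range(len(support_vecs))) + b[p]
        bits.append(1 if score > 0 else 0)
    return bits

# Toy setup: two preset azimuths, one identifier bit.
support_vecs = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]   # Gamma_k per azimuth
a = [[-1.0], [1.0]]                                  # per-bit weights a_k(p)
b = [0.0]                                            # per-bit bias b(p)
print(predict_bits([1.0, 1.0, 1.0], support_vecs, a, b))  # [1]
```

Each bit p is an independent two-class decision over the same delay vector Γ, so n = ⌈log₂N⌉ such decisions together identify one of the N preset azimuths.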
In a second aspect, an embodiment of the present invention further provides a sound source localization device, the device comprising:

a target audio signal obtaining module, configured to obtain a target audio signal collected by each microphone in a microphone array;

a target audio frame determining module, configured to perform framing processing on the target audio signals collected by the respective microphones and determine, according to the framing results, the target audio frames corresponding to the respective microphones;

a target delay vector calculating module, configured to calculate a target delay vector corresponding to the target audio frames, wherein the target delay vector is a vector formed from the time differences at which the respective microphones receive the corresponding target audio frame;

a target azimuth identification value obtaining module, configured to input the target delay vector into a target machine learning model pre-trained by a model training module to obtain a target azimuth identification value, wherein the target machine learning model is a machine learning model trained with delay vector samples corresponding to audio frame samples as input and azimuth identification values corresponding to audio signal samples as output, the audio frame samples being audio frames obtained by framing the audio signal samples; and

a target azimuth determining module, configured to obtain, based on the target azimuth identification value, the target azimuth angle corresponding to the sound source of the target audio signal.
Optionally, the target delay vector calculating module includes:

a first cross-correlation unit, configured to perform pairwise cross-correlation processing on the target audio frames to obtain the target delay vector.

Optionally, the target delay vector calculating module includes:

a converting unit, configured to upsample the target audio frames and convert the upsampled audio frames into frequency-domain signal frames; and

a second cross-correlation unit, configured to perform pairwise cross-correlation processing on the frequency-domain signal frames to obtain the target delay vector.
Optionally, the model training module includes:

a constructing unit, configured to construct an initial machine learning model;

an azimuth identification value determining unit, configured to determine azimuth identification values for a plurality of preset azimuth angles used for model training;

an audio frame sample obtaining unit, configured to obtain a plurality of audio signal samples collected by each microphone in the microphone array and frame each audio signal sample to obtain a plurality of audio frame samples, wherein an audio signal sample is an audio signal whose corresponding sound source lies at one of the preset azimuth angles;

a delay vector sample calculating unit, configured to calculate the delay vector sample corresponding to each audio frame sample;

a sample training unit, configured to input each delay vector sample into the initial machine learning model and train on the delay vector samples using the azimuth identification values corresponding to the preset azimuth angles; and

a target machine learning model obtaining unit, configured to complete the training and obtain the target machine learning model when the mean square error between each delay vector sample and the delay vector produced by training is smaller than a preset value.
Optionally, the azimuth identification value determining unit includes:

a first identification value determining unit, configured to take each preset azimuth angle used for training as its own azimuth identification value;

and the target azimuth determining module includes:

a first target azimuth determining unit, configured to take the target azimuth identification value as the target azimuth angle corresponding to the sound source of the target audio signal.
Optionally, the azimuth identification value determining unit includes:

a second identification value determining unit, configured to encode the plurality of preset azimuth angles used for model training according to a preset encoding rule to obtain a binary array corresponding to each preset azimuth angle;

and the target azimuth determining module includes:

a decoding unit, configured to decode the target azimuth identification value according to a decoding rule corresponding to the encoding rule to obtain a decoding result; and

a second target azimuth determining unit, configured to take the decoding result as the target azimuth angle corresponding to the sound source of the target audio signal.
Optionally, the feature matching formula of the target machine learning model is:
y(p) = sgn( Σ_{k=1}^{N} a_k(p)·K(Γ_k, Γ) + b(p) )

wherein y(p) is the p-th binary digit in the target binary array, p = 1, 2, …, n, n = ⌈log₂N⌉, and N is the number of preset azimuth angles, k = 1, 2, …, N; a_k(p) and b(p) are the pre-trained parameters corresponding to the p-th binary digit for the k-th preset azimuth angle; K(x, y) is the kernel function of the target machine learning model; Γ_k is the pre-trained delay vector corresponding to the k-th preset azimuth angle; M is the number of microphones in the microphone array; and Γ = [τ12, …, τ1M, τ23, …, τ2M, …, τ(M-1)M]^T is the target delay vector.
In this solution, the target audio signal collected by each microphone in the microphone array is first obtained; the target audio signals are framed, and the target audio frames corresponding to the respective microphones are determined from the framing results; the target delay vector corresponding to the target audio frames is then calculated and input into a pre-trained target machine learning model to obtain a target azimuth identification value; finally, the target azimuth angle corresponding to the sound source of the target audio signal is obtained from the target azimuth identification value. Because the target machine learning model is trained with delay vector samples corresponding to audio frame samples collected in the actual application scenario as input and azimuth identification values corresponding to the audio signal samples as output, the azimuth angle of the sound source can be determined accurately even when environmental factors such as noise make the delay calculation imprecise.
Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention or the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a microphone array;

FIG. 2 is a flowchart of a sound source localization method provided by an embodiment of the present invention;

FIG. 3 is a flowchart of a method for training a target machine learning model provided by an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a sound source localization device provided by an embodiment of the present invention.
Detailed Description of Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

To improve the accuracy of sound source localization, embodiments of the present invention provide a sound source localization method and device.

It should first be noted that the sound source localization method provided by the embodiments of the present invention can be applied to an electronic device having a microphone array, or to an electronic device communicating with a microphone array (both hereinafter referred to as the electronic device), for example, a video conferencing system or a smart home appliance.

The sound source localization method provided by an embodiment of the present invention is introduced first.
As shown in FIG. 2, a sound source localization method includes the following steps.

S201: obtaining a target audio signal collected by each microphone in a microphone array.

When a sound source emits an audio signal, the microphones in the microphone array collect it, and the electronic device can then obtain the audio signal collected by each microphone, i.e., the target audio signal.
S202: performing framing processing on the target audio signals collected by the respective microphones, and determining, according to the framing results, the target audio frames corresponding to the respective microphones.

To process the audio signals in real time, after acquiring the audio signal collected by each microphone, the electronic device may frame each target audio signal according to a preset duration, dividing it into several frames to obtain the target audio frames.

It should be noted that the preset duration can be determined by those skilled in the art according to factors such as the actual length of the audio signal and the application scenario, and is not specifically limited here. For example, when real-time performance matters more, the preset duration can be set shorter; when higher-precision results are needed, it can be set longer. Framing is a common processing method in this field, and those skilled in the art can perform it according to the actual situation, so it is not described in detail here.

After framing the target audio signal, one frame may be selected and determined as the target audio frame, or multiple frames may be selected as target audio frames; both are reasonable. It should be noted that the target audio frames determined for the respective microphones generally correspond to one another and are equal in number. That is, if the target audio signal collected by one microphone is framed at 20 ms per frame and 30 consecutive frames starting from the first frame are selected for one round of processing, the target audio signals collected by the other microphones in the array are framed in the same way (20 ms per frame), and the same 30 consecutive frames starting from the first frame are selected for processing.
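The framing step can be sketched as follows (assuming NumPy; 20 ms non-overlapping frames, as in the example above):

```python
import numpy as np

def split_frames(signal, fs, frame_ms=20):
    """Split a 1-D signal into non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    return np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))

fs = 16000
signal = np.random.randn(fs)         # 1 second of audio
frames = split_frames(signal, fs)    # 50 frames of 320 samples each
print(frames.shape)  # (50, 320)
```

Applying the same function with the same parameters to every microphone's signal guarantees the frames correspond across microphones, as the text above requires.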
S203: calculating the target delay vector corresponding to the target audio frames.

The target delay vector is a vector formed from the time differences at which the respective microphones receive the corresponding target audio frame. Because the microphones in the array are generally at different distances from the sound source, they receive the target audio frame at different times; once the electronic device has determined the target audio frames, it can therefore calculate the target delay vector from these time differences. For example, if the target audio frame is the first frame after framing, the target delay vector is the vector formed by the time differences at which the respective microphones receive that first frame.

It can be understood that if each microphone corresponds to a single target audio frame, the target delay vector is a one-dimensional vector; if each microphone corresponds to multiple frames, the target delay vector is multi-dimensional, i.e., a matrix of vectors.
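Concretely, an array of M microphones yields M(M−1)/2 pairwise delays per frame, and F frames stack into an F × M(M−1)/2 matrix; a small illustration (the values of M and F are only examples):

```python
from itertools import combinations

M = 4  # microphones in the array
pairs = list(combinations(range(M), 2))
print(len(pairs))   # M*(M-1)//2 = 6 pairwise delays per frame

F = 30              # frames selected per round of processing
shape = (F, len(pairs))
print(shape)        # (30, 6): the multi-frame delay "vector" is a matrix
```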
For clarity of presentation, specific implementations of calculating the target delay vector are described by example later.
S204: inputting the target delay vector into a pre-trained target machine learning model to obtain a target azimuth identification value.

The target machine learning model may be a machine learning model trained with delay vector samples corresponding to audio frame samples as input and azimuth identification values corresponding to audio signal samples as output, the audio frame samples being audio frames obtained by framing the audio signal samples. The target machine learning model may be any existing machine learning model, such as a support vector machine, and is not specifically limited here.

Inputting the target delay vector into the target machine learning model and performing feature matching yields the target azimuth identification value. For clarity of presentation, a specific implementation of obtaining the target azimuth identification value by feature matching on the target delay vector is described by example later.

It should be noted that the audio frame samples used to train the target machine learning model are generally collected in an environment the same as or similar to that of the actual localization, i.e., with similar environmental noise, signal attenuation, and other factors, which makes the target azimuth identification value produced by the model more accurate.

For example, when the electronic device is a video conferencing system, audio frame samples of each participant speaking from his or her seat can be collected through the microphone array before the meeting starts and then used for training to obtain the target machine learning model. The resulting model is well suited to the current conferencing environment: because the audio frame samples used in training were collected in the same environment as the target audio frames collected during localization, an accurate target azimuth identification value can be obtained from the target delay vector even when its calculation is not precise.

In another implementation, multiple machine learning models may be pre-trained in different environments. During localization, the most suitable one can then be selected as the target machine learning model according to the actual conditions. Taking a video conferencing system as an example, multiple models suited to different room sizes, numbers of participants, and similar factors can be pre-trained, and when the system is actually used, the most suitable model is selected according to the actual room size and number of participants.
S205: obtaining, based on the target azimuth identification value, the target azimuth angle corresponding to the sound source of the target audio signal.

Having obtained the target azimuth identification value, the electronic device can determine from it the target azimuth angle corresponding to the sound source of the target audio signal, completing the localization.

For example, if the target azimuth identification value is the target azimuth angle itself, such as 30° or 50°, the target azimuth angle equals the identification value. If the identification value was encoded according to some rule, it can be decoded to obtain the target azimuth angle; this is also reasonable.
It can be seen that in the solution provided by this embodiment of the present invention, the target audio signal collected by each microphone in the microphone array is first obtained; the target audio signals are framed, and the target audio frames corresponding to the respective microphones are determined from the framing results; the target delay vector corresponding to the target audio frames is calculated and input into the pre-trained target machine learning model to obtain the target azimuth identification value; and finally the target azimuth angle corresponding to the sound source of the target audio signal is obtained from that value. Because the target machine learning model is trained with delay vector samples corresponding to audio frame samples collected in the actual application scenario as input and azimuth identification values corresponding to the audio signal samples as output, the azimuth angle of the sound source can be determined accurately even when environmental factors such as noise make the delay calculation imprecise.
As one implementation of this embodiment of the present invention, calculating the target delay vector corresponding to the target audio frames may include:

performing pairwise cross-correlation processing on the target audio frames to obtain the target delay vector.
下面以一帧目标音频帧为例,对计算目标时延向量的方式进行说明,麦克风阵列可以为如图1所示的线阵,麦克风11、麦克风12、麦克风13及麦克风14沿直线排列,两个麦克风之间的距离为d,声源方位与麦克风阵列的夹角,即目标方位角以θ表示,那么从图中可以得出:Taking one target audio frame as an example, the calculation of the target delay vector is explained below. The microphone array may be the linear array shown in FIG. 1, in which microphone 11, microphone 12, microphone 13 and microphone 14 are arranged along a straight line with a distance d between adjacent microphones. The angle between the sound source direction and the microphone array, i.e. the target azimuth, is denoted by θ. From the figure it can be derived that:
cτ12=dcosθ (1)
其中,c为音频信号在空气中的传播速度,τ12为麦克风11与麦克风12接收目标音频帧的时间差。从公式(1)可以得出:Wherein, c is the propagation speed of the audio signal in air, and τ12 is the time difference between microphone 11 and microphone 12 receiving the target audio frame. From formula (1), it can be concluded that:
θ=arccos(cτ12/d) (2)
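The delay–azimuth relation in formula (1) is easy to check numerically. Below is a minimal sketch, assuming far-field arrival as in FIG. 1 and a speed of sound of 343 m/s (an assumed value, not fixed by the text):

```python
import numpy as np

C = 343.0  # assumed speed of sound in air, m/s

def delay_from_azimuth(theta_deg, d, c=C):
    """Time difference between adjacent microphones for a source at azimuth theta."""
    return d * np.cos(np.radians(theta_deg)) / c

def azimuth_from_delay(tau_12, d, c=C):
    """Recover theta from a measured delay; clip guards against noisy |c*tau/d| > 1."""
    return np.degrees(np.arccos(np.clip(c * tau_12 / d, -1.0, 1.0)))
```

For d = 0.1 m and θ = 60 degrees, the round trip `azimuth_from_delay(delay_from_azimuth(60, 0.1), 0.1)` returns 60 degrees.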
为了方便说明计算过程,假设环境不存在混响,麦克风11和麦克风12接收的目标音频帧分别为x1(n)和x2(n),那么可以得出:To simplify the description of the calculation, assume that the environment is free of reverberation, and that the target audio frames received by microphone 11 and microphone 12 are x1(n) and x2(n) respectively; then:
x1(n)=α1s(n-τ1)+n1(n),x2(n)=α2s(n-τ2)+n2(n) (3)
在公式(3)中,s(n)为声源音频信号,α1为声源音频信号到麦克风11的声音传播衰减,α2为声源到麦克风12的声音传播衰减,τ1为声源音频信号到麦克风11的时间延迟,τ2为声源音频信号到麦克风12的时间延迟,n1(n)为麦克风11接收到的加性噪声,n2(n)为麦克风12接收到的加性噪声。其中,s(n)、n1(n)、n2(n)两两互不相关。In formula (3), s(n) is the sound source audio signal, α1 is the propagation attenuation from the sound source to microphone 11, α2 is the propagation attenuation from the sound source to microphone 12, τ1 is the time delay from the sound source to microphone 11, τ2 is the time delay from the sound source to microphone 12, n1(n) is the additive noise received by microphone 11, and n2(n) is the additive noise received by microphone 12. Among them, s(n), n1(n) and n2(n) are pairwise uncorrelated.
那么,将x1(n)与x2(n)进行互相关便可以得到:Then, cross-correlate x1 (n) with x2 (n) to get:
R(τ12)=E[x1(n)x2(n-τ12)] (4)R(τ12 )=E[x1 (n)x2 (n-τ12 )] (4)
将公式(3)代入公式(4)中可以得到:Substituting formula (3) into formula (4) yields:
R(τ12)=α1α2E[s(n-τ1)s(n-τ2-τ12)]+α1E[s(n-τ1)n2(n-τ12)]+
α2E[s(n-τ2-τ12)n1(n)]+E[n1(n)n2(n-τ12)] (5)
由于s(n)、n1(n)、n2(n)两两互不相关,所以E[s(n-τ1)n2(n-τ12)]=0,E[s(n-τ2-τ12)n1(n)]=0,E[n1(n)n2(n-τ12)]=0。那么,由公式(5)可以得出:Since s(n), n1(n) and n2(n) are pairwise uncorrelated, E[s(n-τ1)n2(n-τ12)]=0, E[s(n-τ2-τ12)n1(n)]=0, and E[n1(n)n2(n-τ12)]=0. Then, from formula (5), it can be concluded that:
R(τ12)=α1α2E[s(n-τ1)s(n-τ2-τ12)] (6)R(τ12 )=α1 α2 E[s(n-τ1 )s(n-τ2 -τ12 )] (6)
即:which is:
R(τ12)=α1α2Rs(τ12-(τ1-τ2)) (7)R(τ12 )=α1 α2 Rs (τ12 -(τ1 -τ2 )) (7)
从公式(7)中可以看出,当τ12=τ1-τ2时,R(τ12)取得最大值。因此,如果求得R(τ12)的最大值,便可以找出麦克风11与麦克风12的相对延迟τ12=τ1-τ2。也就是说,可以通过搜索R(τ12)的峰值的方式来获得对应的时间差τ12。It can be seen from formula (7) that when τ12 =τ1 −τ2 , R(τ12 ) takes the maximum value. Therefore, if the maximum value of R(τ12 ) is obtained, the relative delay τ12 =τ1 −τ2 between the microphone 11 and the microphone 12 can be found. That is to say, the corresponding time difference τ12 can be obtained by searching the peak value of R(τ12 ).
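The peak search just described can be sketched as follows; this is a plain (unweighted) cross-correlation, with function and variable names chosen for illustration:

```python
import numpy as np

def estimate_delay(x1, x2, fs):
    """Locate the peak of R(tau_12) and convert the peak lag to seconds."""
    corr = np.correlate(x1, x2, mode="full")       # R over all lags
    lags = np.arange(-(len(x2) - 1), len(x1))      # lag axis for mode="full"
    return lags[np.argmax(corr)] / fs
```

With x2 lagging x1 by 3 samples at fs = 1000 Hz, the estimate is -0.003 s; τ12 = τ1 - τ2 comes out negative when microphone 12 receives the frame later, consistent with formula (7).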
通过上述方式便可以计算得到麦克风阵列中所有麦克风接收目标音频帧所对应的时间差,例如,麦克风11与麦克风13接收目标音频帧所对应的时间差τ13等,进而,便可以得到目标时延向量:In the above manner, the time differences for all microphone pairs in the array receiving the target audio frame can be calculated, for example the time difference τ13 between microphone 11 and microphone 13 receiving the target audio frame, and the target delay vector can then be obtained:
Γ=[τ12,…,τ1M,τ23,…,τ2M,…,τ(M-1)M]TΓ=[τ12 ,…,τ1M ,τ23 ,…,τ2M ,…,τ(M-1)M ]T
其中,M为麦克风阵列中麦克风的数量。Wherein, M is the number of microphones in the microphone array.
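Stacking the pairwise delays in the order given above yields Γ. A sketch, reusing a plain cross-correlation peak search (names are illustrative):

```python
import numpy as np
from itertools import combinations

def delay_vector(frames, fs):
    """Build Gamma = [tau_12, ..., tau_1M, tau_23, ..., tau_(M-1)M]^T
    from one target audio frame per microphone (frames: M 1-D arrays)."""
    taus = []
    for i, j in combinations(range(len(frames)), 2):   # pairs i < j, in Gamma's order
        corr = np.correlate(frames[i], frames[j], mode="full")
        lags = np.arange(-(len(frames[j]) - 1), len(frames[i]))
        taus.append(lags[np.argmax(corr)] / fs)
    return np.asarray(taus)                            # length M*(M-1)/2
```

For M = 4 microphones the vector has 6 entries, matching the layout of Γ above.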
可以理解的是,当目标音频帧为多帧时,则可以分别计算每一帧目标音频帧对应的时延向量,然后由多个时延向量组成一个多维的目标时延向量。其中,计算每一帧目标音频帧对应的时延向量的方式与上述计算目标时延向量的方式相同,在此不再赘述。It can be understood that, when there are multiple target audio frames, the delay vector corresponding to each target audio frame can be calculated separately, and the multiple delay vectors then form a multi-dimensional target delay vector. The method of calculating the delay vector for each target audio frame is the same as the method of calculating the target delay vector described above, and will not be repeated here.
作为本发明实施例的一种实施方式,为了提高目标时延向量的准确度,同时减少计算量,计算上述目标音频帧所对应的目标时延向量的方式可以包括:As an implementation manner of the embodiment of the present invention, in order to improve the accuracy of the target delay vector and reduce the amount of calculation, the method of calculating the target delay vector corresponding to the above target audio frame may include:
对所述目标音频帧进行上采样处理,并将上采样处理后的音频帧转换为频域信号帧;对所述频域信号帧进行两两互相关处理,得到所述目标时延向量。performing upsampling processing on the target audio frame, and converting the upsampled audio frame into a frequency domain signal frame; performing pairwise cross-correlation processing on the frequency domain signal frame to obtain the target delay vector.
为了提高采样率,使目标时延向量的准确度更高,可以先对目标音频帧进行上采样处理。然后可以将上采样处理后的音频帧转换为频域信号帧,进而对该频域信号帧进行两两互相关处理,得到目标时延向量,以减少计算量。In order to increase the sampling rate and increase the accuracy of the target delay vector, the target audio frame can be up-sampled first. Then, the upsampled audio frame can be converted into a frequency domain signal frame, and then pairwise cross-correlation processing is performed on the frequency domain signal frame to obtain a target delay vector, so as to reduce the amount of calculation.
对频域信号帧进行两两互相关处理过程中,主要运用两次FFT(Fast Fourier Transformation,快速傅里叶变换)和一次IFFT(Inverse Fast Fourier Transformation,快速傅里叶逆变换),而FFT可以选择用蝶形运算来实现,大大减小算法复杂度。In the pairwise cross-correlation processing of the frequency domain signal frames, two FFTs (Fast Fourier Transformation) and one IFFT (Inverse Fast Fourier Transformation) are mainly used, and the FFTs can be implemented with butterfly operations, which greatly reduces the complexity of the algorithm.
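The two-FFT/one-IFFT cross-correlation can be sketched as follows. Zero-padding to a power of two is an assumption (the text does not fix the transform length), and the sketch assumes both frames have more than one sample:

```python
import numpy as np

def xcorr_fft(x1, x2):
    """Cross-correlation via FFT(x1), FFT(x2) and one inverse FFT,
    reordered to match np.correlate(x1, x2, mode="full")."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n = len(x1) + len(x2) - 1
    nfft = 1 << (n - 1).bit_length()                  # power of two for the butterfly FFT
    spec = np.fft.rfft(x1, nfft) * np.conj(np.fft.rfft(x2, nfft))
    corr = np.fft.irfft(spec, nfft)
    # negative lags sit at the tail of the circular result
    return np.concatenate((corr[-(len(x2) - 1):], corr[:len(x1)]))
```

The result matches the time-domain cross-correlation while costing O(n log n) instead of O(n^2), which is the reduction in computation the text refers to.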
需要说明的是,上述上采样处理、对频域信号帧的快速傅里叶变换、快速傅里叶逆变换及蝶形运算均为本领域常用的信号处理方法,本领域技术人员可以根据目标音频信号及实际环境状况等因素进行处理,在此不做具体限定及说明。It should be noted that the above upsampling, the fast Fourier transform and inverse fast Fourier transform of the frequency domain signal frames, and the butterfly operation are all signal processing methods commonly used in the art. Those skilled in the art can carry out the processing according to factors such as the target audio signal and the actual environmental conditions, which are not specifically limited or described here.
作为本发明实施例的一种实施方式,如图3所示,上述目标机器学习模型的训练方式可以包括以下步骤:As an implementation manner of the embodiment of the present invention, as shown in FIG. 3, the training method of the above-mentioned target machine learning model may include the following steps:
S301,构建初始机器学习模型;S301, building an initial machine learning model;
可以理解的是,电子设备首先需要构建一个初始机器学习模型,然后对其进行训练,进而得到目标机器学习模型。例如,可以构建一个最小二乘支持向量机作为初始机器学习模型。It is understandable that an electronic device first needs to construct an initial machine learning model, and then train it to obtain a target machine learning model. For example, a least squares support vector machine can be built as an initial machine learning model.
S302,确定用于模型训练的多个预设方位角的方位角标识值;S302. Determine azimuth identification values of a plurality of preset azimuths used for model training;
其中,该预设方位角的数量及角度可以根据实际需要进行选择,可以理解的是,如图1所示,方位角即为声源所在方向与麦克风阵列所在直线的夹角,即为图中的θ角。例如,可以预设方位角为0度、30度、60度、90度、120度、150度及180度,当然也可以为10度、15度等,这都是合理的。Wherein, the number and angles of the preset azimuths can be selected according to actual needs. It can be understood that, as shown in FIG. 1, the azimuth is the angle between the direction of the sound source and the straight line on which the microphone array lies, that is, the angle θ in the figure. For example, the preset azimuths can be 0, 30, 60, 90, 120, 150 and 180 degrees, and of course they can also be 10 degrees, 15 degrees, etc., which are all reasonable.
在一种实施方式中,可以分别将用于训练的每个预设方位角本身作为自身的方位角标识值。也就是说,如果预设方位角为0度、30度、60度、90度、120度、150度及180度,那么其所对应的方位角标识值则分别为0度、30度、60度、90度、120度、150度及180度。In an implementation manner, each preset azimuth used for training may be used as its own azimuth identification value. That is to say, if the preset azimuth angles are 0°, 30°, 60°, 90°, 120°, 150° and 180°, then the corresponding azimuth identification values are 0°, 30°, 60° degrees, 90 degrees, 120 degrees, 150 degrees and 180 degrees.
在另一种实施方式中,可以根据预设的编码规则对用于模型训练的多个预设方位角进行编码,获得各个预设方位角对应的二进制数组作为其方位角标识值。一般情况下,二进制数组的位数可以通过公式p=[log2N]来确定,其中,N为预设方位角的数量。可以理解的是,p=[log2N]表示:p的值等于log2N向上取整,例如,预设方位角的数量N为6,那么p=[log26]=3,那么此时每个预设方位角对应的二进制数组便可以是一个3位的二进制数组。In another embodiment, multiple preset azimuths used for model training may be encoded according to a preset encoding rule, and the binary array corresponding to each preset azimuth is used as its azimuth identification value. Generally, the number of bits in the binary array can be determined by the formula p=[log2N], where N is the number of preset azimuths. It can be understood that p=[log2N] means that p equals log2N rounded up; for example, if the number N of preset azimuths is 6, then p=[log26]=3, and the binary array corresponding to each preset azimuth is then a 3-bit binary array.
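One possible encoding rule meeting p=[log2N] is a simple index-order assignment over {-1, 1}; the specific assignment below is an illustrative choice, consistent with the example that follows:

```python
import math

def encode_azimuths(azimuths):
    """Map each preset azimuth to a p-bit array over {-1, 1}, p = ceil(log2 N)."""
    p = math.ceil(math.log2(len(azimuths)))
    return {angle: tuple(1 if (idx >> (p - 1 - b)) & 1 else -1 for b in range(p))
            for idx, angle in enumerate(azimuths)}
```

For the six angles 0 to 150 degrees, `encode_azimuths([0, 30, 60, 90, 120, 150])[60]` gives (-1, 1, -1).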
举例而言,如果预设方位角为0度、30度、60度、90度、120度及150度六个角度,那么便可以确定预设方位角的方位角标识值分别为(-1,-1,-1)、(-1,-1,1)、(-1,1,-1)、(-1,1,1)、(1,-1,-1)及(1,-1,1),这样可以简便且不重复地确定各个预设方位角对应的二进制数组。For example, if the preset azimuth angles are 0 degrees, 30 degrees, 60 degrees, 90 degrees, 120 degrees and 150 degrees, then it can be determined that the azimuth identification values of the preset azimuth angles are (-1, -1, -1), (-1, -1, 1), (-1, 1, -1), (-1, 1, 1), (1, -1, -1) and (1, - 1, 1), so that the binary array corresponding to each preset azimuth angle can be determined simply and without repetition.
S303,获得所述麦克风阵列中各个麦克风采集的多个音频信号样本,并对每一个音频信号样本进行分帧处理,得到多个音频帧样本;S303. Obtain a plurality of audio signal samples collected by each microphone in the microphone array, and perform frame division processing on each audio signal sample to obtain a plurality of audio frame samples;
可以理解的是,该音频信号样本为:所对应声源的方位角为上述预设方位角的音频信号。在采集音频信号样本时,一般针对每一个预设方位角采集多个音频信号样本,用来训练上述初始机器学习模型。It can be understood that the audio signal sample is: an audio signal whose azimuth angle of the corresponding sound source is the aforementioned preset azimuth angle. When collecting audio signal samples, generally a plurality of audio signal samples are collected for each preset azimuth angle to train the aforementioned initial machine learning model.
需要说明的是,在采集音频信号样本时,采集环境一般是与声源定位时的环境相同或相似的,这样可以保证训练得到的目标机器学习模型更加适用于对目标时延向量的特征匹配,可以得到准确的目标方位角标识值。It should be noted that when collecting audio signal samples, the collection environment is generally the same as or similar to the environment in which sound source localization is performed, which ensures that the trained target machine learning model is better suited to feature matching of the target delay vector, so that an accurate target azimuth identification value can be obtained.
采集到多个音频信号样本后,可以对每一个音频信号样本进行分帧处理,得到多个音频帧样本,该分帧处理方式与上述对目标音频信号的分帧处理方式相同,在此不再赘述。After collecting multiple audio signal samples, each audio signal sample can be divided into frames to obtain multiple audio frame samples. The framing method is the same as the framing method applied to the target audio signal described above, and will not be repeated here.
S304,计算每一个音频帧样本所对应的时延向量样本;S304, calculating the delay vector samples corresponding to each audio frame sample;
分帧处理得到音频帧样本后,便可以计算每一个音频帧样本所对应的时延向量样本,由于计算每一个音频帧样本所对应的时延向量样本的方式与上述计算目标时延向量的方式类似,相关之处可以参见上述计算目标时延向量的方式部分的说明,在此不再赘述。After the audio frame samples are obtained by framing, the delay vector sample corresponding to each audio frame sample can be calculated. Since the method of calculating the delay vector sample for each audio frame sample is similar to the method of calculating the target delay vector described above, reference can be made to the description of that method, and details are not repeated here.
需要说明的是,由于对于每一个预设方位角均得到了多个音频帧样本,所以与每一个预设方位角相对应的时延向量样本也为多个。该时延向量样本可以为一维向量,也可以是多维向量,这都是合理的。It should be noted that since multiple audio frame samples are obtained for each preset azimuth angle, there are also multiple delay vector samples corresponding to each preset azimuth angle. The delay vector sample may be a one-dimensional vector or a multi-dimensional vector, which is reasonable.
S305,将各时延向量样本输入所述初始机器学习模型,并利用所述预设方位角对应的方位角标识值对相应时延向量样本进行训练;S305. Input each delay vector sample into the initial machine learning model, and use the azimuth identification value corresponding to the preset azimuth to train the corresponding delay vector samples;
计算得到上述时延向量样本后,便可以利用该时延向量样本和已确定的方位角标识值对初始机器学习模型进行训练。After the above time delay vector sample is calculated, the initial machine learning model can be trained by using the time delay vector sample and the determined azimuth identification value.
在一种实现方式中,如果方位角标识值为二进制数组,那么对初始机器学习模型进行训练时所基于的公式可以为:In one implementation, if the azimuth identification value is a binary array, the formula on which the training of the initial machine learning model is based can be:
y(p)=Σk αk(p)K(Γk,Γ0)+b(p),k=1、2…N
其中,y(p)为所述目标二进制数组中的第p位二进制数,K(x,y)为目标机器学习模型的核函数,可以为线性核函数、多项式核函数、径向基核函数等,在此不做具体限定。Γk为第k个预设方位角对应的时延向量样本,k=1、2…N,M为麦克风阵列中麦克风的数量。Γ0为迭代时延向量,其初始值可以由本领域技术人员根据实际情况确定,在此不做具体限定。Wherein, y(p) is the p-th binary number in the target binary array, and K(x,y) is the kernel function of the target machine learning model, which may be a linear kernel, a polynomial kernel, a radial basis kernel, etc., and is not specifically limited here. Γk is the delay vector sample corresponding to the k-th preset azimuth, k=1, 2…N, and M is the number of microphones in the microphone array. Γ0 is the iteration delay vector, whose initial value can be determined by those skilled in the art according to the actual situation and is not specifically limited here.
在训练过程中,对于每个预设方位角,将时延向量样本Γk输入初始机器学习模型,进行迭代运算,同时不断调整参数αk(p)及b(p),以得到目标机器学习模型。During training, for each preset azimuth, the delay vector samples Γk are input into the initial machine learning model, iterative operations are performed, and the parameters αk(p) and b(p) are continuously adjusted to obtain the target machine learning model.
S306,当各时延向量样本与训练得到的时延向量的均方差值均小于预设值时,完成训练,得到所述目标机器学习模型。S306. When the mean square difference values between each delay vector sample and the delay vector obtained through training are less than a preset value, complete the training, and obtain the target machine learning model.
当各时延向量样本与训练得到的时延向量的均方差值均小于预设值时,说明此时的初始机器学习模型对于所有预设方位角,都能得到准确的估计值,那么便可以停止训练,将此时的初始机器学习模型作为目标机器学习模型。同时,将此时确定的参数αk(p)及b(p)作为第k个预设方位角的第p位二进制数所对应的参数,将训练得到的时延向量Γk作为第k个预设方位角所对应的训练得到的时延向量,用来在声源定位过程中计算目标方位角标识值。When the mean square error between each delay vector sample and the delay vector obtained through training is less than the preset value, the initial machine learning model at this point can obtain accurate estimates for all preset azimuths; training can then be stopped, and the current initial machine learning model is taken as the target machine learning model. At the same time, the parameters αk(p) and b(p) determined at this point are taken as the parameters corresponding to the p-th binary bit of the k-th preset azimuth, and the trained delay vector Γk is taken as the trained delay vector corresponding to the k-th preset azimuth, to be used for calculating the target azimuth identification value during sound source localization.
上述预设值可以由本领域技术人员根据实际声源定位所需的精度等因素确定,在此不做具体限定,例如可以为10^-5、10^-4等。The above preset value can be determined by those skilled in the art according to factors such as the accuracy required for actual sound source localization, and is not specifically limited here; for example, it can be 10^-5, 10^-4, etc.
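Since S301 names a least squares support vector machine, the per-bit training can be sketched as the standard LS-SVM linear system. The RBF kernel and the regularization/width values gamma and sigma below are illustrative assumptions, not values fixed by the text:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """K(x, y): radial basis kernel, one of the kernel choices mentioned above."""
    return float(np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / (2 * sigma ** 2)))

def train_lssvm_bit(delay_samples, bit_labels, gamma=10.0, sigma=1.0):
    """Fit one output bit y^(p) by solving the LS-SVM system
        [0   1^T        ] [b    ]   [0]
        [1   K + I/gamma] [alpha] = [y]
    returning b^(p) and the weights alpha_k^(p)."""
    n = len(delay_samples)
    K = np.array([[rbf_kernel(delay_samples[i], delay_samples[j], sigma)
                   for j in range(n)] for i in range(n)])
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.asarray(bit_labels, float)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]
```

Training is repeated once per bit p; the stopping criterion of S306 then compares the fitted outputs against the labelled samples.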
对于每个预设方位角的方位角标识值为二进制数组的情况而言,作为本发明实施例的一种实施方式,上述目标机器学习模型的特征匹配公式可以为:For the case where the azimuth identification value of each preset azimuth is a binary array, as an implementation of the embodiment of the present invention, the feature matching formula of the above-mentioned target machine learning model can be:
y(p)=sgn(Σk αk(p)K(Γk,Γ)+b(p)),k=1、2…N
其中,αk(p)及b(p)即为针对第k个预设方位角预先训练得到的第p位二进制数所对应的参数,Γk即为预先训练得到的第k个预设方位角所对应的时延向量,Γ=[τ12,…,τ1M,τ23,…,τ2M,…,τ(M-1)M]T则为目标时延向量。Wherein, αk(p) and b(p) are the parameters corresponding to the p-th binary bit, pre-trained for the k-th preset azimuth, Γk is the pre-trained delay vector corresponding to the k-th preset azimuth, and Γ=[τ12,…,τ1M,τ23,…,τ2M,…,τ(M-1)M]T is the target delay vector.
将目标时延向量输入目标机器学习模型,利用该特征匹配公式,即可以计算得到y(p)的值,从而确定目标方位角标识值。Input the target time delay vector into the target machine learning model, and use the feature matching formula to calculate the value of y(p) , so as to determine the target azimuth identification value.
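Evaluating the feature matching formula once per bit can be sketched as follows. The sign step reflects that each y(p) is a ±1 bit; the model layout (one parameter triple per bit) is an illustrative assumption:

```python
import numpy as np

def match_identification_value(target_delay_vec, bit_models, kernel):
    """Compute y^(p) = sign(sum_k alpha_k^(p) * K(Gamma_k, Gamma) + b^(p)) per bit.
    bit_models: one (alphas, b, trained_delay_vectors) triple per bit p."""
    bits = []
    for alphas, b, trained_vecs in bit_models:
        s = sum(a * kernel(g, target_delay_vec) for a, g in zip(alphas, trained_vecs))
        bits.append(1 if s + b >= 0 else -1)
    return tuple(bits)
```

The returned tuple is the target azimuth identification value, ready to be decoded back to an angle.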
对于在模型训练过程中,分别将用于训练的每个预设方位角的方位角本身作为自身的方位角标识值的情况而言,作为本发明实施例的一种实施方式,基于目标方位角标识值,得到目标音频信号的声源所对应的目标方位角的方式可以包括:In the model training process, the azimuth of each preset azimuth used for training is used as its own azimuth identification value, as an implementation of the embodiment of the present invention, based on the target azimuth Identification value, the way to obtain the target azimuth angle corresponding to the sound source of the target audio signal may include:
将所述目标方位角标识值作为所述目标音频信号的声源所对应的目标方位角。The target azimuth identification value is used as the target azimuth corresponding to the sound source of the target audio signal.
可以理解的是,由于每个预设方位角的方位角本身作为自身的方位角标识值,所以得到的目标方位角标识值即为目标方位角。It can be understood that, since the azimuth of each preset azimuth itself serves as its own azimuth identification value, the obtained target azimuth identification value is the target azimuth.
对于在模型训练过程中,根据预设的编码规则对用于模型训练的多个预设方位角进行编码,获得各个预设方位角对应的二进制数组的情况而言,作为本发明实施例的一种实施方式,基于目标方位角标识值,得到目标音频信号的声源所对应的目标方位角的方式可以包括:For the case where, during model training, multiple preset azimuths used for model training are encoded according to a preset encoding rule and the binary array corresponding to each preset azimuth is obtained, as an implementation manner of the embodiment of the present invention, the manner of obtaining the target azimuth corresponding to the sound source of the target audio signal based on the target azimuth identification value may include:
按照与所述编码规则对应的解码规则,对所述目标方位角标识值进行解码,得到解码结果;将所述解码结果作为所述目标音频信号的声源所对应的目标方位角。Decoding the target azimuth identification value according to the decoding rule corresponding to the encoding rule to obtain a decoding result; using the decoding result as the target azimuth corresponding to the sound source of the target audio signal.
由于目标方位角标识值为二进制数组,那么得到该目标方位角标识值后,便可以对其进行解码,进而得到目标方位角。例如,如果得到的目标方位角标识值为二进制数组(-1,1,-1),对该二进制数组进行解码,得到解码结果为60度,那么目标方位角即为60度。Since the target azimuth identification value is a binary array, after obtaining the target azimuth identification value, it can be decoded to obtain the target azimuth. For example, if the obtained target azimuth identification value is a binary array (-1, 1, -1), and the binary array is decoded to obtain a decoding result of 60 degrees, then the target azimuth is 60 degrees.
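The decoding step is then a lookup in the codebook fixed at training time; a minimal sketch, with the codebook contents assumed to mirror the encoding rule chosen at training time:

```python
def decode_azimuth(code, codebook):
    """Invert the encoding rule: map a p-bit identification value back to degrees."""
    try:
        return codebook[tuple(code)]
    except KeyError:
        raise ValueError("unknown identification value: %r" % (code,))
```

With a codebook containing (-1, 1, -1) -> 60, the decoded target azimuth is 60 degrees, as in the example above.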
相应于上述方法实施例,本发明实施例还提供了一种声源定位装置,下面对本发明实施例所提供的一种声源定位装置进行介绍。Corresponding to the foregoing method embodiments, an embodiment of the present invention further provides a sound source localization device, and the following describes the sound source localization device provided by the embodiment of the present invention.
如图4所示,一种声源定位装置,所述装置包括:As shown in Figure 4, a sound source localization device, the device includes:
目标音频信号获得模块410,用于获得麦克风阵列中各个麦克风采集的目标音频信号;The target audio signal obtaining module 410 is used to obtain the target audio signal collected by each microphone in the microphone array;
目标音频帧确定模块420,用于对所述各个麦克风采集的目标音频信号进行分帧处理,并根据分帧结果,确定所述各个麦克风所对应的目标音频帧;The target audio frame determination module 420 is configured to perform frame division processing on the target audio signals collected by the respective microphones, and determine the target audio frames corresponding to the respective microphones according to the frame division results;
目标时延向量计算模块430,用于计算所述目标音频帧所对应的目标时延向量,其中,所述目标时延向量为:基于各个麦克风接收相应目标音频帧的时间差所形成的向量;The target delay vector calculation module 430 is configured to calculate the target delay vector corresponding to the target audio frame, wherein the target delay vector is: a vector formed based on the time difference of each microphone receiving the corresponding target audio frame;
目标方位角标识值获得模块440,用于将所述目标延时向量输入至由模型训练模块预先训练完成的目标机器学习模型,得到目标方位角标识值,其中,所述目标机器学习模型为:以音频帧样本对应的时延向量样本作为输入内容,且以音频信号样本对应的方位角标识值作为输出内容所训练得到的机器学习模型,所述音频帧样本为对所述音频信号样本进行分帧处理得到的音频帧;The target azimuth identification value obtaining module 440 is used to input the target delay vector into the target machine learning model pre-trained by the model training module to obtain the target azimuth identification value, wherein the target machine learning model is: A machine learning model trained with the delay vector sample corresponding to the audio frame sample as the input content and the azimuth identification value corresponding to the audio signal sample as the output content, the audio frame sample is for analyzing the audio signal sample The audio frame obtained by frame processing;
目标方位角确定模块450,用于基于所述目标方位角标识值,得到所述目标音频信号的声源所对应的目标方位角。The target azimuth determination module 450 is configured to obtain the target azimuth corresponding to the sound source of the target audio signal based on the target azimuth identification value.
可见,本发明实施例所提供的方案中,首先获得麦克风阵列中各个麦克风采集的目标音频信号,对各个麦克风采集的目标音频信号进行分帧处理,并根据分帧结果,确定各个麦克风所对应的目标音频帧,然后计算目标音频帧所对应的目标时延向量,将目标延时向量输入至预先训练完成的目标机器学习模型,得到目标方位角标识值,最后基于目标方位角标识值,得到目标音频信号的声源所对应的目标方位角。由于目标机器学习模型是以实际应用场景中所采集到的音频帧样本对应的时延向量样本作为输入内容,且以音频信号样本对应的方位角标识值作为输出内容所训练得到的机器学习模型,所以即使在存在噪声等环境因素影响及时延的计算不够精确的情况下,也能够准确确定声源的方位角。It can be seen that, in the solution provided by the embodiments of the present invention, the target audio signals collected by the microphones in the microphone array are first obtained; the target audio signal collected by each microphone is divided into frames, and the target audio frame corresponding to each microphone is determined from the framing result; the target delay vector corresponding to the target audio frames is then calculated and input into the pre-trained target machine learning model to obtain the target azimuth identification value; finally, the target azimuth corresponding to the sound source of the target audio signal is obtained based on the target azimuth identification value. Since the target machine learning model is trained with the delay vector samples corresponding to audio frame samples collected in the actual application scenario as input and the azimuth identification values corresponding to the audio signal samples as output, the azimuth of the sound source can be determined accurately even when environmental factors such as noise make the delay calculation imprecise.
作为本发明实施例的一种实施方式,所述目标时延向量计算模块430可以包括:As an implementation manner of the embodiment of the present invention, the target delay vector calculation module 430 may include:
第一互相关单元(图中未示出),用于对所述目标音频帧进行两两互相关处理,得到所述目标时延向量。The first cross-correlation unit (not shown in the figure) is configured to perform pairwise cross-correlation processing on the target audio frames to obtain the target delay vector.
作为本发明实施例的一种实施方式,所述目标时延向量计算模块430可以包括:As an implementation manner of the embodiment of the present invention, the target delay vector calculation module 430 may include:
转换单元(图中未示出),用于对所述目标音频帧进行上采样处理,并将上采样处理后的音频帧转换为频域信号帧;A conversion unit (not shown in the figure), configured to perform upsampling processing on the target audio frame, and convert the upsampled audio frame into a frequency domain signal frame;
第二互相关单元(图中未示出),用于对所述频域信号帧进行两两互相关处理,得到所述目标时延向量。The second cross-correlation unit (not shown in the figure) is configured to perform pairwise cross-correlation processing on the frequency-domain signal frames to obtain the target delay vector.
作为本发明实施例的一种实施方式,所述模型训练模块可以包括:As an implementation manner of an embodiment of the present invention, the model training module may include:
构建单元(图中未示出),用于构建初始机器学习模型;A construction unit (not shown in the figure), used to construct an initial machine learning model;
方位角标识值确定单元(图中未示出),用于确定用于模型训练的多个预设方位角的方位角标识值;Azimuth identification value determination unit (not shown in the figure), for determining the azimuth identification value of a plurality of preset azimuths for model training;
音频帧样本获得单元(图中未示出),用于获得所述麦克风阵列中各个麦克风采集的多个音频信号样本,并对每一个音频信号样本进行分帧处理,得到多个音频帧样本,其中,所述音频信号样本为:所对应声源的方位角为所述预设方位角的音频信号;An audio frame sample obtaining unit (not shown in the figure), configured to obtain a plurality of audio signal samples collected by each microphone in the microphone array, and perform frame processing on each audio signal sample to obtain a plurality of audio frame samples, Wherein, the audio signal sample is: an audio signal whose azimuth angle of the corresponding sound source is the preset azimuth angle;
时延向量样本计算单元(图中未示出),用于计算每一个音频帧样本所对应的时延向量样本;A delay vector sample calculation unit (not shown in the figure), used to calculate the delay vector sample corresponding to each audio frame sample;
样本训练单元(图中未示出),用于将各时延向量样本输入所述初始机器学习模型,并利用所述预设方位角对应的方位角标识值对相应时延向量样本进行训练;A sample training unit (not shown in the figure), configured to input each time delay vector sample into the initial machine learning model, and use the azimuth identification value corresponding to the preset azimuth to train the corresponding time delay vector sample;
目标机器学习模型获得单元(图中未示出),用于当各时延向量样本与训练得到的时延向量的均方差值均小于预设值时,完成训练,得到所述目标机器学习模型。A target machine learning model acquisition unit (not shown in the figure), used to complete the training when the mean square difference between each time delay vector sample and the time delay vector obtained through training is less than a preset value, and obtain the target machine learning model Model.
作为本发明实施例的一种实施方式,所述方位角标识值确定单元可以包括:As an implementation manner of the embodiment of the present invention, the azimuth identification value determination unit may include:
第一标识值确定单元(图中未示出),用于分别将用于训练的每个预设方位角本身作为自身的方位角标识值;The first identification value determination unit (not shown in the figure) is used to respectively use each preset azimuth angle itself for training as its own azimuth identification value;
所述目标方位角确定模块450可以包括:The target azimuth determination module 450 may include:
第一目标方位角确定单元(图中未示出),用于将所述目标方位角标识值作为所述目标音频信号的声源所对应的目标方位角。A first target azimuth determination unit (not shown in the figure) is configured to use the target azimuth identification value as the target azimuth corresponding to the sound source of the target audio signal.
作为本发明实施例的一种实施方式,所述方位角标识值确定单元可以包括:As an implementation manner of the embodiment of the present invention, the azimuth identification value determining unit may include:
第二标识值确定单元(图中未示出),用于根据预设的编码规则对用于模型训练的多个预设方位角进行编码,获得各个预设方位角对应的二进制数组;The second identification value determination unit (not shown in the figure) is used to encode a plurality of preset azimuths used for model training according to preset coding rules to obtain a binary array corresponding to each preset azimuth;
所述目标方位角确定模块450可以包括:The target azimuth determination module 450 may include:
解码单元(图中未示出),用于按照与所述编码规则对应的解码规则,对所述目标方位角标识值进行解码,得到解码结果;A decoding unit (not shown in the figure), configured to decode the target azimuth identification value according to a decoding rule corresponding to the encoding rule to obtain a decoding result;
第二目标方位角确定单元(图中未示出),用于将所述解码结果作为所述目标音频信号的声源所对应的目标方位角。The second target azimuth determination unit (not shown in the figure) is configured to use the decoding result as the target azimuth corresponding to the sound source of the target audio signal.
作为本发明实施例的一种实施方式,所述目标机器学习模型的特征匹配公式为:As an implementation manner of an embodiment of the present invention, the feature matching formula of the target machine learning model is:
y(p)=sgn(Σk αk(p)K(Γk,Γ)+b(p)),k=1、2…N
其中,y(p)为所述目标二进制数组中的第p位二进制数,p=1、2…n,n=[log2N],N为所述预设方位角的数量,αk(p)及b(p)为针对第k个预设方位角预先训练得到的所述第p位二进制数所对应的参数,K(Γk,Γ)为所述目标机器学习模型的核函数,Γk为预先训练得到的第k个预设方位角所对应的时延向量,M为所述麦克风阵列中麦克风的数量,Γ=[τ12,…,τ1M,τ23,…,τ2M,…,τ(M-1)M]T为所述目标时延向量。Wherein, y(p) is the p-th binary number in the target binary array, p=1, 2…n, n=[log2N], N is the number of preset azimuths, k=1, 2…N, αk(p) and b(p) are the parameters corresponding to the p-th binary bit pre-trained for the k-th preset azimuth, K(Γk,Γ) is the kernel function of the target machine learning model, Γk is the pre-trained delay vector corresponding to the k-th preset azimuth, M is the number of microphones in the microphone array, and Γ=[τ12,…,τ1M,τ23,…,τ2M,…,τ(M-1)M]T is the target delay vector.
需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should be noted that, in this document, relational terms such as first and second are only used to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device comprising said element.
本说明书中的各个实施例均采用相关的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于装置实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。Each embodiment in this specification is described in a related manner, the same and similar parts of each embodiment can be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, as for the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for relevant parts, please refer to part of the description of the method embodiment.
以上所述仅为本发明的较佳实施例而已,并非用于限定本发明的保护范围。凡在本发明的精神和原则之内所作的任何修改、等同替换、改进等,均包含在本发明的保护范围内。The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention are included in the protection scope of the present invention.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201611154066.1ACN108231085A (en) | 2016-12-14 | 2016-12-14 | A kind of sound localization method and device |
| Publication Number | Publication Date |
|---|---|
| CN108231085Atrue CN108231085A (en) | 2018-06-29 |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101295016A (en)* | 2008-06-13 | 2008-10-29 | 河北工业大学 | A sound source autonomous search and location method |
| CN102347868A (en)* | 2010-07-29 | 2012-02-08 | 普天信息技术研究院有限公司 | Relative time delay measurement method for broadband orthogonal frequency division multiplexing (OFDM) system |
| US8238563B2 (en)* | 2008-03-20 | 2012-08-07 | University of Surrey-H4 | System, devices and methods for predicting the perceived spatial quality of sound processing and reproducing equipment |
| US8988970B2 (en)* | 2010-03-12 | 2015-03-24 | University Of Maryland | Method and system for dereverberation of signals propagating in reverberative environments |
| US9404899B1 (en)* | 2011-03-14 | 2016-08-02 | Raytheon Company | Methods and apparatus for acoustic inspection of containers |
| US9432768B1 (en)* | 2014-03-28 | 2016-08-30 | Amazon Technologies, Inc. | Beam forming for a wearable computer |
| Title |
|---|
| Huawei Chen et al., "Acoustic Source Localization Using LS-SVMs Without Calibration of Microphone Arrays", 2009 IEEE International Symposium on Circuits and Systems* |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109298642B (en)* | 2018-09-20 | 2021-08-27 | 三星电子(中国)研发中心 | Method and device for monitoring by adopting intelligent sound box |
| CN113631942B (en)* | 2020-02-24 | 2024-04-16 | 京东方科技集团股份有限公司 | Sound source tracking control method and control device, sound source tracking system |
| CN119199741B (en)* | 2024-11-29 | 2025-05-13 | 科大讯飞股份有限公司 | Sound source positioning method, related device, equipment and storage medium |
| Publication | Title | Publication Date |
|---|---|---|
| CN108231085A (en) | A kind of sound localization method and device | |
| JP7434137B2 (en) | Speech recognition method, device, equipment and computer readable storage medium | |
| CN113466793B (en) | Sound source positioning method and device based on microphone array and storage medium | |
| Majumder et al. | Few-shot audio-visual learning of environment acoustics | |
| Jekateryńczuk et al. | A survey of sound source localization and detection methods and their applications | |
| CN107393526B (en) | Voice silence detection method, device, computer equipment and storage medium | |
| JP2021515277A (en) | Audio signal processing system and how to convert the input audio signal | |
| CN109712611A (en) | Conjunctive model training method and system | |
| CN110610718B (en) | Method and device for extracting expected sound source voice signal | |
| CN103258533B (en) | Novel model domain compensation method in remote voice recognition | |
| CN112786028B (en) | Acoustic model processing method, apparatus, device and readable storage medium | |
| CN104599679A (en) | Speech signal based focus covariance matrix construction method and device | |
| CN114283833A (en) | Speech enhancement model training method, speech enhancement method, related equipment and medium | |
| Alexandridis et al. | Multiple sound source location estimation and counting in a wireless acoustic sensor network | |
| CN112447183A (en) | Training method and device for audio processing model, audio denoising method and device, and electronic equipment | |
| CN105607042A (en) | Method for locating sound source through microphone array time delay estimation | |
| CN117437930A (en) | Processing method, device, equipment and storage medium for multichannel voice signal | |
| WO2018058989A1 (en) | Audio signal reconstruction method and device | |
| EP3695403A1 (en) | Joint wideband source localization and acquisition based on a grid-shift approach | |
| CN104665875A (en) | Ultrasonic Doppler envelope and heart rate detection method | |
| KR20170120645A (en) | Method and device for determining interchannel time difference parameter | |
| CN112346013B (en) | A binaural sound source localization method based on deep learning | |
| Diaz-Guerra et al. | Direction of arrival estimation with microphone arrays using SRP-PHAT and neural networks | |
| Sakavičius et al. | Multiple Sound Source Localization in Three Dimensions Using Convolutional Neural Networks and Clustering Based Post-Processing | |
| Dang et al. | Multiple sound source localization based on a multi-dimensional assignment model |
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 2018-06-29 |