Technical Field
The present disclosure relates generally to a method for classifying the open or enclosed spatial environment of a mobile device or wearable device, and in particular to a method for performing the classification using passively recorded sound.
Background
Modern consumer electronic devices such as mobile phones, wearable devices, and personal digital assistants are commonly equipped with built-in high-fidelity digital microphones or microphone arrays that capture sound for communication or voice commands. These devices are also typically equipped with processors capable of performing complex computations. This allows the device to perform computationally intensive operations on sound digitally recorded with the microphone or microphone array and to extract information from the sound recordings.
Summary
A method and device are provided for classifying the space surrounding a device as an open spatial environment or an enclosed spatial environment. The device may be a mobile device, a wearable device, or the like. The device captures sound signals from the environment using a microphone or a microphone array, without actively emitting any known signal through a loudspeaker. The device estimates a spatial environment impulse response (SEIR) from the passively recorded ambient sound present in the spatial environment, thereby forgoing active audio transmission.

The device extracts features from the SEIR and augments them with additional features, such as Mel-frequency cepstral coefficients (MFCCs), delta MFCCs, and double-delta MFCCs of the sound signal. Different frame sizes of the digitized microphone signal are used to extract the SEIR-derived features and the MFCC, delta MFCC, and double-delta MFCC features. The device concatenates these features and provides them to a pattern classifier (e.g., a deep learning classifier) to classify the spatial environment as open or enclosed.
Brief Description of the Drawings
Figure 1 shows a block diagram of a device for classifying a spatial environment.
Figure 2 shows a flowchart of a method for spatial environment classification.
Figure 3 shows an example spatial environment impulse response (SEIR) envelope for an open space and an example SEIR envelope for an enclosed space.
Figure 4 illustrates details of the SEIR envelope for an enclosed space.
Figure 5 illustrates a technique for SEIR estimation and feature extraction for open and enclosed spatial environments using ambient sound present in the spatial environment.
Figure 6 shows a flowchart of a method for estimating the SEIR from ambient sound of a spatial environment.
Figure 7 shows a flowchart of a method for generating a composite feature vector by augmenting feature vectors derived from signal windows.
Figures 8A and 8B show a flowchart of a method for concatenating feature vectors of different dimensions to form a composite feature vector.
Figure 9 shows an example of the test accuracy of a DNN classifier.
Detailed Description
Provided herein are techniques for identifying the spatial environment around a device as open or enclosed without explicit input from the user. Accurate classification of the spatial environment of a mobile or wearable device is a useful contextual input for a variety of context-aware applications.
Attempts have been made to classify the open or enclosed spatial environment of a mobile or wearable device user using various other sensors, such as the Global Positioning System (GPS), indoor positioning systems (IPS), Wi-Fi, radio frequency (RF) ranging, mobile networks, radio access networks (RAN), cameras, loudspeakers, and microphones. These techniques, however, have their own limitations. For example, Wi-Fi infrastructure and mobile network coverage are not universally available, and GPS signals can be ambiguous because detection accuracy depends on location and signal strength. In addition, IPS, Wi-Fi, and RAN-based classification require separate hardware.

Classifying the spatial environment with a camera depends on the ambient lighting, increases power consumption, and may raise privacy concerns. Classification using active test signals relies on echoes of an actively emitted signal reflected by objects in the environment, which introduces noise into the environment.
To classify the spatial environment of the device, a spatial environment impulse response (SEIR) is estimated from the ambient sound signals received by a microphone or microphone array, without explicitly outputting a known test signal. The device extracts novel features from the SEIR and may augment them with other features, such as Mel-frequency cepstral coefficients (MFCCs), delta MFCCs, and double-delta MFCCs derived from the microphone signal. The features are input to a pattern classifier, such as a deep learning architecture, which classifies the spatial environment as open or enclosed.
Figure 1 shows a block diagram of a device 100 for classifying a spatial environment. The device 100 may be a mobile device, a wearable device, or the like, such as a smartphone, a smart watch, a personal digital assistant (PDA), or a portable audio or voice recorder. The device 100 includes one or more microphones 102, a processor 104, a memory 106, an output device 108, and a communication device 110.
The device 100 determines whether its surroundings are an open spatial environment or an enclosed spatial environment. As described herein, the device 100 can make this determination without actively emitting an audio signal into the environment.

The one or more microphones 102 may be a single microphone or multiple microphones spaced apart from one another with an inter-microphone spacing. Multiple microphones may be arranged in any geometry, such as linear, planar, or cubic, and may be spaced equidistantly or non-equidistantly. The one or more microphones 102 may capture audio (e.g., raw audio) directionally or omnidirectionally in their vicinity and output data representing the captured audio to the processor 104. The one or more microphones 102 may have a directivity, sensitivity, signal-to-noise ratio (SNR) response, or frequency response sufficient to capture audio usable for identifying the type of spatial environment.
The processor 104 may be any type of device configured to execute the executable instructions stored in the memory 106. When executed by the processor 104, the executable instructions cause the processor 104 to perform the functions or techniques described herein. The processor 104 may be a controller, microcontroller, or microprocessor, among others, and may include an arithmetic logic unit (ALU) and other computational units. The processor 104 may be an embedded system-on-chip (SoC) and may include a central processing unit (CPU), a graphics processing unit (GPU), or the like. The processor 104 may perform the numerical computations used to classify the open or enclosed spatial environment of the device 100 or its user. The processor 104 receives data representing the captured audio from the one or more microphones 102, processes the data, performs algorithmic computations on it, and classifies the spatial environment of the device 100 as open or enclosed. In one embodiment, the processor 104 may send the data representing the captured raw audio to another device or processor that performs the techniques described herein.
The processor 104 may initially preprocess the data and then perform windowing and/or framing on the preprocessed data. The frame size may be selected according to the features to be derived from the preprocessed data. The processor 104 then estimates the spatial environment impulse response (SEIR) of the spatial environment and derives features from it. The processor 104 may augment the SEIR-derived features with other features to form a composite feature vector.

The processor 104 may then perform spatial environment classification based on the composite feature vector or its features. The processor 104 obtains a supervised classification of the open or enclosed spatial environment of the device 100 using trained model parameters that are known in advance. For example, a pattern library containing the model parameters may be stored in the memory 106 or on another device, such as a server. The device 100 may communicate with the server using the communication device 110 and obtain the model parameters from the server. Alternatively, the device 100 may store the model parameters, which may be factory settings, on external or expandable memory. After performing the spatial environment classification, the processor 104 may post-process the classification output.
The memory 106 may be any non-transitory computer-readable storage medium. The memory 106 may be configured to store executable instructions that, when executed by the processor 104, cause the processor 104 to perform the operations, methods, or techniques described herein. The executable instructions may be a computer program or code. The memory 106 may include random-access memory (RAM) and/or read-only memory (ROM). The memory 106 may store executable instructions that cause the processor 104 to: receive data representing the captured audio from the one or more microphones 102, preprocess the data, perform windowing and/or framing on the preprocessed data, estimate the SEIR of the spatial environment, derive features from the SEIR, augment the SEIR-derived features with other features, perform spatial environment classification, and post-process the classification output.

The processor 104 may store the spatial environment classification, transmit it to another device using the communication device 110, or output it to the user. For example, the processor 104 may store or output the classification for use by context-aware applications running on the device.
The output device 108 may be any type of device configured to output data to a user, such as a display or a loudspeaker. The output device 108 may output information such as the result of the spatial environment classification to the user.

The communication device 110 may be any type of device operable to communicate with another device, such as a transmitter, receiver, transceiver, or modem. The communication device 110 may be configured to communicate using any type of communication protocol, such as a cellular protocol like Long Term Evolution (LTE) or a wireless protocol like an Institute of Electrical and Electronics Engineers (IEEE) 802 protocol. The device 100 may communicate with a server through the communication device 110.
Figure 2 shows a flowchart of a method 200 for spatial environment classification. As described herein, the method 200 may be used to determine whether the device 100 is in an open or enclosed spatial environment. The method relies on microphone audio capture and deep learning. In the method 200, at 202 the device 100 obtains data representing the audio captured by the one or more microphones 102; the data may be a function of time. At 204, the device 100 (or its processor 104) preprocesses the data. Preprocessing may include filtering the data for signal enhancement and downsampling the data (or the signal it represents).
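The description does not fix the preprocessing parameters beyond filtering and downsampling; a minimal sketch of one possible implementation, assuming NumPy/SciPy, an illustrative filter order, and the 16 kHz working rate used later in this description, is:

```python
import numpy as np
from math import gcd
from scipy.signal import butter, sosfiltfilt, resample_poly

def preprocess(audio: np.ndarray, fs_in: int, fs_out: int = 16_000) -> np.ndarray:
    """Filter and downsample the raw microphone signal.

    The 8th-order low-pass and the 16 kHz target rate are illustrative;
    the description only states that filtering and downsampling are applied.
    """
    # Band-limit below the target Nyquist rate before decimation.
    sos = butter(8, 0.45 * fs_out, btype="low", fs=fs_in, output="sos")
    filtered = sosfiltfilt(sos, audio)
    g = gcd(fs_in, fs_out)
    return resample_poly(filtered, fs_out // g, fs_in // g)
```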
At 206, the device 100 performs time windowing and/or framing on the data. At 208, the device 100 forms a composite feature vector by extracting features from the SEIR and augmenting them with other features. The device 100 may augment the SEIR features with Mel-frequency cepstral coefficients (MFCCs), delta MFCCs, or double-delta MFCCs to form the composite feature vector. At 210, the device 100 performs pattern classification on the composite feature vector. The pattern classification may be deep-learning-based and may be supervised; the device 100 may therefore perform the classification using a pattern library containing model parameters.

The model parameters may be available a priori and may be trained on a database of observations. The observations may span a wide range of conditions to facilitate classification. For example, for open spatial environments the database may include model parameters for beaches, stadiums, streets, and/or nature, and for enclosed environments it may include model parameters for shopping malls, offices, and/or homes. As described herein, the pattern library containing the model parameters may be stored on a server or on the device 100. Before performing pattern classification on the composite feature vector, the device may access the pattern library to obtain the model parameters for the various open and enclosed spatial environments. The device 100 then performs pattern classification based on the composite feature vector and the pattern library.

After performing pattern classification, at 212 the device 100 post-processes the classification result. Post-processing may include median filtering the classifier output. At 214, the device 100 outputs an open spatial environment classification or an enclosed spatial environment classification.
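As one possible reading of the post-processing at 212, the median filtering of the classifier output might look like the following sketch, assuming a sequence of per-window binary decisions and an illustrative kernel length not specified in the description:

```python
import numpy as np
from scipy.signal import medfilt

def smooth_decisions(labels, kernel_size=5):
    """Median-filter a sequence of per-window decisions (0 = open, 1 = enclosed).

    kernel_size is an illustrative odd length; the text only states that the
    classifier output is median filtered.
    """
    return medfilt(np.asarray(labels, dtype=float), kernel_size).astype(int)
```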
Figure 3 shows an example SEIR envelope 302 for an open space and an example SEIR envelope 304 for an enclosed space. The SEIR envelopes 302, 304 can serve as signatures representing the temporal envelopes of open and enclosed spatial environments, respectively, and have distinct characteristics. The SEIR envelope 304 of the enclosed spatial environment exhibits multiple reflections and reverberation, whereas the SEIR envelope 302 of the open spatial environment contains the signature of sound arriving directly at the microphone from the source without subsequent reflections or reverberation. These differing characteristics of the SEIR envelope are used to classify the spatial environment of the device 100 as open or enclosed.

Figure 4 illustrates details of the SEIR envelope 304 of the enclosed space. A sound signal (or impulse) is initially produced at a first time instance 402. After a propagation delay, which represents the time the sound signal takes to travel from the source to the one or more microphones 102, the sound signal reaches the one or more microphones 102 at a second time instance 404. The direct, non-reverberant arrival of the sound at the one or more microphones 102 produces the largest peak of the SEIR envelope. The SEIR amplitude then decays until a third time instance 406. After the third time instance 406, reverberation, consisting of dense late reflections of the sound signal, arrives at the one or more microphones 102 and accumulates to a local maximum. The reverberation continues with decreasing amplitude over time until a fourth time instance 408. The reverberation decay is associated with a decay slope that can be used as a representative feature of the SEIR. After the fourth time instance 408, the SEIR envelope 304 exhibits the noise floor.

The SEIR envelope of an enclosed spatial environment is characterized by reverberation, which may result from reflections off walls or other structures. Reverberation uniquely identifies the SEIR envelope of an enclosed spatial environment and is generally absent from the SEIR envelope of an open spatial environment, because open spatial environments have fewer sound-reflecting structures.
To measure the impulse response of an acoustic system, a known input test signal can be transmitted and the system output measured. The system output can be deconvolved with respect to the input test signal to obtain the impulse response. The input (or excitation) signal can be chosen appropriately, and the deconvolution may be linear or circular.

Described herein are techniques for passively extracting the SEIR from recorded ambient sound signals. In contrast to reflections of an emitted excitation signal, ambient sound signals are generated naturally in the environment. Blind deconvolution is used herein to estimate the SEIR envelope of the spatial environment.
Figure 5 illustrates a technique for SEIR estimation and feature extraction for open and enclosed spatial environments using ambient sound present in the spatial environment. At separate times, the device 100 may be located in an enclosed spatial environment 502 and in an open spatial environment 504. At 512, the device 100 measures the sound signals in the enclosed spatial environment 502 and the open spatial environment 504 using the one or more microphones 102 and stores the sound signals. At 514, the device 100 performs blind deconvolution on the audio signal. At 516, the device 100 obtains the SEIR for the open or enclosed spatial environment. At 518, the device 100 extracts features from the SEIR and classifies the environment as open or enclosed.
Figure 6 shows a flowchart of a method for estimating the spatial environment impulse response (SEIR) from ambient sound of the spatial environment. At 602, the one or more microphones 102 receive the ambient sound signal of the spatial environment and may output data representing it to the processor 104. At 604, the processor 104 divides the ambient sound signal into frames of a first duration (denoted 't1'). The frames may overlap one another by a first overlap duration (denoted 'Δt1'). At 606, the processor 104 determines an energy ratio for each frame. The energy ratio may be computed as the ratio between the energy of the frame and the energy of the preceding frame, where the preceding frame may immediately precede the frame.

At 608, the processor 104 selects frames whose energy ratios satisfy an energy criterion. For example, the processor 104 may select frames whose energy ratio exceeds a threshold. Reverberant tails produced by excitation before the start of a frame may leave a residue in subsequent frames, so it is desirable to select frames with relatively high energy. For example, the processor 104 may select frames in the upper 25th percentile of the distribution of frame energy ratios.
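A minimal sketch of the framing and energy-ratio selection at 604-608, assuming NumPy and the 500 ms frame / 90% overlap parameters given later in this description, with the 75th percentile standing in for the "upper 25th percentile" criterion:

```python
import numpy as np

def select_high_energy_frames(x, fs=16_000, frame_dur=0.5, overlap=0.9):
    """Split x into overlapping frames and keep those whose energy ratio
    (frame energy / previous frame energy) lies in the top quartile."""
    frame_len = int(frame_dur * fs)
    hop = int(frame_len * (1.0 - overlap))
    starts = range(0, len(x) - frame_len + 1, hop)
    frames = np.stack([x[s:s + frame_len] for s in starts])
    energy = np.sum(frames ** 2, axis=1)
    # Energy ratio of each frame relative to the immediately preceding frame.
    ratio = energy[1:] / (energy[:-1] + 1e-12)
    threshold = np.percentile(ratio, 75)          # "upper 25th percentile" criterion
    keep = np.where(ratio >= threshold)[0] + 1    # +1: ratios start at frame index 1
    return frames[keep]
```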
At 610, the processor 104 performs exponential windowing on the selected frames. Exponential windowing moves all poles and zeros of a frame inside the unit circle of the z-plane. The cepstrum computation generally requires a minimum-phase signal, and because a spatial environment impulse response is typically mixed-phase, with some zeros inside the unit circle and others outside it, the windowing is performed to move all poles and zeros of the frame inside the unit circle. A minimum-phase signal is advantageous because it has a well-defined linear phase and therefore does not require phase unwrapping.

After the exponential windowing, at 612, the processor 104 determines the cepstrum of each selected frame. The cepstrum of a frame (denoted 'c(n)') is determined as:
c(n) = IDFT(log(DFT(y(n)))),    Equation (1)
where y(n) denotes the frame, DFT denotes the discrete Fourier transform operation, log denotes the logarithm, and IDFT denotes the inverse discrete Fourier transform operation.
At 614, the processor 104 determines the average cepstrum of the selected frames. Averaging the cepstra reduces the influence of the background cepstrum level of the frames. The processor 104 may determine the average cepstrum over a second duration (denoted 't2'). At 616, the processor 104 obtains the inverse-cepstrum time-domain signal. The processor 104 may obtain the inverse cepstrum as:
h(n) = IDFT(exp(DFT(c(n)))),    Equation (2)
where exp denotes the exponential operation.
After the cepstrum operations, at 618, the processor 104 performs inverse exponential windowing to move the poles and zeros back to their corresponding locations. Performing the inverse exponential windowing may include multiplying each frame by the inverse of the decaying exponential window, which does not introduce distortion into the convolution relationship. The SEIR (h(n)) is thus obtained over the second duration.
In one embodiment, the first duration of the frame size may be 500 milliseconds (ms), and the first overlap duration (Δt1) may correspond to 90% overlap. The sampling frequency may be set to 16 kilohertz (kHz). The ambient sound signal may have a duration of 60 seconds, and the exponential window function may be expressed as:

w(n) = exp(−n/c),    Equation (3)

where c is a constant set to one-fifth of the first frame duration (i.e., 0.1 s for a 500 ms frame). The SEIR may contain information about the energy decay, and the absolute value of the SEIR magnitude may be determined. In addition, the cepstrum may be averaged over the 60-second duration, and the SEIR (h(n)) may likewise be estimated over 60 seconds.
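A minimal sketch of the cepstrum-based blind deconvolution of equations (1)-(3), assuming NumPy and the selected frames produced by the previous step; as described above, it relies on the exponential window rendering each frame effectively minimum phase, so no phase unwrapping is performed:

```python
import numpy as np

def estimate_seir(frames, fs=16_000, frame_dur=0.5):
    """Cepstrum-based blind SEIR estimate from the selected high-energy frames.

    Implements equations (1)-(3): exponential windowing, per-frame cepstrum,
    cepstrum averaging over the recording, inverse cepstrum, and inverse
    exponential windowing.
    """
    n = np.arange(frames.shape[1])
    c = frame_dur / 5.0                         # one-fifth of the frame duration (0.1 s)
    w = np.exp(-n / (c * fs))                   # w(n) = exp(-n/c), equation (3), n in samples
    cepstra = [np.fft.ifft(np.log(np.fft.fft(y * w))) for y in frames]   # equation (1)
    c_avg = np.mean(cepstra, axis=0)            # average cepstrum over the selected frames
    h = np.fft.ifft(np.exp(np.fft.fft(c_avg)))  # equation (2)
    return np.real(h) / w                       # inverse exponential windowing
```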
Figure 7 shows a flowchart of a method for generating a composite feature vector by augmenting feature vectors derived from signal windows. The signal windows may have different durations. At 702, the device 100 captures an audio signal. As described herein, the audio signal may be the ambient sound signal of the spatial environment and may be received by the one or more microphones 102, which output data representing the audio signal to the processor 104. At 704, the processor 104 preprocesses and time-windows the audio signal.

At 706, the processor 104 estimates the SEIR of the audio signal as described herein. The processor 104 may select the frames that satisfy the energy-ratio criterion and may time-window the selected frames to compute the cepstra. The time window may have a first duration (t1) of 500 ms. The processor 104 then averages the cepstra over the second duration to obtain the SEIR.
At 708, the processor 104 extracts a plurality of features from the SEIR. The plurality of features may have a first number (denoted 'N'). The processor 104 extracts the features over a time window having the second duration, obtaining an N-dimensional feature vector for a portion of the audio signal equal to the second duration.

At 710, the processor 104 extracts MFCC-based features from the audio signal. For MFCC-based feature extraction, the audio signal may be time-windowed and framed using a duration different from that used for SEIR feature extraction, namely a third duration (denoted 't3') different from the second duration. Extracting the MFCC-based features may include extracting MFCC, delta MFCC, or double-delta MFCC features. At 712, the processor 104 forms a composite feature vector by augmenting the features extracted from the SEIR over the second duration with the MFCC-based features extracted over the third duration. The processor 104 may concatenate features extracted over different time-window durations to produce the composite feature vector.
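A sketch of the MFCC-based feature extraction at 710, assuming the librosa library (the description does not name one) and the 500 ms / 50% overlap framing described below with respect to Figures 8A and 8B:

```python
import numpy as np
import librosa

def mfcc_features(x, fs=16_000, frame_dur=0.5, overlap=0.5, n_mfcc=13):
    """13 MFCCs plus their delta and double-delta per frame: 39 features per frame."""
    n_fft = int(frame_dur * fs)                  # 500 ms frames
    hop = int(n_fft * (1.0 - overlap))           # 50% overlap between consecutive frames
    mfcc = librosa.feature.mfcc(y=x, sr=fs, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2])      # shape: (39, number_of_frames)
```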
Figures 8A and 8B show a flowchart of a method for concatenating feature vectors of different dimensions to form a composite feature vector. At 802, the device 100 receives an audio signal. At 804, the device 100 or its processor 104 windows the audio signal according to the different durations. As described herein, each windowed audio signal may include overlapping frames; the audio signal used for SEIR feature extraction may be windowed according to the first duration, and the audio signal used for MFCC-based feature extraction may be windowed according to the third duration.

For SEIR feature extraction, the device 100 forms frames of the first duration at 806 and, at 808, estimates the SEIR over the second duration as described herein. The second duration may be 60 seconds, among other durations. The cepstrum-based blind deconvolution used to estimate the SEIR can locate the temporal origins of the impulses, and their relative amplitudes are also preserved. Based on SEIR estimates for different spatial environments, it has been observed that the cepstrum-based blind deconvolution captures the initial strong reflections of the true SEIR up to approximately 100 ms.

In the method 800, it is assumed that a SEIR of 1000 samples, corresponding to 62.5 ms at the 16 kHz sampling rate of the audio signal, is obtained. At 810, the processor 104 extracts from the SEIR features that are useful for classifying the spatial environment of the device 100 as open or enclosed. Before feature extraction, the SEIR may be passed through a moving-average filter of order approximately 10.
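A sketch of the order-10 moving-average smoothing applied to the SEIR magnitude before feature extraction, assuming NumPy:

```python
import numpy as np

def smooth_seir(h, order=10):
    """Pass the SEIR magnitude through a moving-average filter of order ~10."""
    kernel = np.ones(order) / order
    return np.convolve(np.abs(h), kernel, mode="same")
```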
In the example of Figures 8A and 8B, nine features are extracted from the SEIR to form a nine-dimensional vector. At 812, the processor 104 computes the energy of the SEIR magnitude in each of five frequency bands of the SEIR, forming a five-dimensional vector.

At 814, the processor 104 averages a number of maximum indices of the SEIR to produce one feature. The averaged maximum indices may be the indices of the ten largest SEIR magnitudes. At 816, the processor 104 computes the temporal kurtosis of the SEIR to produce one feature. The temporal kurtosis of the SEIR may be obtained as E[(h(n) − μ)^4]/σ^4, where μ is the mean of the SEIR and σ is the standard deviation of the SEIR.
At 818, the processor 104 obtains the spectral standard deviation (SSD) at a center frequency to obtain a one-dimensional SEIR feature. For a 1000-sample SEIR, the center frequency (fc) may be 500 Hz. The processor 104 may determine the SSD as:

SSD_[f1,f2][H(f)] = E_[f1,f2][H^2(f)] − E_[f1,f2]^2[H(f)],    Equation (6)

where H(f) denotes the Fourier transform of the SEIR and E_[f1,f2] denotes the average of its argument over the frequency band ranging from a first frequency (f1) to a second frequency (f2). The first and second frequencies may be set to f1 = fc·2^0.5 and f2 = fc/2^0.5, respectively.

At 820, the processor 104 obtains the slope of the initial SEIR samples (a one-dimensional feature). The processor 104 determines the slope by obtaining a maximum signal value, which may be the maximum amplitude over a short interval of initial samples of the SEIR; for example, the interval of initial samples may span the first 40 to the first 120 samples of the SEIR. The processor 104 may determine the slope as the difference between the maximum signal value and the maximum amplitude over the short interval of initial samples.
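A sketch of the nine-dimensional SEIR feature vector of steps 812-820, assuming NumPy; the equal-width band split and the exact slope interpretation are illustrative interpretations where the text does not fully specify them:

```python
import numpy as np

def seir_features(h, fs=16_000, fc=500.0):
    """Nine SEIR features: five band energies, mean index of the ten largest
    peaks, temporal kurtosis, spectral standard deviation around fc, and the
    slope of the initial samples. Computed on the SEIR magnitude."""
    h = np.abs(h)
    H = np.abs(np.fft.rfft(h))
    freqs = np.fft.rfftfreq(len(h), d=1.0 / fs)

    # (1)-(5): energy of the SEIR spectrum in five bands (equal-width split assumed).
    band_energy = [float(np.sum(band ** 2)) for band in np.array_split(H, 5)]

    # (6): average index of the ten largest SEIR magnitudes.
    mean_peak_index = float(np.mean(np.argsort(h)[-10:]))

    # (7): temporal kurtosis, E[(h(n) - mu)^4] / sigma^4.
    mu, sigma = h.mean(), h.std()
    kurtosis = float(np.mean((h - mu) ** 4) / (sigma ** 4 + 1e-12))

    # (8): spectral standard deviation over [fc / 2^0.5, fc * 2^0.5], equation (6).
    band = H[(freqs >= fc / np.sqrt(2)) & (freqs <= fc * np.sqrt(2))]
    ssd = float(np.mean(band ** 2) - np.mean(band) ** 2)

    # (9): slope feature from the initial samples (interval 40-120 per the example;
    # interpreted as the overall peak minus the peak within that interval).
    slope = float(np.max(h) - np.max(h[40:120]))

    return np.array(band_energy + [mean_peak_index, kurtosis, ssd, slope])
```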
At 822, the processor obtains MFCC features to augment the SEIR features for classification. The MFCC features may include delta MFCC and double-delta MFCC features. Different window sizes may be used for the SEIR features and the MFCC features. For example, the time-windowing duration (the third duration, t3) may be 500 ms, and the MFCCs, delta MFCCs, and double-delta MFCCs may be determined for a frame size of 500 ms with 50% overlap between consecutive frames.

The processor 104 obtains five frames at 824a-824e and, for each frame, obtains 13-dimensional MFCC features, 13-dimensional delta MFCC features, and 13-dimensional double-delta MFCC features at 826aa-826ec, so that 39-dimensional features are obtained from each frame. At 828, the processor 104 generates an MFCC feature vector by concatenating the features from the five consecutive frames to obtain improved classification. At 830, the processor 104 generates the composite feature vector by concatenating the SEIR features (nine features or dimensions) with the MFCC-based features (195 features or dimensions).

It has been found that the MFCC, delta MFCC, and double-delta MFCC features derived from a frame of the sound signal, concatenated with the features of the preceding four frames for a total of 195 features, best enable environment classification.
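A sketch of the concatenation at 828-830, assuming the outputs of the sketches above; the ordering of the concatenated blocks is an illustrative choice:

```python
import numpy as np

def composite_vector(seir_feats, mfcc_frames):
    """Concatenate the nine SEIR features with the MFCC-based features of five
    consecutive frames (5 x 39 = 195) into a 204-dimensional composite vector.

    mfcc_frames: array of shape (39, >= 5), e.g., the current frame and the
    four preceding frames from mfcc_features().
    """
    mfcc_block = mfcc_frames[:, -5:].flatten(order="F")   # frame-by-frame stacking
    return np.concatenate([seir_feats, mfcc_block])       # shape: (204,)
```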
The processor 104 inputs the composite feature vector to a pattern classifier (e.g., a deep learning classifier). The pattern classifier may use a deep neural network (DNN) as the learning architecture to classify the spatial environment as open or enclosed. For example, the DNN may be implemented with five hidden layers of 256 neurons each, trained with an Adam optimizer.
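A sketch of such a classifier, assuming TensorFlow/Keras; the ReLU activations, softmax output, and loss are illustrative choices that the description does not fix:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_classifier(input_dim=204):
    """DNN with five hidden layers of 256 neurons each and an Adam optimizer."""
    model = tf.keras.Sequential(
        [tf.keras.Input(shape=(input_dim,))]
        + [layers.Dense(256, activation="relu") for _ in range(5)]
        + [layers.Dense(2, activation="softmax")]          # open vs. enclosed
    )
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```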
Figure 9 shows an example of the test accuracy of the DNN classifier for various feature vectors input to the trained DNN. The composite feature vector formed from the SEIR-derived features and the MFCC-based features gives the highest accuracy, 99.9%. In comparison, the 9-dimensional SEIR-only vector achieves 78.5% accuracy, the 65-dimensional MFCC vector achieves 79.3% accuracy, and the 195-feature MFCC, delta MFCC, and double-delta MFCC vector achieves 96.3% accuracy.

In one embodiment, cepstral mean subtraction (CMS) may be used to compensate the signal for the different microphone characteristics used in different devices. The average cepstrum of the frames selected by the energy-ratio criterion may be obtained from various ambient sound recordings made on the device's microphone or microphone array. This mean cepstrum represents the characteristics of the microphone and is subtracted from the cepstrum of each input frame of the test signal. The cepstrum obtained after subtracting the cepstral mean is used to compute the MFCC-based features provided as input to the DNN. Applying cepstral mean subtraction to the MFCC-based features improves accuracy, particularly when there is a mismatch between training and testing conditions due to differences in microphone transducer characteristics.
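A sketch of one common form of cepstral mean subtraction, assuming NumPy; note that the description applies the subtraction to the frame cepstra before the MFCC computation, whereas this simplified variant subtracts a per-coefficient mean directly in the MFCC domain:

```python
import numpy as np

def cepstral_mean_subtraction(mfcc, reference_mean=None):
    """Subtract a per-coefficient cepstral mean to compensate for microphone
    characteristics. reference_mean may be precomputed from ambient recordings
    on the target device; otherwise the mean of the current signal is used."""
    if reference_mean is None:
        reference_mean = mfcc.mean(axis=1, keepdims=True)
    return mfcc - reference_mean
```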
In one embodiment, the open or enclosed spatial environment context of the device 100 may be augmented with context derived from other sensors of the device 100, contributing to the user's overall context awareness.

The various embodiments described above can be combined to provide further embodiments.

These and other changes can be made to the embodiments in light of the above detailed description. In general, in the following claims, the terms used should not be construed to limit the claims to the specific embodiments disclosed in the specification and the claims, but should be construed to include all possible embodiments along with the full scope of equivalents to which such claims are entitled. Accordingly, the claims are not limited by the disclosure.