



Technical Field
The present invention belongs to the technical field of speech noise reduction, and in particular relates to a noise reduction method based on an RCED network.
Background Art
In real life, noise is everywhere and completely clean speech is rare, so speech enhancement is used to improve the intelligibility and quality of noisy speech; it is now widely applied in speech recognition, hearing aids and other fields. Common speech enhancement techniques fall into two categories: traditional methods and deep learning-based methods. Traditional methods mainly include spectral subtraction, Wiener filtering, statistical model-based methods and subspace-based methods. Because these methods rest on the assumption that the noise is stationary, they cannot handle non-stationary noise.
Summary of the Invention
The present invention provides a noise reduction method based on an RCED network, aiming to solve the above-mentioned problems.
The present invention is implemented as a noise reduction method based on an RCED network, comprising the following steps:
S1: construct the RCED network;
S2: splice the target enhancement frame with several frames on each side of it, then perform convolution operations through the RCED;
S3: introduce a shortcut mechanism, splicing each encoder output in the RCED with the corresponding decoder output and feeding the result into the next convolutional layer for subsequent operations;
S4: introduce a shortcut mechanism, combining all encoders and all decoders into respective dense blocks and adding shortcut paths between layers.
Further, the RCED comprises a plurality of identical modules A, each module A containing a convolutional layer, a batch normalization layer and a ReLU activation layer.
Further, the RCED also comprises a module B at its end, which contains only a convolutional layer and outputs the enhanced frame.
Further, the RCED splices 7 frames on each side of the target frame.
Further, in step S3, the applied formula is:
x_{decoder+1} = f(x_{decoder}, x_{encoder})
where joining arguments with a comma denotes concatenation along the depth dimension;
f(·) is the combination of convolution, batch normalization and ReLU;
x_{encoder} and x_{decoder} are the symmetric layers of the encoder and the decoder, respectively; after these two layers are spliced, the result is fed into layer (decoder+1).
Further, in step S4, the applied formula is:
x_l = f(x_0, x_1, ..., x_{l-1})
where joining arguments with a comma denotes concatenation along the depth dimension;
f(·) is the combination of convolution, batch normalization and ReLU;
x_t (t = 0, 1, ..., l-1) is the t-th layer in the dense block.
Compared with the prior art, the beneficial effects of the present invention are as follows: the present invention uses an RCED that contains only convolutional layers, discarding the pooling layers and their corresponding upsampling layers, and introduces different shortcut mechanisms on top of it. The resulting models perform well and generalize; they reuse information, so that more useful features can be extracted from less data; they are easy to train, mitigate vanishing gradients and reduce the number of parameters, while overcoming overfitting on small datasets.
Description of the Drawings
Fig. 1 is the first drawing of an embodiment of the present invention;
Fig. 2 is the second drawing of an embodiment of the present invention;
Fig. 3 is the third drawing of an embodiment of the present invention;
Fig. 4 is the fourth drawing of an embodiment of the present invention.
Detailed Description of the Embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
Referring to Figs. 1-2, the present invention provides a technical solution, a noise reduction method based on an RCED network, comprising the following steps:
S1: construct the RCED network;
S2: splice the target enhancement frame with several frames on each side of it, then perform convolution operations through the RCED;
S3: splice each encoder output in the RCED with the corresponding decoder output, and feed the result into the next convolutional layer for subsequent operations;
S4: introduce a shortcut mechanism, combining all encoders and all decoders into respective dense blocks and adding shortcut paths between layers.
In this embodiment, for the SC-FCN, the present invention uses the RCED as the base model and improves upon it.
The RCED is shown in Fig. 1: it consists of multiple identical modules.
Except for the last module, each module contains a convolutional layer, a batch normalization layer and a ReLU activation layer.
The last module contains only a convolutional layer and outputs the enhanced frame. Since the input to a convolutional layer is a two-dimensional map, the RCED splices the target enhancement frame with several frames on each side of it before performing convolution, which also makes effective use of context information.
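The following is a minimal sketch of this context-frame splicing, assuming magnitude spectra with 129 frequency bins; the function name splice_frames and the edge-padding behavior are illustrative assumptions, not the patented implementation.

```python
import numpy as np

def splice_frames(spectrogram: np.ndarray, context: int = 7) -> np.ndarray:
    """Stack each target frame with `context` frames on each side.

    Returns an array of shape (num_frames, 2 * context + 1, num_bins),
    padding at the edges by repeating the boundary frames.
    """
    padded = np.pad(spectrogram, ((context, context), (0, 0)), mode="edge")
    windows = [padded[i : i + 2 * context + 1] for i in range(len(spectrogram))]
    return np.stack(windows)

# Example: a 100-frame noisy magnitude spectrogram with 129 frequency bins.
noisy = np.abs(np.random.randn(100, 129)).astype(np.float32)
inputs = splice_frames(noisy, context=7)  # shape (100, 15, 129)
```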
However, experiments found that splicing a small number of context frames yields a good enhancement effect, but as the number of spliced frames increases, the noise reduction effect degrades noticeably.
This indicates that adding context frames also introduces some redundant information, which interferes with the training of the neural network.
Since PESQ ranges over [-0.5, 4.5] and STOI over [0, 1], models spliced with different numbers of frames are scored by equation (1):
score = PESQ / 5 + STOI    (1)
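As a tiny worked example of equation (1), with made-up metric values:

```python
# PESQ in [-0.5, 4.5] is scaled by 1/5 so both terms are on comparable scales.
pesq, stoi = 2.5, 0.85   # illustrative values, not measured results
score = pesq / 5 + stoi  # 0.5 + 0.85 = 1.35
```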
Referring to Fig. 3, where the vertical axis represents SNR and the horizontal axis represents the score, it can be seen that the performance of the RCED peaks when the number of spliced frames is around 7. Considering both model performance and computational efficiency, the present invention splices 7 frames on each side of the target frame as the base model.
As shown in Fig. 1, each encoder output in the RCED is spliced with the corresponding decoder output and then fed into the next convolutional layer for subsequent operations. Since the RCED contains no pooling layers, no cropping is needed; the feature maps can be spliced directly. This mechanism reuses information, so that more useful features can be extracted from less data. In addition, providing shortcut paths also makes training easier.
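A minimal PyTorch sketch of this encoder-decoder splicing, assuming 1D feature maps of shape (batch, channels, width); the class name SkipConcat and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SkipConcat(nn.Module):
    """Concatenate an encoder output with the matching decoder output along
    the channel (depth) axis, then apply conv + batch norm + ReLU."""

    def __init__(self, enc_ch: int, dec_ch: int, out_ch: int, width: int = 9):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(enc_ch + dec_ch, out_ch, kernel_size=width, padding=width // 2),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(),
        )

    def forward(self, x_decoder: torch.Tensor, x_encoder: torch.Tensor) -> torch.Tensor:
        # No cropping is needed: with no pooling layers, the encoder and
        # decoder feature maps already share the same width.
        return self.block(torch.cat([x_decoder, x_encoder], dim=1))
```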
The dense block from DenseNet is introduced into the DC-FCN; the dense block structure is shown in the solid-line box of Fig. 2. In general, a network containing L convolutional layers has L connections, whereas a dense block containing L convolutional layers has L*(L+1)/2 connections. That is, the input of each convolutional layer is the concatenation of the outputs of all preceding layers, and its own output is likewise concatenated into the inputs of all subsequent layers. In the proposed DC-FCN, all encoders and all decoders are each combined into a dense block.
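A minimal sketch of such a dense block, assuming every layer produces `growth` channels and all feature maps share the same width (no pooling, as in the RCED); the class name and defaults are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenseBlock1d(nn.Module):
    """Each layer receives the concatenation of all previous outputs,
    giving L*(L+1)/2 connections for L layers."""

    def __init__(self, in_ch: int, growth: int, num_layers: int, width: int = 7):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(in_ch + t * growth, growth,
                          kernel_size=width, padding=width // 2),
                nn.BatchNorm1d(growth),
                nn.ReLU(),
            )
            for t in range(num_layers)
        ])

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        features = [x0]
        for layer in self.layers:
            # Splice all previous outputs along the depth (channel) axis.
            features.append(layer(torch.cat(features, dim=1)))
        return features[-1]  # x_l = f(x_0, x_1, ..., x_{l-1})
```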
Specifically, the SC-FCN formula is:
x_{decoder+1} = f(x_{decoder}, x_{encoder}), where joining arguments with a comma denotes concatenation along the depth dimension;
f(·) is the combination of convolution, batch normalization and ReLU;
x_{encoder} and x_{decoder} are the symmetric layers of the encoder and the decoder, respectively; after these two layers are spliced, the result is fed into layer (decoder+1).
Here, the encoder is the left half of the model and the decoder is the right half.
The DC-FCN formula is:
x_l = f(x_0, x_1, ..., x_t, ..., x_{l-1}), where joining arguments with a comma denotes concatenation along the depth dimension;
f(·) is the combination of convolution, batch normalization and ReLU;
x_t (t = 0, 1, ..., l-1) is the t-th layer in the dense block.
Test Example
Referring to Fig. 4, the experiments use TIMIT as the clean speech dataset, and Nonspeech and NOISEX-92 as the noise datasets.
In each training epoch, every utterance in the clean speech training set is taken in turn and mixed at 0 dB with a noise segment randomly selected from the Nonspeech training set. The test set is constructed with the same mixing method.
To evaluate the performance and generalization ability of the model, the present invention tests with known noise and unknown noise at -10, -5, 0, 5 and 10 dB, respectively.
The noise used during training serves as the known noise, and babble, f16 and factory2 from NOISEX-92 serve as the unknown noise.
The clean speech in the test set is mixed with the known noise (seen) and the unknown noise from NOISEX-92 (unseen) at -10, -5, 0, 5 and 10 dB, and the resulting noisy speech is tested.
The speech signals used by the model are downsampled to 8 kHz in advance.
The present invention computes magnitude vectors with an STFT using a 256-sample Hamming window and a frame shift of 128 samples. Since the 256-point STFT magnitude vector is symmetric, only half of it is used, 129 points in total.
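A sketch of this feature extraction, assuming librosa as the STFT tool (the patent does not name a library) and a hypothetical file name:

```python
import librosa
import numpy as np

# 8 kHz audio, 256-point Hamming window, 128-sample frame shift.
audio, sr = librosa.load("noisy_utterance.wav", sr=8000)
stft = librosa.stft(audio, n_fft=256, hop_length=128, window="hamming")
magnitude = np.abs(stft)  # shape (129, num_frames): the symmetric half only
```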
The RCED network used contains 10 convolutional layers. Each of the first 9 layers performs convolution, ReLU activation and batch normalization; the last layer performs only convolution and produces the enhanced frame.
The numbers of convolution kernels are 12-16-20-24-32-24-20-16-12-1. Each layer uses 1D convolution; the length of each filter equals the number of input frames, and the filter widths are 13-11-9-7-7-7-9-11-13-1, respectively.
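A minimal PyTorch sketch of this 10-layer RCED with the stated kernel counts and widths; the skip and dense connections described earlier are omitted here, and the 'same' padding is an assumption.

```python
import torch
import torch.nn as nn

channels = [12, 16, 20, 24, 32, 24, 20, 16, 12, 1]
widths   = [13, 11,  9,  7,  7,  7,  9, 11, 13, 1]

layers = []
in_ch = 15  # the target frame plus 7 spliced frames on each side
for i, (out_ch, w) in enumerate(zip(channels, widths)):
    layers.append(nn.Conv1d(in_ch, out_ch, kernel_size=w, padding=w // 2))
    if i < len(channels) - 1:  # first 9 layers add normalization and activation
        layers += [nn.BatchNorm1d(out_ch), nn.ReLU()]
    in_ch = out_ch
rced = nn.Sequential(*layers)

frames = torch.randn(8, 15, 129)  # (batch, spliced frames, frequency bins)
enhanced = rced(frames)           # (8, 1, 129): one enhanced frame per input
```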
The model is optimized with the Adam optimizer, with the initial learning rate set to 0.00005. If system performance does not improve for 5 consecutive epochs, the learning rate is reset to 0.00001. STOI, PESQ and SSNR are used as evaluation metrics, as shown in the table below:
TABLE I. The PESQ, STOI and SSNR comparison of different models at -5, 0, 5 and 10 dB.
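A sketch of this training configuration in PyTorch, assuming the `rced` model sketched above; a ReduceLROnPlateau scheduler with factor 0.2 reproduces the 5e-5 to 1e-5 drop, though the patent does not specify which scheduler is used.

```python
import torch

optimizer = torch.optim.Adam(rced.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.2, patience=5, min_lr=1e-5
)
# After each epoch, step on a validation metric such as the score of eq. (1):
# scheduler.step(validation_score)
```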
The present invention uses an RCED that contains only convolutional layers, discarding the pooling layers and their corresponding upsampling layers, and introduces different shortcut mechanisms on top of it. The resulting models perform well and generalize; they reuse information, so that more useful features can be extracted from less data; they are easy to train, mitigate vanishing gradients and reduce the number of parameters, while overcoming overfitting on small datasets.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.