Disclosure of Invention
In order to improve voice enhancement performance, the present application provides a voice enhancement method and a voice enhancement system.
In a first aspect, the present application provides a method for enhancing speech, which adopts the following technical scheme:
A method of speech enhancement, comprising:
constructing a voice enhancement network, wherein the voice enhancement network comprises an encoder, a noise separator and a decoder which are connected in sequence;
obtaining audio data to be processed and inputting the audio data into the voice enhancement network; extracting L time domain features of different dimensions from the audio data through the encoder; extracting each of the L time domain features through the noise separator and mapping the extraction results into a preset feature set; and fusing, through the decoder, the time domain feature with the highest dimension with each extraction result in the corresponding preset feature set to obtain denoised enhanced voice and/or background sound.
By adopting the technical scheme, the encoder accurately extracts L time domain features from the audio data, and these time domain features cover multidimensional information of the audio. The time domain features are then separated and extracted by the noise separator, and the extracted features are mapped into the preset feature set. Finally, the decoder fuses the time domain feature with the highest dimension with the corresponding extraction results in the set to obtain the denoised enhanced voice and/or background sound. The encoder, the noise separator and the decoder thus process the noisy voice effectively, so that in an unstable noise environment the voice enhancement network can adapt quickly to changes in the noise characteristics and perform noise suppression and voice enhancement in real time, improving the stability and performance of voice enhancement.
Optionally, the encoder includes L first convolution modules connected in sequence, each first convolution module including a one-dimensional convolution layer, a normalization layer, an activation function layer, and a downsampling pooling layer;
The audio data is input to the first of the first convolution modules, and the output of each first convolution module serves as the input of the next first convolution module connected to it, so that each first convolution module outputs one time domain feature; the number of channels of the one-dimensional convolution layer in each first convolution module increases sequentially from the input end to the output end of the encoder, so that time domain features of sequentially increasing dimension are obtained;
The output end of each first convolution module is also connected with the noise separator so as to output L time domain features to the noise separator.
By adopting the technical scheme, the one-dimensional convolution layer, the normalization layer, the activation function layer and the downsampling pooling layer arranged in the encoder achieve preliminary feature extraction of the audio data, and the number of channels of each first convolution module increases from the input end to the output end, so that time domain features of different dimensions are extracted and the encoder can capture richer and more diverse feature information from the audio data.
Optionally, the noise separator comprises L separation modules, wherein each separation module comprises a plurality of second convolution modules which are sequentially connected and at least one attention module which is connected with the output end of the last second convolution module, each attention module corresponds to a preset feature set, and the preset feature set comprises a voice feature set and/or a noise feature set;
the input end of the first of the second convolution modules in each separation module is correspondingly connected with the output end of a first convolution module to obtain a time domain feature, and the second convolution modules are used for extracting the time domain feature to obtain an extraction result;
Each attention module maps the extraction result to a corresponding preset feature set;
Wherein each second convolution module comprises a one-dimensional convolution layer, a normalization layer and an activation function.
By adopting the technical scheme, the plurality of second convolution modules respectively process the input time domain features to obtain the extraction result, and the attention module enables the extraction result to be accurately mapped into a preset voice feature set or a noise feature set, so that the effect of separating voice from noise in an unstable noise environment is realized.
Optionally, the attention module includes a self-attention layer or a multi-headed self-attention layer.
By adopting the technical scheme, since the self-attention layer is simpler than the multi-head self-attention layer, the type of attention layer can be selected flexibly according to the available computing resources.
Optionally, the decoder comprises a voice decoder and/or a noise decoder, wherein each of the voice decoder and the noise decoder comprises L third convolution modules connected in sequence and an output module connected to the third convolution module at the tail; each third convolution module is connected to an attention module, and the first third convolution module is further connected to the encoder so as to acquire the time domain feature with the highest dimension;
The first third convolution module adds and fuses the time domain feature with the highest dimension with the extraction result in the corresponding preset feature set to obtain a fusion feature, and outputs the fusion feature to the next third convolution module connected to it;
each of the remaining third convolution modules adds and fuses the fusion feature output by the previous third convolution module with the extraction result in the corresponding preset feature set, and the fusion feature output by the third convolution module at the tail is output to the output module, so that the output module outputs the denoised enhanced voice and/or background sound.
By adopting the technical scheme, the time domain feature with the highest dimension is first obtained from the encoder, and finer feature information is extracted using the attention modules. The first third convolution module adds and fuses the time domain feature with the highest dimension with the extraction result in the corresponding preset feature set to obtain a representative fusion feature. Each subsequent third convolution module then fuses the fusion feature output by the previous module with an extraction result in the preset feature set, further enhancing the feature expression. Finally, the output module converts the fusion features into clear denoised enhanced voice and/or background sound, realizing the decoder's restoration of voice and noise.
Optionally, the third convolution modules include a channel fusion layer, an up-sampling layer, a one-dimensional convolution layer, a normalization layer and an activation function layer which are sequentially connected, the channel number of the one-dimensional convolution layer in each third convolution module decreases from the input end to the output end of the decoder, the output module includes a one-dimensional convolution layer, a normalization layer and an activation function layer, and the channel number of the one-dimensional convolution layer in the output module is 1.
By adopting the technical scheme, the channel fusion layer, the up-sampling layer, the one-dimensional convolution layer, the normalization layer and the activation function layer in each third convolution module are connected in sequence so that feature extraction and dimension transformation can proceed normally. The number of channels of the one-dimensional convolution layer in each third convolution module decreases sequentially from the input end to the output end of the decoder, gradually reducing the feature dimension to meet the requirements of the output module, which finally outputs audio data with the same dimension as the input audio data.
Optionally, the constructing of the voice enhancement network comprises a training step of training a preset neural network model to obtain the voice enhancement network;
the training step comprises the following steps:
acquiring a training data set, wherein the training data set comprises a noisy original voice signal, an ideal noiseless voice signal and ideal noise;
inputting the original voice signal with noise into an encoder, and extracting L time domain features with different dimensions from the audio data through the encoder;
Extracting L time domain features through a noise separator respectively, and mapping the extraction result into a preset feature set, wherein the preset feature set further comprises a noise feature set;
fusing, through a first decoder, the time domain feature with the highest dimension with each extraction result in the voice feature set to obtain actual denoised voice;
fusing, through a second decoder, the time domain feature with the highest dimension with each extraction result in the noise feature set to obtain actual noise;
and iteratively updating model parameters of the preset neural network model according to the comparison result between the actual denoised voice and the ideal noiseless voice signal and the comparison result between the actual noise and the ideal noise, so as to obtain the voice enhancement network.
By adopting the technical scheme, the noisy original voice signal, the ideal noiseless voice signal and the ideal noise are obtained from the training data set, and the noisy original voice signal is input into the encoder, which extracts L time domain features of different dimensions from the audio data. These features are further processed by the noise separator and mapped into the preset feature sets. The first decoder fuses the time domain feature with the highest dimension with the extraction results in the voice feature set to generate the actual denoised voice, and the second decoder generates the actual noise. The actual denoised voice and the actual noise are compared with the corresponding ideal signals, the parameters of the neural network model are updated accordingly, and through repeated iterative updating a voice enhancement network with better performance is obtained.
In a second aspect, the present application provides a speech enhancement system, which adopts the following technical scheme:
A speech enhancement system comprising:
a network construction unit, configured to construct a voice enhancement network, wherein the voice enhancement network comprises an encoder, a noise separator and a decoder which are connected in sequence; and
a voice enhancement unit, configured to input the audio data to be processed into the voice enhancement network, extract L time domain features of different dimensions from the audio data through the encoder, extract each of the L time domain features through the noise separator and map the extraction results into a preset feature set, and fuse, through the decoder, the time domain feature with the highest dimension with each extraction result in the corresponding preset feature set, so as to obtain denoised enhanced voice and/or background sound.
In a third aspect, the present application provides an electronic device, which adopts the following technical scheme:
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements any one of the methods described above.
In a fourth aspect, the present application provides a computer readable storage medium, which adopts the following technical solutions:
a computer readable storage medium having stored thereon a computer program which, when loaded and executed by a processor, implements any one of the methods described above.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The embodiment of the application discloses a voice enhancement method. Referring to fig. 1, a voice enhancement method includes:
step S101, constructing a voice enhancement network, wherein the voice enhancement network comprises an encoder, a noise separator and a decoder which are connected in sequence;
Step S102, obtaining audio data to be processed and inputting the audio data into the voice enhancement network; extracting L time domain features of different dimensions from the audio data through the encoder; extracting each of the L time domain features through the noise separator and mapping the extraction results into a preset feature set; and fusing, through the decoder, the time domain feature with the highest dimension with each extraction result in the corresponding preset feature set to obtain denoised enhanced voice and/or background sound.
It should be appreciated that the speech enhancement network is a time domain analysis network: it involves no frequency domain analysis and requires no Fourier transform. On the one hand this reduces the amount of data to be processed; on the other hand, time domain analysis adapts better to unstable background sounds and therefore enhances speech more effectively.
In the above embodiment, the encoder accurately extracts L time domain features from the audio data, and these time domain features cover multidimensional information of the audio. The time domain features are then separated and extracted by the noise separator, and the extracted features are mapped into the preset feature set. Finally, the decoder fuses the time domain feature with the highest dimension with the corresponding extraction results in the set to obtain the denoised enhanced voice and/or background sound. The encoder, the noise separator and the decoder thus process the noisy voice effectively, so that in an unstable noise environment the voice enhancement network can adapt quickly to changes in the noise characteristics and perform noise suppression and voice enhancement in real time, improving the stability and performance of voice enhancement.
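To make this data flow concrete, the following PyTorch-style sketch wires the three stages together. It is a minimal illustration under assumed interfaces (the encoder yields a list of L features, the separator yields a speech feature set), not the patented implementation itself:

```python
# Minimal sketch of the encoder -> noise separator -> decoder data flow
# described above. All interfaces and names are illustrative assumptions.
import torch.nn as nn

class SpeechEnhancementNet(nn.Module):
    def __init__(self, encoder, separator, speech_decoder):
        super().__init__()
        self.encoder = encoder                # yields L time domain features
        self.separator = separator            # maps them to a speech feature set
        self.speech_decoder = speech_decoder  # a noise decoder works symmetrically

    def forward(self, audio):                 # audio: (batch, 1, samples)
        enc_feats = self.encoder(audio)       # [encv1, ..., encvL], rising dims
        speech_set = self.separator(enc_feats)  # [sep1s, ..., sepLs]
        # the decoder fuses the highest-dimensional feature with the feature set
        return self.speech_decoder(enc_feats[-1], speech_set)
```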
Referring to fig. 2, as an embodiment of the encoder, the encoder includes L sequentially connected first convolution modules, each including a one-dimensional convolution layer, a normalization layer, an activation function layer, and a downsampling pooling layer sequentially connected;
The audio data is input to the first of the first convolution modules, and the output of each first convolution module serves as the input of the next first convolution module connected to it, so that each first convolution module outputs one time domain feature; the number of channels of the one-dimensional convolution layer in each first convolution module increases sequentially from the input end to the output end of the encoder, so that time domain features of sequentially increasing dimension are obtained;
The output end of each first convolution module is also connected with the noise separator so as to output L time domain features to the noise separator.
It should be appreciated that each of the first convolution modules outputs a corresponding one of the time-domain features.
Wherein, the number of channels of the one-dimensional convolution layer in each first convolution module increases sequentially from the input end to the output end of the encoder; that is, the channel numbers of the one-dimensional convolution layers are {N, 2N, ..., L×N}, respectively. In this embodiment, the convolution kernel size of the first convolution modules is k_enc, the step size is s_enc, and the downsampling interval is d_enc. The number of first convolution modules, the convolution kernel size, the step size and the downsampling interval can be set according to the practical situation; for example, L = 6, N = 64, k_enc = 21, s_enc = 1 and d_enc = 2 can be set.
Specifically, in conjunction with fig. 2, L first convolution modules are provided, namely CONVa1, CONVa2, ..., CONVaL. Each first convolution module includes a one-dimensional convolution layer, a normalization layer, an activation function layer and a downsampling pooling layer connected in sequence; that is, the encoder is equipped with one-dimensional convolution layers {Conv1, Conv2, ..., ConvL}, normalization layers {BN1, BN2, ..., BNL}, activation function layers {ReLU1, ReLU2, ..., ReLUL} and downsampling pooling layers {DS1, DS2, ..., DSL}. The audio data is input to the one-dimensional convolution layer of the first first convolution module, passes through the normalization layer and the activation function layer, and the downsampling pooling layer finally outputs the time domain feature encv1; if the number of channels in the one-dimensional convolution layer of the first first convolution module is N, the dimension of the output time domain feature encv1 is N. The time domain feature encv1 is then input into the second first convolution module CONVa2. Similarly, the downsampling pooling layer of the second first convolution module outputs the time domain feature encv2, and since its one-dimensional convolution layer has 2N channels, the dimension of encv2 is 2N. By analogy, the L time domain features output by the encoder are {encv1, encv2, ..., encvL}, with corresponding dimensions {N, 2N, ..., L×N}, thus achieving the effect of the encoder outputting time domain features of different dimensions.
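One possible PyTorch implementation of such an encoder is sketched below, using the example values given above (L = 6, N = 64, k_enc = 21, s_enc = 1, d_enc = 2); the padding scheme and the use of BatchNorm/ReLU/MaxPool for the normalization, activation and pooling layers are assumptions:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """First convolution modules CONVa1..CONVaL; channel counts {N, 2N, ..., L*N}."""
    def __init__(self, L=6, N=64, k_enc=21, s_enc=1, d_enc=2):
        super().__init__()
        self.blocks = nn.ModuleList()
        in_ch = 1
        for i in range(1, L + 1):
            out_ch = i * N                        # channels grow toward the output
            self.blocks.append(nn.Sequential(
                nn.Conv1d(in_ch, out_ch, k_enc, stride=s_enc, padding=k_enc // 2),
                nn.BatchNorm1d(out_ch),           # normalization layer
                nn.ReLU(),                        # activation function layer
                nn.MaxPool1d(d_enc),              # downsampling pooling layer
            ))
            in_ch = out_ch

    def forward(self, x):                         # x: (batch, 1, samples)
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)                       # encv1 ... encvL
        return feats

# e.g. feats = Encoder()(torch.randn(2, 1, 16384))  # 6 tensors, N..6N channels
```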
It should be further understood that the more channels there are, the more time domain features can be extracted, but the encoder then also becomes prone to overfitting. By combining time domain features of different dimensions, richer features are obtained for enhancing the voice while the time domain features with fewer channels are retained, reducing the possibility of overfitting.
In the above embodiment, by arranging the one-dimensional convolution layer, the normalization layer, the activation function layer and the downsampling pooling layer in the encoder, preliminary feature extraction of the audio data is achieved, and the number of channels of each first convolution module increases sequentially from the input end to the output end, so that time domain features of different dimensions are extracted and the encoder can capture richer and more diverse feature information from the audio data.
Referring to fig. 3, as an embodiment of the noise separator, the noise separator includes L separation modules, each of which includes a plurality of second convolution modules connected in sequence and at least one attention module connected to an output end of the last second convolution module;
the input end of the first of the second convolution modules in each separation module is correspondingly connected with the output end of a first convolution module to obtain a time domain feature, and the second convolution modules are used for extracting the time domain feature to obtain an extraction result;
each attention module maps the extraction result to a corresponding preset feature set;
each attention module corresponds to a preset feature set, wherein the preset feature set comprises a voice feature set and/or a noise feature set;
Wherein the attention module comprises a self-attention layer or a multi-head self-attention layer.
Wherein each second convolution module comprises a one-dimensional convolution layer, a normalization layer and an activation function layer; the second convolution modules have no pooling layer, so that the dimensions of the input and the output remain consistent.
It should be appreciated that the number of separation modules should be consistent with the number of first convolution modules in the encoder, and that each separation module corresponds to one first convolution module.
Specifically, based on fig. 3, the separation modules are denoted SEPi, so the L separation modules are {SEP1, SEP2, ..., SEPL}, and each separation module is equipped with M second convolution modules and two attention modules. The M second convolution modules are CONVb1, CONVb2, ..., CONVbM. In this embodiment there are two attention modules, namely the speech attention module SAis and the noise attention module SAin, so the noise separator contains L speech attention modules in total, namely {SA1s, SA2s, ..., SALs}; the preset feature set corresponding to these L speech attention modules is the speech feature set, and the extraction result output by each speech attention module is an element seps of the speech feature set. Therefore, the speech feature set mapped by all speech attention modules in the noise separator is {sep1s, sep2s, ..., sepLs}. Similarly, the extraction result output by each noise attention module is an element sepn of the noise feature set, so the noise feature set mapped by all noise attention modules in the noise separator is {sep1n, sep2n, ..., sepLn}.
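A minimal sketch of one separation module follows, assuming the attention modules are standard scaled dot-product self-attention (realized here via nn.MultiheadAttention) and that M, the kernel size and the head count are free choices:

```python
import torch
import torch.nn as nn

class SeparationModule(nn.Module):
    """One of the L separation modules SEPi: M second convolution modules
    (no pooling, so input and output dimensions stay consistent) plus a speech
    attention module SAis and a noise attention module SAin. M, the kernel
    size and the number of attention heads are assumed values."""
    def __init__(self, channels, M=3, heads=4):
        super().__init__()
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
            ) for _ in range(M)
        ])
        self.speech_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.noise_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, enc_feat):                    # enc_feat: (batch, C, T)
        h = self.convs(enc_feat)
        seq = h.transpose(1, 2)                     # (batch, T, C) for attention
        sep_s, _ = self.speech_attn(seq, seq, seq)  # element of the speech set
        sep_n, _ = self.noise_attn(seq, seq, seq)   # element of the noise set
        return sep_s.transpose(1, 2), sep_n.transpose(1, 2)
```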
Note that the number and types of the attention modules may be set according to actual needs, which is not limited here. For example, a plurality of speech attention modules may be provided to separate different human voices, and a music attention module may also be used to separate background music.
In the above embodiment, the plurality of second convolution modules respectively process the input time domain features to obtain the extraction result, and then the attention module enables the extraction result to be accurately mapped into a preset voice feature set or a noise feature set, thereby realizing the effect of separating voice from noise in an unstable noise environment.
Referring to fig. 4, as an embodiment of the decoder, the decoder includes a speech decoder and/or a noise decoder, where each of the speech decoder and the noise decoder includes L third convolution modules connected in sequence and an output module connected to the third convolution module at the tail; each third convolution module is connected to an attention module, and the first third convolution module is further connected to the encoder to obtain the time domain feature with the highest dimension;
The first third convolution module adds and fuses the time domain feature with the highest dimension with the extraction result in the corresponding preset feature set to obtain a fusion feature, and outputs the fusion feature to the next third convolution module connected to it;
each of the remaining third convolution modules adds and fuses the fusion feature output by the previous third convolution module with the extraction result in the corresponding preset feature set, and the fusion feature output by the third convolution module at the tail is output to the output module, so that the output module outputs the denoised enhanced voice and/or background sound.
Specifically, referring to fig. 4, L third convolution modules are provided, namely CONVc1, CONVc2, ..., CONVcL. The input of the first third convolution module is obtained by adding and fusing the time domain feature encvL output by the L-th first convolution module with the extraction result sepLs in the speech feature set obtained by the mapping of the L-th speech attention module; the input of each of the other third convolution modules, i ∈ [2, L], is obtained by adding and fusing the output of the (i−1)-th third convolution module with the extraction result sep(L+1−i)s in the speech feature set of the (L+1−i)-th layer.
In the above embodiment, the time domain feature with the highest dimension is first obtained from the encoder, and finer feature information is extracted using the attention modules. The first third convolution module adds and fuses the time domain feature with the highest dimension with the extraction result in the corresponding preset feature set to obtain a representative fusion feature. Each subsequent third convolution module then fuses the fusion feature output by the previous module with an extraction result in the preset feature set, further enhancing the feature expression. Finally, the output module converts the fusion features into clear denoised enhanced voice and/or background sound, realizing the decoder's restoration of voice and noise.
As one implementation of the third convolution modules, each third convolution module comprises a channel fusion layer, an up-sampling layer, a one-dimensional convolution layer, a normalization layer and an activation function layer which are connected in sequence; the number of channels of the one-dimensional convolution layer in each third convolution module decreases sequentially from the input end to the output end of the decoder. The output module comprises a one-dimensional convolution layer, a normalization layer and an activation function layer, and the number of channels of the one-dimensional convolution layer in the output module is 1.
Specifically, the channel numbers of the one-dimensional convolution layers in the third convolution modules are {L×N, (L−1)×N, ..., N}; that is, the channel numbers in the third convolution modules decrease in the manner opposite to the way the channel numbers in the first convolution modules increase. The convolution kernel size is k_dec, the step size is s_dec, and the up-sampling doubles the feature length. These parameters can be customized, for example L = 6, N = 64, k_dec = 7 and s_dec = 1.
Specifically, taking the voice decoder as an example, in connection with fig. 4, L third convolution modules are provided, namely CONVc1, CONVc2, ..., CONVcL; correspondingly, the voice decoder is equipped with channel fusion layers, up-sampling layers {US1, US2, ..., USL}, one-dimensional convolution layers {Conv1, Conv2, ..., ConvL}, normalization layers {BN1, BN2, ..., BNL} and activation function layers {ReLU1, ReLU2, ..., ReLUL}. The time domain feature encvL with the highest dimension and the L-th extraction result in the speech feature set are added and fused, and the result is input to the channel fusion layer of the first third convolution module and passes sequentially through the up-sampling layer, the one-dimensional convolution layer, the normalization layer and the activation function layer; the output is then added and fused with the (L−1)-th extraction result in the speech feature set and input into the second third convolution module, and so on, until the output of the tail third convolution module CONVcL is obtained and input into the output module to obtain the denoised enhanced voice.
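The sketch below shows one plausible PyTorch reading of a third convolution module and of the output module. Reading the channel fusion layer as a 1×1 convolution applied after the skip feature and the decoder stream are added, and choosing Tanh as the output activation, are assumptions not fixed by this passage:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One third convolution module CONVci: channel fusion, up-sampling,
    one-dimensional convolution, normalization and activation in sequence."""
    def __init__(self, in_ch, out_ch, k_dec=7, s_dec=1):
        super().__init__()
        self.fuse = nn.Conv1d(in_ch, in_ch, kernel_size=1)   # channel fusion layer
        self.up = nn.Upsample(scale_factor=2)                # undoes d_enc = 2
        self.conv = nn.Conv1d(in_ch, out_ch, k_dec, stride=s_dec, padding=k_dec // 2)
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.ReLU()

    def forward(self, x, skip):                   # both: (batch, in_ch, T)
        h = self.fuse(x + skip)                   # "adding and fusing"
        return self.act(self.bn(self.conv(self.up(h))))

class OutputModule(nn.Module):
    """Output module: a one-dimensional convolution with a single output
    channel, a normalization layer and an activation layer."""
    def __init__(self, in_ch, k=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, 1, k, padding=k // 2),
            nn.BatchNorm1d(1),
            nn.Tanh(),                            # assumed output activation
        )

    def forward(self, x):
        return self.net(x)
```

Note that the shapes are consistent by construction: block i receives a stream of (L+1−i)×N channels at the same temporal length as the skip feature sep(L+1−i)s, because each encoder level halves the length and each decoder block doubles it back.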
In the above embodiment, the channel fusion layer, the up-sampling layer, the one-dimensional convolution layer, the normalization layer and the activation function layer in each third convolution module are connected in sequence, so that feature extraction and dimension transformation can proceed normally. The number of channels of the one-dimensional convolution layer in each third convolution module decreases sequentially from the input end to the output end of the decoder, which gradually reduces the feature dimension to meet the requirements of the output module, so that the output module finally outputs audio data with the same dimension as the input audio data.
Referring to fig. 5, as an embodiment of step S101, constructing the voice enhancement network comprises a training step of training a preset neural network model to obtain the voice enhancement network, where the training step comprises:
Step S1011, acquiring a training data set, wherein the training data set comprises a noisy original voice signal, an ideal noiseless voice signal and ideal noise;
Step S1012, inputting the noisy original voice signal into an encoder, and extracting L time domain features with different dimensions from the audio data through the encoder;
step S1013, extracting L time domain features through a noise separator respectively, and mapping the extraction result into a preset feature set, wherein the preset feature set further comprises a noise feature set;
step S1014, fusing, through a first decoder, the time domain feature with the highest dimension with each extraction result in the voice feature set to obtain actual denoised voice;
Step S1015, fusing, through a second decoder, the time domain feature with the highest dimension with each extraction result in the noise feature set to obtain actual noise;
it should be noted that, step S1014 and step S1015 may be performed in parallel, and the number of the first decoder and the second decoder may be set according to actual situations.
Step S1016, iteratively updating model parameters of the preset neural network model according to the comparison result between the actual denoised voice and the ideal noiseless voice signal and the comparison result between the actual noise and the ideal noise, so as to obtain the voice enhancement network.
In the above embodiment, the noisy original speech signal, the ideal noiseless speech signal and the ideal noise are obtained from the training data set, and the noisy original speech signal is input into the encoder, which extracts L time domain features of different dimensions from the audio data. These features are further processed by the noise separator and mapped into the preset feature sets. The first decoder fuses the time domain feature with the highest dimension with the extraction results in the speech feature set to generate the actual denoised speech, and the second decoder generates the actual noise. The actual denoised speech and the actual noise are compared with the corresponding ideal signals, the parameters of the neural network model are updated accordingly, and through multiple iterations a voice enhancement network with better performance is obtained.
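One training iteration over steps S1012 to S1016 could look as follows; the L1 loss used for the two comparisons is an assumed choice, since this passage does not fix a concrete loss function:

```python
import torch.nn.functional as F

def train_step(model, optimizer, noisy, clean, noise):
    """One iteration of the training step. `model` returns the actual denoised
    speech (first decoder) and the actual noise (second decoder); comparing
    each against its ideal signal drives the parameter update."""
    optimizer.zero_grad()
    est_speech, est_noise = model(noisy)
    loss = F.l1_loss(est_speech, clean) + F.l1_loss(est_noise, noise)
    loss.backward()                     # iterative update of model parameters
    optimizer.step()
    return loss.item()
```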
The embodiment of the application discloses a voice enhancement system, which comprises:
a network construction unit, configured to construct a voice enhancement network, wherein the voice enhancement network comprises an encoder, a noise separator and a decoder which are connected in sequence; and
a voice enhancement unit, configured to input the audio data to be processed into the voice enhancement network, extract L time domain features of different dimensions from the audio data through the encoder, extract each of the L time domain features through the noise separator and map the extraction results into a preset feature set, and fuse, through the decoder, the time domain feature with the highest dimension with each extraction result in the corresponding preset feature set, so as to obtain denoised enhanced voice and/or background sound.
The voice enhancement system provided by the present application can implement the above voice enhancement method, and for the specific working process of the voice enhancement system, reference may be made to the corresponding process in the method embodiments.
In the foregoing embodiments, the descriptions of the embodiments are focused on, and for those portions of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
Based on the same technical concept, the invention also discloses an electronic device, which comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements any one of the methods described above.
The invention also discloses a computer readable storage medium having stored thereon a computer program which can be loaded by a processor to execute any one of the methods described above.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some communication interfaces, devices or units, and may be in electrical, mechanical or other forms.
In addition, functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The foregoing description of the preferred embodiments of the application is not intended to limit the scope of the application. Unless expressly stated otherwise, any feature disclosed in this specification (including the abstract and drawings) may be replaced by an alternative feature serving the same or an equivalent purpose; that is, unless expressly stated otherwise, each feature is only one example of a generic series of equivalent or similar features.