Disclosure of Invention
The invention aims to solve the above technical problems by providing a Codec-based method, device, equipment and medium for separating speech from background sound. Taking the Codec technology as the core, the method separates speech from background sound in a representation space, avoiding the computational burden and separation errors caused by direct processing at the waveform or spectrum level, and thereby solving the problem that traditional separation methods struggle to balance separation performance and audio quality under low-bit-rate compression.
In a first aspect, the present invention provides a method for separating speech from background sound based on Codec, including:
A separation model construction process: constructing a Codec-based separation model comprising an encoder, a separation module, two audio representation quantizers and a decoder. The encoder encodes audio into continuous audio representations; the separation module separates the continuous audio representations into a hidden representation containing speech and a hidden representation containing background sound; the two audio representation quantizers respectively discretize the hidden representation containing speech and the hidden representation containing background sound and compress them to a low bit rate level, yielding a compressed discrete speech representation and a compressed discrete background sound representation; and the decoder restores the compressed discrete speech representation and the compressed discrete background sound representation into a separated speech signal and a separated background sound signal, realizing audio separation in the feature representation space;
A model training process: randomly selecting a clean speech sample and superposing background sound on it to generate mixed input audio; performing audio separation with the Codec-based separation model to obtain a speech signal and a background sound signal; mixing the two to obtain a predicted mixed audio; evaluating the post-separation reconstruction quality of the speech and the background sound with a first discrimination module and a second discrimination module respectively, while evaluating the reconstruction quality of the predicted mixed audio with a third discrimination module; thereby training the Codec-based separation model until training is complete;
An audio separation process: separating target audio with the trained Codec-based separation model, and quantitatively evaluating the speech quality of the model's separated output.
Further, the separation module comprises two groups of nonlinear layers, each group comprising a plurality of nonlinear layers of different dimensions.
Further, the process of generating the mixed input audio is specifically: obtaining an audio signal x_s containing one or more voices and an audio signal x_b containing only background sound, and mixing them through volume normalization and signal superposition into an audio signal in which speech and background sound are mixed, i.e. the mixed input audio x_mix, with the formula:
x_s ∈ R^T, x_b ∈ R^T; x_mix = Norm(x_s) + Norm(x_b);
wherein T is the number of sampling points of the audio and Norm(·) denotes volume normalization.
Further, the two audio representation quantizers each adopt an FSQ (finite scalar quantization) quantizer. The quantization process specifically discretizes the hidden representation containing speech and the hidden representation containing background sound respectively and compresses them to a low bit rate level: each scalar contained in the tensor X is converted into the range 0 to L-1, so that the set of all vectors composed of such scalars is finite, with the formula:
X̂ = round((L-1)/2 · (tanh(X) + 1));
wherein round(·) is the rounding function, L is the discretization level, and d is the dimension of each vector in X, so that L^d is the vocabulary size of the discretized vectors.
Further, at an input audio sampling rate of 24 kHz, the quantization levels of the FSQ quantizer are set to [8,8,8,8,5,5,5,5,5,5,5,5,5,5,5,5], compressing the hidden representation to a bit rate level of 3 kbps.
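As a sanity check on the claimed figure, the sketch below computes the bit rate implied by that level vector. The 75 frames-per-second figure is our assumption (a 24 kHz input with a 320x downsampling encoder), not a value stated in this disclosure.

```python
import math

# Each frame of the hidden representation is coded with log2(prod(levels)) bits
# under FSQ with the level vector given above.
levels = [8] * 4 + [5] * 12
bits_per_frame = math.log2(math.prod(levels))       # ~39.86 bits per frame

# Assumed: 24 kHz input, 320x downsampling -> 75 frames per second.
frames_per_second = 24000 / 320
bitrate_bps = bits_per_frame * frames_per_second
print(round(bits_per_frame, 2), round(bitrate_bps))  # ~39.86 bits -> ~2990 bps
```

Under these assumptions the stream lands at roughly 2.99 kbps, i.e. the "3 kbps level" stated above.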
Further, the first discrimination module, the second discrimination module and the third discrimination module each comprise a multi-filter bank discriminator (MFBD) and a multi-scale short-time Fourier transform discriminator (MSTFTD).
Further, the speech quality of the model's separated output is quantitatively evaluated with the Perceptual Evaluation of Speech Quality (PESQ) metric.
In a second aspect, the invention provides a Codec-based speech and background sound separation device comprising a Codec-based separation model. The Codec-based separation model comprises an encoder, a separation module, two audio representation quantizers and a decoder: the encoder encodes audio into continuous audio representations; the separation module separates the continuous audio representations into a hidden representation containing speech and a hidden representation containing background sound; the two audio representation quantizers respectively discretize these hidden representations and compress them to a low bit rate level, yielding a compressed discrete speech representation and a compressed discrete background sound representation; and the decoder restores the compressed discrete speech representation and the compressed discrete background sound representation into a separated speech signal and a separated background sound signal, realizing audio separation in the feature representation space.
In a third aspect, the invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of the first aspect when executing the program.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of the first aspect.
The technical scheme provided by the embodiment of the invention has at least the following technical effects:
The speech and background sound separation method for low-bit-rate environments extends the end-to-end neural audio codec framework (Codec) to the speech separation task for the first time. It not only retains the modeling advantage of neural codec models at high compression ratios, but also achieves cooperative modeling of the compression and separation tasks through systematic optimization of the internal modules, remarkably improving separation quality and robustness at low bit rates and solving the problem that traditional separation methods struggle to balance separation performance and audio quality under low-bit-rate compression. Compared with prior methods, the invention has at least the following advantages:
1. In terms of model architecture design, it innovatively addresses the problems that speech and background sound are easily confused and feature expression is limited under low-bit-rate conditions, providing a solution of practical value for application scenarios such as remote communication, speech recognition preprocessing and hearing aid design.
2. At a low bit rate of 3 kbps, although the PESQ of the separated speech is 1.853, slightly lower than the single-audio reconstruction result (1.901) at the same bit rate, the speech quality remains good given the complexity of the task, showing that the method retains practicality and robustness in resource-limited scenarios.
3. For the separation scenario at low bit rates, targeted structural improvements are made to the Codec model, expanding the application boundary of Codec models in the field of audio processing.
4. The method has important theoretical significance and broad application prospects for the fusion of audio compression and multitask speech processing, and has remarkable innovation and promotion value.
The foregoing description is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be more clearly understood and implemented in accordance with the content of the specification, and to make the above and other objects, features and advantages of the present invention more readily apparent, specific embodiments of the invention are set forth below.
Detailed Description
The embodiment of the invention provides a Codec-based method, device, equipment and medium for separating speech from background sound. Taking the Codec technology as the core, the method separates speech from background sound in a representation space, avoiding the computational burden and separation errors caused by direct processing at the waveform or spectrum level, and thereby solving the problem that traditional separation methods struggle to balance separation performance and audio quality under low-bit-rate compression.
The overall idea of the technical scheme in the embodiment of the invention is as follows:
The Codec model focuses on end-to-end neural audio codec technology and can achieve high-fidelity audio reconstruction at extremely low bit rates. Existing Codec systems are mainly used for compressing and reconstructing audio; their input and output are the same piece of audio. Such a system generally consists of four modules, as shown in fig. 1: 1) an audio encoder, which extracts audio features, encodes them into a high-dimensional representation space, and simultaneously downsamples the audio; 2) a quantizer, which discretizes the high-dimensional representation and generates a discrete representation composed of a finite number of vectors; 3) an audio decoder, which restores the discrete representation to the original audio, realizing the compression and reconstruction functions; and 4) a discriminator, which evaluates audio reconstruction quality along multiple dimensions such as time-domain sampling points, the spectrogram, and the Perceptual Evaluation of Speech Quality (PESQ). During model training, the design and performance of any module can influence the overall system effect, and the model selection of each module is closely related to the final performance. Current Codec systems are limited to audio compression and reconstruction; although efficient compression and restoration can be achieved, the impact of the above modules on the audio reconstruction task at low bit rates, the specific impact of the internal modules (especially the audio representation quantizer) on model performance, and their potential for separating speech from background sound have not been investigated systematically.
Against this background, the invention provides a speech and background sound separation method for low-bit-rate environments, introducing the end-to-end neural audio codec framework (Codec) into the field of speech and background sound separation for the first time. By constructing a unified evaluation framework and a series of modularized ablation experiments, it systematically explores how combinations of key modules such as the encoder, quantizer, decoder and discriminator affect system performance, and identifies the optimal structural configuration for audio compression and separation tasks. The scheme not only retains the modeling advantage of neural codec models at high compression ratios, but also achieves collaborative modeling of the compression and separation tasks through systematic optimization of internal modules (such as the encoder architecture, discriminator design and audio quantizer), remarkably improving separation quality and robustness at low bit rates. The embodiment of the invention provides a speech and background sound separation method based on the Codec technology, aiming to achieve efficient separation of speech and background sound while realizing low-bit-rate compression. Addressing the limitations of existing audio codec technology in the speech separation task, the invention systematically explores the applicability of various Codec frameworks to the speech and background sound separation task, and further studies the specific influence of different modules, particularly the audio quantization module, on separation performance.
On this basis, the invention provides an optimally designed Codec-based separation model in which the encoder structure, discriminator design and quantizer model are cooperatively improved, so that the model retains good speech and background sound separation performance under low-bit-rate compression. The invention is the first systematic study of the influence of each Codec module (including the quantizer, discriminator and encoder framework) on speech and background sound separation performance; it verifies the extensibility of the Codec to multi-task audio processing, makes targeted structural improvements for the new task scenario, and expands the application boundary of Codec models in the field of audio processing. Instead of a Codec-model-based implementation, the architecture can also be implemented on the basis of other speech codecs (e.g., Opus, AMR, or the like).
Compared with the prior art, the method starts from model architecture design, creatively addresses the problems that speech and background sound are easily confused and feature expression is limited under low-bit-rate conditions, and provides a solution of practical value for application scenarios such as remote communication, speech recognition preprocessing and hearing aid design. The scheme therefore has important theoretical significance and broad application prospects for the fusion of audio compression and multitask speech processing, and has remarkable innovation and promotion value.
Example 1
The embodiment provides a method for separating speech from background sound based on Codec, as shown in fig. 2, including:
S1, a separation model construction process: a Codec-based separation model is constructed, as shown in fig. 3, comprising an encoder, a separation module, two audio representation quantizers and a decoder. The encoder encodes audio into continuous audio representations; the separation module separates the continuous audio representations into a hidden representation containing speech and a hidden representation containing background sound; the two audio representation quantizers respectively discretize these hidden representations and compress them to a low bit rate level, yielding compressed discrete speech representations and compressed discrete background sound representations; and the decoder restores the compressed discrete speech representations and compressed discrete background sound representations into separated speech signals and background sound signals, achieving audio separation in the feature representation space.
S2, a model training process: a clean speech sample is randomly selected and superposed with background sound to generate mixed input audio; the Codec-based separation model performs audio separation to obtain a speech signal and a background sound signal; the two are mixed to obtain a predicted mixed audio; a first discrimination module and a second discrimination module respectively evaluate the model's post-separation reconstruction quality of the speech and the background sound, while a third discrimination module evaluates the reconstruction quality of the predicted mixed audio; the Codec-based separation model is thereby trained until training is complete.
S3, an audio separation process: target audio is separated with the trained Codec-based separation model, and the speech quality of the model's separated output is quantitatively evaluated with the Perceptual Evaluation of Speech Quality (PESQ) metric.
In the model training process, the design and performance of any module can influence the overall system effect, and the model selection of each module is closely related to the final performance. The influence of these modules on the audio reconstruction task at low bit rates has not previously been studied systematically. In one embodiment, the impact of different module designs on audio compression and reconstruction performance is systematically explored under different neural codec architectures (e.g., EnCodec and HILCodec). Specifically, the study includes comparative experiments on discriminator combinations, such as a multi-filter bank discriminator (MFBD) combined with a multi-scale short-time Fourier transform discriminator (MSTFTD), a multi-period discriminator (MPD) combined with a multi-scale discriminator (MSD), and all four discriminators used together; on different quantizer designs, including residual vector quantization (RVQ), finite scalar quantization (FSQ), binary spherical quantization (BSQ), lookup-free quantization (LFQ), and residual variants of RVQ and LFQ; and on configuration combinations of different codebook sizes and codebook numbers.
The research is carried out in an ultra-low-bit-rate compression scenario of 3 kbps; the study evaluates the trade-off of each design between audio compression rate and reconstruction fidelity, and provides an optimization scheme for low-bit-rate speech processing applications. The study flow chart of this section is detailed in fig. 5.
The finally selected Codec-based separation model is shown in fig. 3. It is based on the HILCodec model architecture and comprises four main parts: an encoder, a separation module, two audio representation quantizers and a decoder. The separation module comprises two groups of nonlinear layers, each group comprising nonlinear layers of different dimensions, specifically a 64-dimensional nonlinear layer, a 32-dimensional nonlinear layer and a 16-dimensional nonlinear layer. The first, second and third discrimination modules each comprise a multi-filter bank discriminator (MFBD) and a multi-scale short-time Fourier transform discriminator (MSTFTD). The audio representation quantizer adopts finite scalar quantization (FSQ).
The specific steps of the model training process are as follows:
1. The model is trained with multiple open-source data sets, including: 1) speech data sets: DNS-Challenge, LibriTTS and VCTK; 2) a noise data set: the DNS-Challenge noise set; 3) music and other sound data sets: MTG-Jamendo and AudioSet (audio_event). The process of generating mixed input audio for training is specifically: randomly acquiring a segment of audio signal x_s containing one or more voices and an audio signal x_b containing only background sound, and mixing them through volume normalization and signal superposition into an audio signal in which speech and background sound are mixed, i.e. the mixed input audio x_mix, with the formula:
x_s ∈ R^T, x_b ∈ R^T; x_mix = Norm(x_s) + Norm(x_b);
wherein T is the number of sampling points of the audio and Norm(·) denotes volume normalization.
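The mixing step above can be sketched as follows. This is an illustrative sketch, not code from the disclosure: the function names and the RMS-based normalization target are our assumptions, since the patent specifies only "volume normalization and signal superposition".

```python
import numpy as np

def normalize_volume(x: np.ndarray, target_rms: float = 0.1) -> np.ndarray:
    """Scale the signal so its RMS matches target_rms (assumed normalization)."""
    rms = np.sqrt(np.mean(x ** 2)) + 1e-8
    return x * (target_rms / rms)

def mix(x_s: np.ndarray, x_b: np.ndarray) -> np.ndarray:
    """x_mix = Norm(x_s) + Norm(x_b); both clips have shape (T,)."""
    assert x_s.shape == x_b.shape, "speech and background must have T samples each"
    return normalize_volume(x_s) + normalize_volume(x_b)

# One second of placeholder audio at the 24 kHz rate used later in this example.
T = 24000
rng = np.random.default_rng(0)
x_mix = mix(rng.standard_normal(T), rng.standard_normal(T))
print(x_mix.shape)  # (24000,)
```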
2. The encoder encodes the mixed audio into continuous audio representations z, where N is the representation length. The formula is as follows:
z = Enc(x_mix);
wherein Enc(·) is the audio encoder function.
3. The output of the audio encoder is passed through a separation module Sep(·) consisting of a plurality of nonlinear layers, which outputs a hidden representation h_s containing speech and a hidden representation h_b containing background sound. The formula is as follows:
(h_s, h_b) = Sep(z);
4. The hidden representation containing speech and the hidden representation containing background sound are separately discretized by the FSQ quantizer and compressed to a bit rate level of 3 kbps. FSQ bounds each scalar so that the set of all vectors composed of such scalars is finite; the FSQ quantizer principle is shown in fig. 6. The specific formula is as follows:
X̂ = round((L-1)/2 · (tanh(X) + 1));
wherein round(·) is the rounding function, L is the discretization level, and d is the dimension of each vector in X, so that L^d is the vocabulary size of the discretized vectors. FSQ thus converts each scalar contained in the tensor X into the range 0 to L-1. This embodiment uses an input audio sampling rate of 24 kHz and sets the quantization levels of the FSQ quantizer to [8,8,8,8,5,5,5,5,5,5,5,5,5,5,5,5].
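A sketch of this quantization step with the per-dimension level vector above. The 0-to-(L-1) integer-code formulation follows the range stated in this disclosure; the straight-through gradient estimator that FSQ uses during training is omitted, and the function name is ours.

```python
import numpy as np

LEVELS = np.array([8, 8, 8, 8] + [5] * 12)  # d = 16 dimensions, as configured above

def fsq_quantize(x: np.ndarray) -> np.ndarray:
    """Quantize x of shape (N, 16): tanh bounds each scalar to (-1, 1),
    which is then shifted/scaled to (0, L_i - 1) per dimension and rounded."""
    half = (LEVELS - 1) / 2.0
    return np.round(half * (np.tanh(x) + 1.0)).astype(int)

rng = np.random.default_rng(0)
codes = fsq_quantize(rng.standard_normal((100, 16)))
print(codes.min() >= 0, (codes < LEVELS).all())  # True True

# Vocabulary size of the discrete vectors: 8^4 * 5^12 = 2^12 * 5^12 = 10^12.
vocab_size = int(np.prod(LEVELS.astype(object)))
```

Note that because the codebook is implicit (a fixed integer grid per dimension), FSQ needs no learned codebook, unlike RVQ.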
5. The compressed discrete audio representations are fed into a decoder Dec(·), which restores them to the separated multi-speaker speech signal and background sound signal, thereby achieving audio separation in the feature representation space. The formula is as follows:
x̂_s = Dec(q_s), x̂_b = Dec(q_b);
wherein q_s and q_b are the compressed discrete speech and background sound representations.
6. MFBD (multi-filter bank discriminator) and MSTFTD (multi-scale short-time Fourier transform discriminator) are adopted as the discrimination modules for evaluating the quality of the model's separation and reconstruction of audio in the representation space; this discriminator combination can evaluate the performance of the audio reconstruction task at low bit rates. Specifically, the first and second discrimination modules respectively evaluate the model's post-separation reconstruction quality of the speech and the background sound, while the third discrimination module evaluates the reconstruction quality of the whole mixed audio (comparing the predicted mixed audio with the mixed input audio), improving the separation effect and reducing information loss as much as possible.
Finally, in the audio separation process, no discrimination module is needed, and the speech quality of the model's separated output can be quantitatively evaluated with the Perceptual Evaluation of Speech Quality (PESQ) metric.
The method of this embodiment takes the Codec technology as its core and, by separating speech from background sound in the representation space, solves the problem that traditional separation methods struggle to balance separation performance and audio quality under low-bit-rate compression. Specifically, the Codec model can effectively extract compact, information-dense intermediate representations of speech and background sound, so that the separation module can accurately distinguish the different sound sources on the compressed representation, remarkably reducing the risk of information loss. Meanwhile, the scheme avoids the computational burden and separation errors caused by direct processing at the waveform or spectrum level. At a low bit rate of 3 kbps, although the PESQ of the separated speech is 1.853, slightly lower than the single-audio reconstruction result (1.901) at the same bit rate, the speech quality remains good given the complexity of the task, demonstrating the method's practicality and robustness in resource-limited scenarios.
Based on the same inventive concept, the application also provides a device corresponding to the method of the first embodiment, as detailed in the second embodiment below.
Example two
In this embodiment, a Codec-based speech and background sound separation device is provided, comprising a Codec-based separation model (for example, the model architecture shown in fig. 3). The Codec-based separation model comprises an encoder, a separation module, two audio representation quantizers and a decoder: the encoder encodes audio into continuous audio representations; the separation module separates the continuous audio representations into a hidden representation containing speech and a hidden representation containing background sound; the two audio representation quantizers respectively discretize these hidden representations and compress them to a low bit rate level, yielding a compressed discrete speech representation and a compressed discrete background sound representation; and the decoder restores the compressed discrete speech representation and the compressed discrete background sound representation into a separated speech signal and a separated background sound signal, thereby realizing audio separation in the feature representation space.
Since the device described in the second embodiment of the present invention is a device for implementing the method described in the first embodiment of the present invention, based on the method described in the first embodiment of the present invention, a person skilled in the art can understand the specific structure and the deformation of the device, and thus the detailed description thereof is omitted herein. All devices used in the method according to the first embodiment of the present invention are within the scope of the present invention.
Based on the same inventive concept, the application provides an electronic device embodiment corresponding to the first embodiment, as detailed in the third embodiment below.
Example III
The present embodiment provides an electronic device, as shown in fig. 7, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where any implementation of the first embodiment may be implemented when the processor executes the computer program.
Since the electronic device described in this embodiment is a device for implementing the method in the first embodiment of the present application, those skilled in the art will be able to understand the specific implementation of the electronic device and various modifications thereof based on the method described in the first embodiment of the present application, so how the electronic device implements the method in the embodiment of the present application will not be described in detail herein. The apparatus used to implement the methods of embodiments of the present application will be within the scope of the intended protection of the present application.
Based on the same inventive concept, the application provides a storage medium corresponding to the first embodiment, as detailed in the fourth embodiment below.
Example IV
The present embodiment provides a computer readable storage medium, as shown in fig. 8, on which a computer program is stored, which when executed by a processor, can implement any implementation of the first embodiment.
Since the computer readable storage medium described in this embodiment is the storage medium used to implement the method in the first embodiment of the present application, those skilled in the art can, based on the method described in the first embodiment, understand its specific implementation and various modifications, so how the computer readable storage medium implements the method of the embodiment of the present application is not described in detail herein. Any computer readable storage medium employed by those skilled in the art to implement the method of the embodiments of the present application falls within the intended scope of protection of the present application.
The invention provides a speech and background sound separation method for low-bit-rate environments, which extends the end-to-end neural audio codec framework (Codec) to the speech separation task for the first time. It retains the modeling advantage of neural codec models at high compression ratios while achieving cooperative modeling of the compression and separation tasks through systematic optimization of the internal modules, remarkably improving separation quality and robustness at low bit rates. Compared with the prior art, the method starts from model architecture design, creatively addresses the problems that speech and background sound are easily confused and feature expression is limited under low-bit-rate conditions, and provides a practical solution for application scenarios such as remote communication, speech recognition preprocessing and hearing aid design. The scheme has important theoretical significance and broad application prospects for the fusion of audio compression and multitask speech processing, and has remarkable innovation and promotion value. The technical scheme of the invention mainly concerns a speech and background sound separation system under low-bit-rate conditions; once the scheme is mature, its speech component service can be invoked on power-related systems.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that the specific embodiments described are illustrative only and not intended to limit the scope of the invention, and that equivalent modifications and variations of the invention in light of the spirit of the invention will be covered by the claims of the present invention.