CN120452467A - A method, device, equipment and medium for separating speech and background sound based on Codec - Google Patents

Info

Publication number
CN120452467A
CN120452467A
Authority
CN
China
Prior art keywords
audio
separation
representation
background sound
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202510962659.3A
Other languages
Chinese (zh)
Other versions
CN120452467B (en)
Inventor
关城
林羽静
张松磊
张良嵩
董衍旭
乐艺泽
杨启帆
林志华
叶勋
陈红
郑烜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Yili Information Technology Co ltd
Information and Telecommunication Branch of State Grid Fujian Electric Power Co Ltd
Original Assignee
Fujian Yili Information Technology Co ltd
Information and Telecommunication Branch of State Grid Fujian Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Yili Information Technology Co ltd and Information and Telecommunication Branch of State Grid Fujian Electric Power Co Ltd
Priority to CN202510962659.3A
Publication of CN120452467A
Application granted
Publication of CN120452467B
Legal status: Active
Anticipated expiration

Abstract

Translated from Chinese

The present invention discloses a method, device, equipment and medium for separating speech and background sound based on Codec, and relates to the technical field of speech noise separation. The method includes a separation model construction process, a model training process and an audio separation process. By constructing and training a codec-based encoding and decoding separation model, including an encoder, a separation module, two audio representation quantizers and a decoder, audio separation in the feature representation space is achieved. The present invention provides a method, device, equipment and medium for separating speech and background sound based on Codec, which takes Codec technology as the core. By separating speech and background sound in the representation space, the computational burden and separation error caused by direct processing at the waveform or spectrum level are avoided, and the problem that traditional separation methods are difficult to balance separation performance and audio quality under low bit rate compression conditions is successfully solved.

Description

Codec-based voice and background sound separation method, device, equipment and medium
Technical Field
The present invention relates to the field of speech noise separation technologies, and in particular, to a method, an apparatus, a device, and a medium for separating speech and background sound based on Codec.
Background
In the fields of speech processing and audio signal processing, the separation of speech from background sound has long been an important research topic. In daily life, people constantly encounter mixed audio that contains both human voice (the foreground) and ambient noise, background music and other background sounds. Efficient and accurate separation of voice from background sound is critical for many applications, such as voice communication, speech recognition, hearing aids, music production and audio enhancement. In remote video conferencing and voice calls, clean voice extraction can markedly improve intelligibility and communication quality; in hearing aids, removing background noise helps hearing-impaired users hear speech more accurately; and in music production and film post-production, accurate separation of voice from background supports mixing and noise reduction. However, conventional speech enhancement and separation methods still struggle in complex acoustic environments, owing to the variability of background noise, speaker diversity and limited computational resources.
In recent years, with the development of deep learning, neural-network-based speech processing has advanced significantly. Speech separation currently relies mainly on deep learning methods such as time-frequency masking (TF-masking), self-supervised learning and end-to-end speech separation models. These methods work well at high bit rates, and performance on tasks such as speech separation and speech enhancement keeps improving. In low-bit-rate compression and transmission scenarios, however, the separation performance of conventional neural methods often degrades due to information loss, so existing methods struggle to balance speech quality and separation performance under limited resources. Moreover, existing neural speech separation models are mainly built and evaluated at high bit rates and lack adaptation to compressed speech features, which limits their practicality at low bit rates.
Disclosure of Invention
The invention aims to solve the above technical problems by providing a Codec-based method, device, equipment and medium for separating speech from background sound. Taking Codec technology as the core, it separates speech from background sound in a representation space, avoiding the computational burden and separation errors incurred by direct processing at the waveform or spectrum level, and thereby solves the problem that traditional separation methods struggle to balance separation performance and audio quality under low-bit-rate compression.
In a first aspect, the present invention provides a method for separating speech from background sound based on Codec, including:
The separation model construction process comprises constructing a Codec-based encoding-decoding separation model comprising an encoder, a separation module, two audio representation quantizers and a decoder. The encoder encodes audio into a continuous audio representation; the separation module separates the continuous audio representation into a hidden representation containing speech and a hidden representation containing background sound; the two audio representation quantizers respectively discretize the two hidden representations and compress them to a low bit rate level, yielding a compressed discrete speech representation and a compressed discrete background sound representation; and the decoder restores the compressed discrete representations into a separated speech signal and background sound signal, thereby achieving audio separation in the feature representation space;
The model training process comprises randomly selecting clean speech samples, superimposing background sound to generate mixed input audio, performing audio separation with the Codec-based separation model to obtain a speech signal and a background sound signal, and mixing these two signals to obtain a predicted mixed audio; a first discrimination module and a second discrimination module evaluate the post-separation reconstruction quality of the speech and the background sound respectively, while a third discrimination module evaluates the reconstruction quality of the predicted mixed audio, and the Codec-based separation model is trained in this way until training is complete;
And the audio separation process, in which the trained Codec-based separation model separates the target audio and the speech quality of the separated output is quantitatively evaluated.
Further, the separation module comprises two sets of nonlinear layers, each set comprising a plurality of nonlinear layers; the two sets extract the speech signal and the background sound signal, respectively.
Further, the process of generating the mixed input audio is specifically: obtain an audio signal x_speech containing one or more voices and an audio signal x_bg containing only background sound; through volume normalization and signal superposition, mix them into an audio signal in which speech and background sound are blended, i.e., the mixed input audio x_mix:

x_mix = norm(x_speech) + norm(x_bg),  with x_speech, x_bg, x_mix ∈ R^T,

wherein norm(·) denotes volume normalization and T is the number of sampling points of the audio.
Further, the two audio representation quantizers each adopt an FSQ quantizer. The quantization process discretizes the hidden representation containing speech and the hidden representation containing background sound separately and compresses them to a low bit rate level, converting each scalar contained in the tensor X into the range 0 to L-1 so that the set of all vectors composed of such scalars is finite:

X_hat = round(σ(X) · (L - 1)),

wherein round(·) is the rounding function, σ(·) is a bounding function mapping each scalar into [0, 1], L is the discretization level, and d is the dimension of each vector in X, so that L^d is the vocabulary size of the discretized vectors.
Further, at an input audio sampling rate of 24 kHz, the quantization vocabulary of the FSQ quantizer is set to [8,8,8,8,5,5,5,5,5,5,5,5,5,5,5,5], compressing the hidden representation to a 3 kbps bit rate level.
Further, the first discrimination module, the second discrimination module and the third discrimination module respectively comprise a multi-scale Fourier discriminator and a multi-resolution short-time Fourier transform discriminator.
Further, the speech quality of the model's separated output is quantitatively evaluated using the Perceptual Evaluation of Speech Quality (PESQ) metric.
In a second aspect, the invention provides a Codec-based speech and background sound separation device comprising a Codec-based encoding-decoding separation model. The model comprises an encoder, a separation module, two audio representation quantizers and a decoder: the encoder encodes audio into a continuous audio representation; the separation module separates the continuous audio representation into a hidden representation containing speech and a hidden representation containing background sound; the two audio representation quantizers respectively discretize the two hidden representations and compress them to a low bit rate level, yielding a compressed discrete speech representation and a compressed discrete background sound representation; and the decoder restores the compressed discrete representations into a separated speech signal and background sound signal, thereby achieving audio separation in the feature representation space.
In a third aspect, the invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of the first aspect when executing the program.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of the first aspect.
The technical scheme provided by the embodiment of the invention has at least the following technical effects:
This speech and background sound separation method for low-bit-rate environments extends the end-to-end neural audio codec framework (Codec) to the speech separation task for the first time. It retains the modeling advantage of neural codec models at high compression ratios while, through systematic optimization of the internal modules, jointly modeling the compression and separation tasks; it significantly improves separation quality and robustness at low bit rates, and solves the problem that traditional separation methods struggle to balance separation performance and audio quality under low-bit-rate compression. Compared with prior methods, the invention has at least the following advantages:
1. In terms of model architecture design, it innovatively addresses the problems that speech and background sound are easily confused and feature expression is limited at low bit rates, offering a solution of practical value for application scenarios such as remote communication, speech recognition preprocessing and hearing aid design.
2. At a low bit rate of 3 kbps, the PESQ of the separated speech is 1.853, slightly below the single-audio reconstruction result (1.901) at the same bit rate; given the difficulty of the task, the speech quality remains good, showing that the method is practical and robust in resource-constrained scenarios.
3. For separation at low bit rates, targeted structural improvements are made to the Codec model, extending its application boundary in the field of audio processing.
4. The method has important theoretical significance and broad application prospects for the fusion of audio compression and multi-task speech processing, with notable innovation and promotion value.
The foregoing is merely an overview of the technical solution of the invention; so that it can be understood more clearly and implemented as described in the specification, and to make the above and other objects, features and advantages of the invention more apparent, specific embodiments of the invention are set forth below.
Drawings
The invention will be further described with reference to examples of embodiments with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of the overall framework of a prior-art Codec;
FIG. 2 is a flow chart of a method according to a first embodiment of the invention;
FIG. 3 is a schematic diagram of a Codec separation model according to a first embodiment of the present invention;
FIG. 4 is a schematic diagram of a model training process according to a first embodiment of the present invention;
FIG. 5 is a diagram illustrating the optimal model architecture search in accordance with the first embodiment of the present invention;
FIG. 6 is a schematic diagram of the FSQ quantization principle in the first embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention;
fig. 8 is a schematic structural diagram of a medium in a fourth embodiment of the present invention.
Detailed Description
Embodiments of the invention provide a Codec-based method, device, equipment and medium for separating speech from background sound. Taking Codec technology as the core, they separate speech from background sound in a representation space, avoiding the computational burden and separation errors incurred by direct processing at the waveform or spectrum level, and thereby solve the problem that traditional separation methods struggle to balance separation performance and audio quality under low-bit-rate compression.
The technical scheme in the embodiment of the invention has the following overall thought:
The Codec line of work focuses on end-to-end neural audio coding and decoding, achieving high-fidelity audio reconstruction at extremely low bit rates. Existing Codec systems are mainly used to compress and reconstruct audio, so their input and output are the same piece of audio. Such a system generally consists of four modules, as shown in fig. 1: 1) an audio encoder, which extracts audio features, encodes them into a high-dimensional representation space and downsamples the audio; 2) a quantizer, which discretizes the high-dimensional representation and produces a discrete representation composed of a finite number of vectors; 3) an audio decoder, which restores the discrete representation to audio, realizing the compression and reconstruction functions; and 4) a discriminator, which evaluates audio reconstruction quality along multiple dimensions such as time-domain samples, spectrograms and Perceptual Evaluation of Speech Quality (PESQ). During model training, the design and performance of any module can affect the overall system, and each module's model choice is closely tied to final performance. Current Codec systems are limited to audio compression and reconstruction; while they achieve efficient compression and restoration, the impact of the above modules on low-bit-rate audio reconstruction, the specific influence of internal modules (especially the audio representation quantizer) on model performance, and their potential for separating speech from background sound have not been systematically investigated.
Against this background, the invention proposes a speech and background sound separation method for low-bit-rate environments, introducing the end-to-end neural audio codec framework (Codec) into the field of speech and background sound separation for the first time. By building a unified evaluation framework and running a series of modular ablation experiments, it systematically explores how combinations of key modules such as the encoder, quantizer, decoder and discriminator affect system performance, and identifies the optimal structural configuration for joint audio compression and separation. The scheme retains the modeling advantage of neural codec models at high compression ratios and, through systematic optimization of internal modules (such as the encoder architecture, discriminator design and audio quantizer), jointly models the compression and separation tasks, significantly improving separation quality and robustness at low bit rates. The embodiments provide a Codec-based speech and background sound separation method that aims to achieve efficient separation of speech and background sound while realizing low-bit-rate compression. Addressing the limitations of existing audio codecs in speech separation, the invention systematically examines the applicability of various Codec frameworks to the speech and background sound separation task, and further studies the specific influence of individual modules, particularly the audio quantization module, on separation performance. On this basis, the invention proposes an optimally designed Codec-based encoding-decoding separation model in which the encoder structure, discriminator design and quantizer are jointly improved, so that it retains good speech and background sound separation performance under low-bit-rate compression. This is the first systematic study of how each Codec module (including the quantizer, discriminator and encoder architecture) affects speech and background sound separation; it verifies the extensibility of Codec to multi-task audio processing, makes targeted structural improvements for the new task scenario, and extends the application boundary of Codec models in audio processing. The implementation need not be based on a Codec-style neural architecture; it can also be built on other speech codecs (e.g., Opus, AMR, or the like).
Compared with the prior art, the method starts from model architecture design, creatively addresses the problems that speech and background sound are easily confused and feature expression is limited at low bit rates, and provides a solution of practical value for application scenarios such as remote communication, speech recognition preprocessing and hearing aid design. The scheme therefore has important theoretical significance and broad application prospects for the fusion of audio compression and multi-task speech processing, with notable innovation and promotion value.
Example 1
The embodiment provides a method for separating speech from background sound based on Codec, as shown in fig. 2, including:
S1, separation model construction process: construct a Codec-based encoding-decoding separation model, as shown in fig. 3, comprising an encoder, a separation module, two audio representation quantizers and a decoder. The encoder encodes audio into a continuous audio representation; the separation module separates the continuous audio representation into a hidden representation containing speech and a hidden representation containing background sound; the two audio representation quantizers respectively discretize the two hidden representations and compress them to a low bit rate level, yielding a compressed discrete speech representation and a compressed discrete background sound representation; and the decoder restores the compressed discrete representations into a separated speech signal and background sound signal, thereby achieving audio separation in the feature representation space.
S2, model training process: randomly select clean speech samples and superimpose background sound to generate mixed input audio; perform audio separation with the Codec-based separation model to obtain a speech signal and a background sound signal, and mix them to obtain a predicted mixed audio; use a first discrimination module and a second discrimination module to evaluate the model's post-separation reconstruction quality of the speech and the background sound respectively, while a third discrimination module evaluates the reconstruction quality of the predicted mixed audio; train the Codec-based separation model in this way until training is complete.
S3, audio separation process: use the trained Codec-based separation model to separate the target audio, and quantitatively evaluate the speech quality of the separated output with the Perceptual Evaluation of Speech Quality (PESQ) metric.
During model training, the design and performance of any module can affect the overall system, and each module's model choice is closely tied to final performance. The impact of these modules on low-bit-rate audio reconstruction has not previously been studied systematically. In one embodiment, the effect of different module designs on audio compression and reconstruction performance is systematically explored under different neural codec architectures (e.g., EnCodec and HILCodec). Specifically, the study includes comparative experiments on various discriminator combinations, such as a multi-scale feature-based discriminator (MFBD) with a multi-scale short-time Fourier transform discriminator (MSTFTD), a multi-period discriminator (MPD) with a multi-scale discriminator (MSD), and all four discriminators used together; on different quantizer designs, including the residual vector quantizer (RVQ) and a residual-structure RVQ variant, the finite scalar quantizer (FSQ), the binary spherical quantizer (BSQ), and the lookup-free quantizer (LFQ) together with its residual variant (Residual LFQ); and on different configurations of codebook size and number of codebooks.
The research is carried out under the ultra-low code rate compression scene of 3kbps, the system evaluates the trade-off performance of various designs between the audio compression rate and the reconstruction fidelity, and an optimization scheme is further provided for low code rate voice processing application. The study flow chart of this section is detailed in fig. 5.
The finally selected Codec-based encoding-decoding separation model is shown in fig. 3. It is based on the HILCodec model architecture and comprises four main parts: an encoder, a separation module, two audio representation quantizers and a decoder. The separation module comprises two groups of nonlinear layers, each group containing several nonlinear layers of different dimensions, specifically a 64-dimensional, a 32-dimensional and a 16-dimensional nonlinear layer. The first, second and third discrimination modules each comprise a multi-scale Fourier discriminator (MFBD) and a multi-resolution short-time Fourier transform discriminator (MSTFTD). The audio representation quantizers use the finite scalar quantizer (FSQ).
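To make the data flow concrete, the following is a minimal sketch of the separation module described above, assuming PyTorch; the rep_dim value and the exact wiring of the 64-, 32- and 16-dimensional layers are illustrative assumptions, as the text specifies only the layer dimensions.

```python
# Minimal sketch (PyTorch assumed) of the separation stage described above.
# The 64/32/16 hidden sizes follow the text; rep_dim and the exact layer
# wiring are illustrative guesses, not the patented architecture itself.
import torch
import torch.nn as nn

class SeparationModule(nn.Module):
    """Splits one continuous audio representation into speech / background parts."""
    def __init__(self, rep_dim: int = 128):
        super().__init__()
        def branch() -> nn.Sequential:
            # One group of nonlinear layers (64 -> 32 -> 16 bottleneck, then back up).
            return nn.Sequential(
                nn.Linear(rep_dim, 64), nn.ELU(),
                nn.Linear(64, 32), nn.ELU(),
                nn.Linear(32, 16), nn.ELU(),
                nn.Linear(16, rep_dim),
            )
        self.speech_branch = branch()      # -> hidden representation with speech
        self.background_branch = branch()  # -> hidden representation with background

    def forward(self, x: torch.Tensor):
        # x: (batch, N, rep_dim) continuous representation from the encoder
        return self.speech_branch(x), self.background_branch(x)
```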
The specific steps of the model training process are as follows:
1. The model is trained using multiple open-source datasets, including: 1) speech datasets: DNS-Challenge, LibriTTS, VCTK; 2) noise datasets: the DNS-Challenge noise set; 3) music and other sound datasets: MTG-Jamendo and AudioSet (audio_event). The mixed input audio for training is generated by randomly taking an audio signal x_speech containing one or more voices and an audio signal x_bg containing only background sound, and mixing them by volume normalization and signal superposition into an audio signal in which speech and background sound are blended, i.e., the mixed input audio x_mix:

x_mix = norm(x_speech) + norm(x_bg),  with x_speech, x_bg, x_mix ∈ R^T,

wherein norm(·) denotes volume normalization and T is the number of sampling points of the audio.
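As an illustration of this mixing step, here is a short Python sketch; the RMS-based volume normalization and the final clipping are assumptions, since the text specifies only "volume normalization and signal superposition".

```python
# Sketch of the mixing step: volume-normalize a speech clip and a background
# clip of equal length T, then superpose them.
import numpy as np

def rms_normalize(x: np.ndarray, target_rms: float = 0.1) -> np.ndarray:
    # Scale the signal so its root-mean-square level matches target_rms.
    rms = np.sqrt(np.mean(x ** 2)) + 1e-8
    return x * (target_rms / rms)

def make_mixture(speech: np.ndarray, background: np.ndarray) -> np.ndarray:
    assert speech.shape == background.shape  # both have T samples
    mix = rms_normalize(speech) + rms_normalize(background)
    return np.clip(mix, -1.0, 1.0)  # keep the mixture in valid waveform range
```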
2. The encoder encodes the mixed audio into a continuous audio representation X ∈ R^{N×d}, where N is the representation length:

X = Enc(x_mix),

wherein Enc(·) is the audio encoder function.
3. The output of the audio encoder is passed through a separation module Sep(·) consisting of multiple nonlinear layers, which outputs a hidden representation Z_s containing speech and a hidden representation Z_b containing background sound:

(Z_s, Z_b) = Sep(X).
4. The hidden representation containing speech and the hidden representation containing background sound are separately discretized by FSQ quantizers and compressed to a 3 kbps bit rate level. FSQ bounds each scalar so that the set of all vectors composed of such scalars is finite; the FSQ quantizer principle is shown in fig. 6. The specific formula is as follows:

X_hat = round(σ(X) · (L - 1)),

wherein round(·) is the rounding function, σ(·) is a bounding function mapping each scalar into [0, 1], L is the discretization level, and d is the dimension of each vector in X, so that L^d is the vocabulary size of the discretized vectors.
FSQ thus converts each scalar contained in the tensor X into the range 0 to L-1. This embodiment uses an input audio sampling rate of 24 kHz and sets the quantization vocabulary of the FSQ quantizer to [8,8,8,8,5,5,5,5,5,5,5,5,5,5,5,5].
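Below is a minimal sketch of this FSQ step, assuming PyTorch and a sigmoid bounding function (the exact bounding function is not specified in the text), followed by a back-of-envelope check of the 3 kbps figure; the 75 Hz representation frame rate (24 kHz with 320x downsampling) is an assumed, typical codec setting, as the text states only the sample rate, vocabulary and bit rate.

```python
# FSQ-style scalar quantization: bound each scalar to [0, 1] with a sigmoid
# (an assumption), scale to L-1 levels, round, and pass gradients straight
# through so the quantizer remains trainable.
import math
import torch

def fsq_quantize(x: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    # x: (..., d) hidden representation; levels: (d,) per-channel L values,
    # e.g. torch.tensor([8, 8, 8, 8] + [5] * 12) for the vocabulary used here.
    bounded = torch.sigmoid(x) * (levels - 1)        # each scalar in [0, L-1]
    quantized = torch.round(bounded)                 # integer code per scalar
    return bounded + (quantized - bounded).detach()  # straight-through estimator

# Consistency check of the stated 3 kbps bit rate under the assumed frame rate.
levels = [8, 8, 8, 8] + [5] * 12
bits_per_frame = sum(math.log2(l) for l in levels)   # ~39.9 bits per frame
frame_rate = 24_000 / 320                            # 75 frames/s (assumption)
print(round(bits_per_frame * frame_rate))            # ~2990 bits/s, i.e. ~3 kbps
```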
5. The compressed discrete audio representations are fed into a decoder Dec(·), which restores them into the separated (possibly multi-speaker) speech signal and background sound signal, thereby achieving audio separation in the feature representation space:

x_speech_hat = Dec(Zq_s),  x_bg_hat = Dec(Zq_b),

wherein Zq_s and Zq_b denote the quantized speech and background sound representations.
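Putting the pieces together, the end-to-end separation pass can be expressed as follows; all five callables are placeholders for the trained modules, with per-module details as assumed in the earlier sketches.

```python
# End-to-end forward pass: encode -> separate -> quantize each branch -> decode.
def separate(audio, encoder, sep, fsq_speech, fsq_bg, decoder):
    x = encoder(audio)                          # continuous representation X
    z_s, z_b = sep(x)                           # speech / background hidden parts
    zq_s, zq_b = fsq_speech(z_s), fsq_bg(z_b)   # discretize and compress
    return decoder(zq_s), decoder(zq_b)         # separated waveforms
```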
MFBD (multi-scale Fourier discriminator) and MSTFTD (multi-resolution short-time Fourier transform discriminator) are adopted in the discrimination modules to evaluate the quality of the audio separated and reconstructed in the representation space; this discriminator combination can assess reconstruction performance at low bit rates. Specifically, the first and second discrimination modules evaluate the model's post-separation reconstruction quality of the speech and the background sound respectively, while the third discrimination module judges the reconstruction quality of the overall mixed audio (comparing the predicted mixed audio with the mixed input audio), thereby improving the separation effect and minimizing information loss.
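The adversarial part of this three-discriminator training signal can be sketched as below; the hinge-style generator loss and the plain additive re-mixing are assumptions, and d1, d2, d3 stand for the three discrimination modules.

```python
# Generator-side adversarial objective: d1 scores separated speech, d2 the
# separated background, d3 the re-mixed prediction against real audio.
import torch

def hinge_g_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # Push discriminator scores of generated audio above the hinge margin.
    return torch.relu(1.0 - d_fake).mean()

def generator_adv_loss(d1, d2, d3, speech_hat, bg_hat):
    mix_hat = speech_hat + bg_hat               # predicted mixed audio
    return (hinge_g_loss(d1(speech_hat))        # speech reconstruction quality
            + hinge_g_loss(d2(bg_hat))          # background reconstruction quality
            + hinge_g_loss(d3(mix_hat)))        # mixture reconstruction quality
```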
Finally, in the audio separation process, no discrimination module is needed; the speech quality of the model's separated output can be quantitatively evaluated with the Perceptual Evaluation of Speech Quality (PESQ) metric.
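For the quantitative evaluation, a sketch using the open-source pesq package is shown below (an assumption; the text does not name a specific implementation). PESQ is defined for 8 kHz and 16 kHz signals, so 24 kHz model outputs would first be resampled.

```python
from pesq import pesq  # pip install pesq

def evaluate_separated_speech(ref_16k, deg_16k) -> float:
    # ref_16k: clean reference speech, deg_16k: separated speech output,
    # both 1-D float arrays at 16 kHz; "wb" selects wideband PESQ.
    return pesq(16_000, ref_16k, deg_16k, mode="wb")
```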
The method of this embodiment takes Codec technology as the core and, by separating speech from background sound in the representation space, solves the problem that traditional separation methods struggle to balance separation performance and audio quality under low-bit-rate compression. Specifically, the Codec model effectively extracts compact, information-dense intermediate representations of speech and background sound, so the separation module can accurately distinguish the sound sources on the compressed representation, significantly reducing the risk of information loss. The scheme also avoids the computational burden and separation errors incurred by direct processing at the waveform or spectrum level. At a low bit rate of 3 kbps, although the PESQ of the separated speech is 1.853, slightly below the single-audio reconstruction result (1.901) at the same bit rate, the speech quality remains good given the difficulty of the task, showing that the method is practical and robust in resource-constrained scenarios.
Based on the same inventive concept, the application also provides a device corresponding to the method of the first embodiment; see the second embodiment for details.
Example two
This embodiment provides a Codec-based speech and background sound separation device comprising a Codec-based encoding-decoding separation model (which may have the architecture shown in fig. 3). The model comprises an encoder, a separation module, two audio representation quantizers and a decoder: the encoder encodes audio into a continuous audio representation; the separation module separates the continuous audio representation into a hidden representation containing speech and a hidden representation containing background sound; the two audio representation quantizers respectively discretize the two hidden representations and compress them to a low bit rate level, yielding a compressed discrete speech representation and a compressed discrete background sound representation; and the decoder restores the compressed discrete representations into a separated speech signal and background sound signal, thereby achieving audio separation in the feature representation space.
Since the device described in this second embodiment implements the method of the first embodiment, a person skilled in the art can understand its specific structure and variations from the description of that method, and details are therefore omitted here. Any device used to implement the method of the first embodiment of the invention falls within the intended scope of protection of the invention.
Based on the same inventive concept, the application provides an electronic device embodiment corresponding to the first embodiment; see the third embodiment for details.
Example III
The present embodiment provides an electronic device, as shown in fig. 7, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where any implementation of the first embodiment may be implemented when the processor executes the computer program.
Since the electronic device described in this embodiment implements the method of the first embodiment of the application, those skilled in the art can understand its specific implementation and variations from the method described there, so how the electronic device implements that method is not detailed here. Any device used to implement the method of the embodiments of the application falls within the intended scope of protection of the application.
Based on the same inventive concept, the application provides a storage medium embodiment corresponding to the first embodiment; see the fourth embodiment for details.
Example IV
The present embodiment provides a computer readable storage medium, as shown in fig. 8, on which a computer program is stored, which when executed by a processor, can implement any implementation of the first embodiment.
Since the computer-readable storage medium described in this embodiment is used to implement the method of the first embodiment of the application, those skilled in the art can understand its specific implementation and variations from the method described there, so it is not detailed here. Any computer-readable storage medium used to implement the method of the embodiments of the application falls within the intended scope of protection of the application.
The invention provides a speech and background sound separation method for low-bit-rate environments that extends the end-to-end neural audio codec framework (Codec) to the speech separation task for the first time; it retains the modeling advantage of neural codec models at high compression ratios while, through systematic optimization of the internal modules, jointly modeling the compression and separation tasks, significantly improving separation quality and robustness at low bit rates. Compared with the prior art, the method starts from model architecture design, creatively addresses the problems that speech and background sound are easily confused and feature expression is limited at low bit rates, and provides a solution of practical value for application scenarios such as remote communication, speech recognition preprocessing and hearing aid design. The scheme has important theoretical significance and broad application prospects for the fusion of audio compression and multi-task speech processing, with notable innovation and promotion value. The technical solution mainly concerns a speech and background sound separation system for low-bit-rate conditions; once mature, its speech component services can be invoked by power-grid-related systems.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that the specific embodiments described are illustrative only and not intended to limit the scope of the invention, and that equivalent modifications and variations of the invention in light of the spirit of the invention will be covered by the claims of the present invention.

Claims (10)

Translated from Chinese
1. A Codec-based method for separating speech and background sound, characterized by comprising:
a separation model construction process: constructing a Codec-based encoding-decoding separation model comprising an encoder, a separation module, two audio representation quantizers and a decoder; the encoder encodes audio into a continuous audio representation; the separation module separates the continuous audio representation into a hidden representation containing speech and a hidden representation containing background sound; the two audio representation quantizers respectively discretize the two hidden representations and compress them to a low bit rate level, yielding a compressed discrete speech representation and a compressed discrete background sound representation; the decoder restores the compressed discrete representations into a separated speech signal and background sound signal, thereby achieving audio separation in the feature representation space;
a model training process: randomly selecting clean speech samples and superimposing background sound to generate mixed input audio; performing audio separation with the Codec-based separation model to obtain a speech signal and a background sound signal, and mixing them to obtain a predicted mixed audio; using a first discrimination module and a second discrimination module to evaluate the model's post-separation reconstruction quality of the speech and the background sound respectively, while a third discrimination module evaluates the reconstruction quality of the predicted mixed audio, thereby training the Codec-based separation model until completion;
an audio separation process: using the trained Codec-based separation model to separate target audio, and quantitatively evaluating the speech quality of the model's separated output.
2. The method according to claim 1, wherein the separation module comprises two groups of nonlinear layers, each group comprising a plurality of nonlinear layers of different dimensions, the two groups being used to extract the speech signal and the background sound signal, respectively.
3. The method according to claim 1, wherein generating the mixed input audio specifically comprises: obtaining an audio signal x_speech containing one or more voices and an audio signal x_bg containing only background sound, and mixing them by volume normalization and signal superposition into an audio signal in which speech and background sound are blended, i.e., the mixed input audio x_mix: x_mix = norm(x_speech) + norm(x_bg), with x_speech, x_bg, x_mix ∈ R^T, where T is the number of audio sampling points.
4. The method according to claim 1, wherein both audio representation quantizers adopt FSQ quantizers, the quantization process specifically being: discretizing the hidden representation containing speech and the hidden representation containing background sound separately and compressing them to a low bit rate level, converting each scalar contained in the tensor X into the range 0 to L-1 so that the set of all vectors composed of such scalars is finite: X_hat = round(σ(X) · (L - 1)), where round(·) is the rounding function, σ(·) is a bounding function mapping each scalar into [0, 1], L is the discretization level, and d is the dimension of each vector in X, so that L^d is the vocabulary size of the discretized vectors.
5. The method according to claim 4, wherein, at an input audio sampling rate of 24 kHz, the quantization vocabulary of the FSQ quantizer is set to [8,8,8,8,5,5,5,5,5,5,5,5,5,5,5,5], compressing the hidden representation to a 3 kbps bit rate level.
6. The method according to claim 1, wherein the first discrimination module, the second discrimination module and the third discrimination module each comprise a multi-scale Fourier discriminator and a multi-resolution short-time Fourier transform discriminator.
7. The method according to claim 1, wherein a Perceptual Evaluation of Speech Quality (PESQ) metric is used to quantitatively evaluate the speech quality of the model's separated output.
8. A Codec-based speech and background sound separation device, characterized by comprising a Codec-based encoding-decoding separation model, the model comprising an encoder, a separation module, two audio representation quantizers and a decoder; the encoder encodes audio into a continuous audio representation; the separation module separates the continuous audio representation into a hidden representation containing speech and a hidden representation containing background sound; the two audio representation quantizers respectively discretize the two hidden representations and compress them to a low bit rate level, yielding a compressed discrete speech representation and a compressed discrete background sound representation; the decoder restores the compressed discrete representations into a separated speech signal and background sound signal, thereby achieving audio separation in the feature representation space.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 7 when executing the program.
10. A computer-readable storage medium having a computer program stored thereon, wherein when the program is executed by a processor, the method according to any one of claims 1 to 7 is implemented.
CN202510962659.3A | Filed 2025-07-14 | Codec-based voice and background sound separation method, device, equipment and medium | Active | Granted as CN120452467B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202510962659.3A (CN120452467B) | 2025-07-14 | 2025-07-14 | Codec-based voice and background sound separation method, device, equipment and medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202510962659.3A (CN120452467B) | 2025-07-14 | 2025-07-14 | Codec-based voice and background sound separation method, device, equipment and medium

Publications (2)

Publication Number | Publication Date
CN120452467A | 2025-08-08
CN120452467B | 2025-09-16

Family

ID=96615700

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202510962659.3A (Active; granted as CN120452467B) | Codec-based voice and background sound separation method, device, equipment and medium | 2025-07-14 | 2025-07-14

Country Status (1)

Country | Link
CN (1) | CN120452467B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US6167375A* | 1997-03-17 | 2000-12-26 | Kabushiki Kaisha Toshiba | Method for encoding and decoding a speech signal including background noise
WO2016162165A1* | 2015-04-10 | 2016-10-13 | Thomson Licensing | Method and device for encoding multiple audio signals, and method and device for decoding a mixture of multiple audio signals with improved separation
US20190066713A1* | 2016-06-14 | 2019-02-28 | The Trustees Of Columbia University In The City Of New York | Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
CN115862659A* | 2022-09-27 | 2023-03-28 | Peking University | Method and device for iterative pitch estimation and speech separation based on two-way cascade framework
CN118248159A* | 2024-03-25 | 2024-06-25 | Shanghai Normal University | A joint training method for speech enhancement model based on frequency subband
CN119252225A* | 2024-10-29 | 2025-01-03 | Ping An Technology (Shenzhen) Co., Ltd. | Speech synthesis method, device, equipment and storage medium based on decoupled VQ-VAE
CN119360878A* | 2024-12-23 | 2025-01-24 | Anhui University | Speech separation method and system combining acoustic features and state space model

Also Published As

Publication number | Publication date
CN120452467B | 2025-09-16


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
