CN119517069A - Signal extraction method, device, electronic device and storage medium - Google Patents

Signal extraction method, device, electronic device and storage medium

Info

Publication number
CN119517069A
Authority
CN
China
Prior art keywords
model
signal
separator
loss
discriminator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311078260.6A
Other languages
Chinese (zh)
Inventor
刘路路
雷延强
江建亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Guangzhou Shikun Electronic Technology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Guangzhou Shikun Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd, Guangzhou Shikun Electronic Technology Co Ltd
Priority to CN202311078260.6A
Publication of CN119517069A
Legal status: Pending (current)

Abstract

The embodiment of the application provides a signal extraction method, a signal extraction device, an electronic device, and a storage medium, belonging to the technical field of signal processing. The method comprises: inputting a mixed signal to be separated into a separator model to extract a first extracted signal, wherein the separator model is obtained based on the following training steps: inputting a first mixed signal into the separator model to obtain a first output signal; inputting the first output signal and a target signal into a discriminator model for discrimination; calculating the current separator model loss and the current discriminator model loss; and performing alternating adversarial training on the separator model and the discriminator model based on the current separator model loss and the current discriminator model loss. The application can improve the separation performance of the separator model, thereby extracting a purer human voice signal.

Description

Signal extraction method, signal extraction device, electronic equipment and storage medium
Technical Field
The present application relates to the field of signal processing technologies, and in particular, to a signal extraction method, a signal extraction device, an electronic device, and a storage medium.
Background
In audio use scenarios such as voice enhancement, karaoke, and stereo-to-5.1-channel surround sound conversion, the human voice and accompaniment are extracted from the mixed music signal, and the extracted voice and accompaniment are then processed separately as required, so that the audio quality can be improved and the desired sound effect achieved. Therefore, separating the human voice and accompaniment in a mixed music signal provides a foundation for improving sound quality and adjusting sound effects.
In general, the human voice and accompaniment in a mixed music signal may be separated based on a non-negative matrix factorization technique. With the development of deep learning, the separation task of human voice and accompaniment in the mixed music signal can be realized based on a deep neural network.
However, existing deep neural networks do not separate the human voice and accompaniment in the mixed music signal cleanly. In other words, the existing vocal-accompaniment separation methods have a poor separation effect, and a purer vocal signal or purer accompaniment cannot be extracted.
Disclosure of Invention
The embodiment of the application provides a signal extraction method, a signal extraction device, an electronic device, and a storage medium, which can solve the problem that the separation effect of vocals and accompaniment is poor and a pure vocal or accompaniment signal cannot be extracted. In order to achieve the above object, the technical solution provided by the embodiments of the present application is as follows:
In a first aspect, an embodiment of the present application provides a signal extraction method, where the signal extraction method includes:
inputting a mixed signal to be separated into a separator model, and extracting a first extracted signal, wherein the separator model is obtained based on the following training steps:
Inputting the first mixed signal into the separator model to obtain a first output signal;
Inputting the first output signal and the target signal into a discriminator model for discrimination, and calculating the current separator model loss and the current discriminator model loss;
Alternating adversarial training is performed on the separator model and the discriminator model based on the current separator model loss and the current discriminator model loss.
In a second aspect, an embodiment of the present application provides a signal extraction apparatus, including:
a separation module, configured to input a mixed signal to be separated into a separator model to extract a first extracted signal, and further configured to input a first mixed signal into the separator model to obtain a first output signal;
a loss calculation module, configured to input the first output signal and a target signal into a discriminator model for discrimination, and to calculate the current separator model loss and the current discriminator model loss;
and an adversarial training module, configured to perform alternating adversarial training on the separator model and the discriminator model based on the current separator model loss and the current discriminator model loss.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory stores a computer program adapted to be loaded by the processor and to perform the signal extraction method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the signal extraction method according to the first aspect.
In the embodiment of the application, the separator model and the discriminator model can be combined, and a loss function is used to perform adversarial training on the separator model and the discriminator model, so that the two models promote each other and an ideal output signal is extracted based on the trained separator model. It can be understood that when the present application is employed to separate the vocals and accompaniment of a mixed music signal, based on the alternating adversarial training of the separator model and the discriminator model, the two models promote and compete with each other, and the separator model after adversarial training can output a signal as close as possible to the target signal (e.g., a pure vocal signal), making it difficult for the discriminator model to judge the signal output by the separator model as false. Therefore, after the mixed music signal is input into the separator model that has undergone adversarial training, a purer vocal signal can be output, and the separation effect of the signal extraction model on vocals and accompaniment is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a signal extraction method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a signal processing flow in a signal extraction method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a separator model according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a residual block in a separator model according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a signal extraction device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the following detailed description of the embodiments of the present application will be given with reference to the accompanying drawings.
It should be understood that the described embodiments are merely some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application as detailed in the accompanying claims.
In the description of the present application, it should be understood that the terms "first" and "second," etc. are used merely to distinguish similar objects from each other and are not necessarily used to describe a particular order or sequence, nor should they be construed to indicate or imply relative importance. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art according to the specific circumstances. Furthermore, in the description of the present application, unless otherwise indicated, "a plurality" means two or more. "And/or" describes an association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate three cases: A alone, both A and B, and B alone. The character "/" generally indicates that the associated objects are in an "or" relationship.
At present, mixed-signal separation methods based on deep neural networks suffer from a poor separation effect. For example, when separating the voice and accompaniment in a song, the voice and accompaniment may not be separated cleanly. In particular, for a passage of a song that contains only accompaniment, after the vocal-accompaniment separation is performed, the audio amplitude (i.e., the amplitude of the sound wave) of the obtained voice signal is not zero, and some accompaniment leaks into it. That is, when an existing deep neural network is used to separate the vocals and accompaniment of a song, the separated vocal signal is still mixed with part of the accompaniment signal, or the separated accompaniment is still mixed with part of the vocal signal.
Based on the above, the application provides a signal extraction method, a signal extraction device, an electronic device, and a storage medium. The separator model and the discriminator model are combined, and a loss function is used to perform adversarial training on the separator model and the discriminator model, so that the two models promote each other, an ideal output signal is extracted based on the trained separator model, and the signal separation effect of the separator model is improved.
It should be noted that the signal extraction method provided by the present application can be applied to any scenario in which multiple signals in a mixed signal are separated. For example, the above-mentioned signal separation scene may be that a human voice signal and/or an accompaniment signal is extracted from a song, a personal voice signal is extracted from a multi-person mixed voice, a clean voice signal is extracted from a voice signal containing noise, or the like. Hereinafter, the present application will be described in detail by taking a voice signal and an accompaniment signal extracted from a song as an example.
In practice, the signal extraction method may include inputting the mixed signal into a separator model, through which the desired kind of signal is extracted, wherein the separator model is obtained through training. For convenience of description, the actual mixed signal processed by the separator model in the inference stage may be referred to as the mixed signal to be separated, and the signal output by the separator model for the mixed signal to be separated may be referred to as the first extracted signal. In addition, any mixed signal from the training data set processed by the separator model in the training phase may be referred to as a first mixed signal, and the signal output by the separator model for the first mixed signal may be referred to as a first output signal.
In one embodiment, referring to fig. 1, a signal extraction method provided by the present application may include the following steps:
S100, inputting the first mixed signal into a separator model to be trained, and obtaining a first output signal.
In practice, when a mixed music signal obtained from the training data set is first input into the separator model to be trained, the model may generate an essentially random output signal. The separator model is trained repeatedly on the mixed signals in the training data set, and a trained separator model can thereby be constructed. The first extracted signal of the desired kind may then be extracted from the mixed signal to be separated based on the trained separator model. For ease of description, the separator model prior to completion of training may be referred to as the separator model to be trained.
In one embodiment, the first mixed signal may be a mixed music signal including a human voice signal and an accompaniment signal, and the first output signal may be a human voice signal.
In one embodiment, step S100 may specifically include steps S110 and S120 (not shown in the figure).
S110, the mixed signal is input into a separator model after data preprocessing, and a mask matrix is obtained.
S120, after the mask matrix is transformed corresponding to the data preprocessing, a first output signal is obtained.
In implementation, the mixed signal may be input into the separator model after a data preprocessing transform, and correspondingly, the output of the separator model may be subjected to the corresponding inverse transform to obtain the required first output signal.
S200, inputting the first output signal and the target signal into a discriminator model for discrimination, and calculating the current separator model loss and the current discriminator model loss.
The pre-prepared training data set may contain a large number of mixed signals and a clean target signal corresponding to each mixed signal. The type of the target signal may determine the type of the signal output by the separator model. That is, the target signal may be prepared in advance based on the kind of signal that eventually needs to be extracted. For example, if a clean human voice signal needs to be extracted from a mixed music signal, the target signal prepared in advance in the training data set is a human voice signal.
In practice, the first output signal may be input to the discriminator model together with the target signal corresponding to the first output signal to determine whether each signal is true or false. For example, the first mixed signal may be any mixed music signal obtained from a training database, and the first output signal may be a human voice signal output by the separator model to be trained based on the first mixed signal, where the human voice signal output by the separator model needs to be as close to a pure human voice signal as possible. Thus, the target signal corresponding to the first output signal may be a clean human voice signal obtained from the training database. In the model training phase, a certain gap exists between the first output signal and the target signal. Thus, after the first output signal and the target signal are input to the discriminator, the discrimination result of the discriminator for the first output signal may be false, and the discrimination result for the target signal may be true. The discriminator network may use labels to mark the discrimination results. For example, a signal discriminated as true is labeled 1, and a signal discriminated as false is labeled 0.
It will be appreciated that the human voice signal output by the separator model may be inconsistent to some extent with the true clean human voice signal. At this time, the degree of inconsistency between the first output signal of the separator model and the target signal may be measured through the discrimination result of the discriminator; that is, the loss of the separator model with respect to the current signal extraction process (which may be referred to as the current separator model loss) is calculated via the discriminator, and the current separator model loss is then used to train the separator model and improve its signal separation effect. Of course, the loss of the discriminator model in the current true/false discrimination process (which may be referred to as the current discriminator model loss) can also be calculated via the discriminator, and the current discriminator model loss is used to train the discriminator model and improve its discrimination effect.
It is worth mentioning that the training of the separator model aims at having its output signal discriminated as true by the discriminator model. The training of the discriminator model aims at discriminating the true target signal as true and discriminating the signal output from the separator model as false. Therefore, when the current separator model loss is calculated, the human voice signal output by the current separator model can be marked as true; and when the current discriminator model loss is calculated, the target signal can be marked as true and the human voice signal output by the current separator model can be marked as false.
S300, based on the current separator model loss and the current discriminator model loss, alternating adversarial training is performed on the separator model and the discriminator model.
In practice, the initial separator model does not perform well in separating mixed signals. After the mixed music signal obtained from the training data set is input into the initial separator model, the extracted human voice signal differs greatly from the real pure human voice signal and is easily judged as false by the discriminator model. Thus, at the beginning, the first output signal output by the separator model and the target signal can be directly input into the discriminator model for training. Then, by performing alternating adversarial training on the separator model and the discriminator model, a separator model with a better separation effect can be obtained quickly.
It should be noted that, in the process of alternating adversarial training of the separator model and the discriminator model, if the training samples in the training data set are sufficient, then after either model updates its parameters in a training step, a new mixed signal is obtained to train the other model in the next step. If there are fewer training samples in the training data set, the training data set may be reused for the alternating adversarial training of the separator model and the discriminator model after each training sample has been used once.
In one embodiment, step S300 may specifically include: when the current separator model loss is calculated, fixing the model parameters of the discriminator model and back-propagating the current separator model loss to the separator model to adjust the model parameters of the separator model; and when the current discriminator model loss is calculated, fixing the model parameters of the separator model and back-propagating the current discriminator model loss to the discriminator model to adjust the model parameters of the discriminator model.
In implementation, after the value (loss) of the loss function of the separator model or the discriminator model is calculated, the corresponding model can back-propagate the loss to update its weight coefficients. Specifically, the loss back-propagation of the separator model may update the weight coefficients of the separator model once, and then the loss back-propagation of the discriminator model may update the weight coefficients of the discriminator model once. The separator model and the discriminator model are thus trained alternately and cyclically until the models converge. A minimal sketch of this alternating update is given below.
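The following is a minimal PyTorch-style sketch of the alternating adversarial update described above. The model objects, optimizers, and loss helpers (`separator`, `discriminator`, `separator_loss_fn`, `discriminator_loss_fn`) are illustrative assumptions, not the patent's implementation.

```python
# Sketch of one alternating adversarial training step, assuming PyTorch models and optimizers.
import torch

def train_step(separator, discriminator, opt_s, opt_d, x_mix, x_target,
               separator_loss_fn, discriminator_loss_fn):
    # 1) Update the discriminator: the separator's parameters stay fixed.
    with torch.no_grad():
        x_fake = separator(x_mix)                      # first output signal
    d_loss = discriminator_loss_fn(discriminator(x_target), discriminator(x_fake))
    opt_d.zero_grad()
    d_loss.backward()                                  # back-propagate into the discriminator only
    opt_d.step()

    # 2) Update the separator: the discriminator's parameters stay fixed (frozen).
    for p in discriminator.parameters():
        p.requires_grad_(False)
    x_fake = separator(x_mix)
    s_loss = separator_loss_fn(discriminator(x_fake), x_fake, x_target)
    opt_s.zero_grad()
    s_loss.backward()                                  # back-propagate into the separator only
    opt_s.step()
    for p in discriminator.parameters():
        p.requires_grad_(True)
    return s_loss.item(), d_loss.item()
```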
S400, inputting the mixed signal to be separated into a trained separator model, and extracting a first extracted signal.
It will be appreciated that, through the adversarial training described above, the separator model can output a human voice signal as close as possible to a clean human voice signal. Thus, after the training of the separator model is finished, when the mixed music signal is input to the separator model, a first extracted signal realistic enough to pass for the real signal can be output.
In one embodiment, step S400 may specifically include steps S410 and S420 (not shown in the figures).
S410, the mixed signal is input into a separator model after data preprocessing, and a mask matrix is obtained.
S420, performing transformation corresponding to data preprocessing on the mask matrix to obtain a first extracted signal.
In one embodiment, whether the separator model has completed training may be determined based on the loss function value of the separator. Correspondingly, before step S400, the signal extraction method may further comprise: calculating a loss difference between the current separator model loss and the last calculated historical separator model loss, and determining the separator model to be trained as the trained separator model when the loss difference is not higher than a preset difference.
In practice, when the loss of the separator model tends to be stable, i.e., the loss of the separator model essentially stops decreasing, the separator model has a good separation effect; training of the separator can be stopped at this point, and the separator model is used as the trained separator model to extract the finally required output signal from the mixed signal to be separated. Of course, it is also possible to determine when the separator model has completed training based on other iteration termination strategies. For example, training may be terminated when the number of training iterations of the separator model reaches a preset number, or when the training duration of the separator reaches a preset training duration, etc., which is not limited by the present application.
It is worth mentioning that the separator model and the discriminator model are continuously subjected to alternating adversarial training and promote and constrain each other. In this way, the loss of the separator model may decrease and then increase during the adversarial training process, and may fluctuate within a range. Thus, "the loss of the separator model essentially no longer decreases" should be understood to mean that the decrease is small, or that the loss fluctuates within a small range after decreasing to a certain value. A small illustrative check for this stopping criterion is sketched below.
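As an illustration of the loss-difference stopping criterion described above, here is a minimal Python sketch; the threshold value and the use of an absolute difference are assumptions for the example.

```python
# Illustrative stopping check based on the loss difference between consecutive evaluations.
from typing import Optional

def training_finished(current_loss: float, previous_loss: Optional[float],
                      preset_diff: float = 1e-4) -> bool:
    # Treat the separator as converged when its loss essentially stops decreasing.
    return previous_loss is not None and abs(previous_loss - current_loss) <= preset_diff
```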
In one embodiment, after step S400, in order to obtain the other signal in the mixed signal to be separated (which may be referred to as a second extracted signal), the first extracted signal may be subtracted from the mixed signal to be separated to extract the second extracted signal.
Referring to fig. 2, the present embodiment provides specific steps for adversarial training of the separator model and the discriminator model. The separator model may be a masking network model, and the discriminator model may be a GAN (Generative Adversarial Network) model. Steps S111 to S114 are the specific processing of step S110, steps S121 to S123 are the specific processing of step S120, and steps S201 to S202 are the specific processing of step S200.
S111, inputting the mixed music signal xmix into the separator model, and performing short-time Fourier transform analysis on the mixed music signal xmix to obtain a complex spectrogram of the mixed signal.
In one embodiment, the separator model may specifically employ a U-shaped residual network model. Referring to fig. 3, the separator model may include several encoding modules (Encoder Block), several decoding modules (Decoder Block), and a middle connecting layer, which may employ a convolution module (Conv Block). For example, the embodiment shown in fig. 3 employs 4 Encoder modules and 4 Decoder modules. The Encoder module consists of a residual convolution and an average pooling layer, the Decoder module consists of a transposed convolution and a residual convolution, and the middle Conv module consists of a residual convolution. In fig. 3, the symbol "->" indicates the channel counts of each module. For example, "1->32" indicates that the number of input channels is 1 and the number of output channels is 32, and "32->1" indicates that the number of input channels is 32 and the number of output channels is 1.
In this embodiment, the residual convolution structure in the Encoder, Decoder, and intermediate Conv modules can be seen in fig. 4; it includes a normalization layer (BatchNorm), an activation function layer (LeakyReLU), and a convolution layer (Conv2d). The convolution kernels of the residual convolutions all have a size of 3×3. In addition, the average pooling layer of the Encoder modules uses a kernel of size 2×2, and the transposed convolution layer of the Decoder modules uses a convolution kernel of size 3×3. An illustrative sketch of such a residual block is given after the following note.
It should be noted that, the respective modules may also use other convolution kernel sizes, such as 1x1, 5x5, 7x7, etc., which is not limited by the present application.
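As an illustration of the residual convolution and Encoder structure described above, here is a hedged PyTorch sketch. The number of convolution layers per residual block, the LeakyReLU slope, and the 1×1 projection on the skip path are assumptions for the example; only the BatchNorm, LeakyReLU, 3×3 Conv2d ordering, the 2×2 average pooling, and the channel counts follow the description.

```python
# Sketch of a residual convolution block and Encoder block, assuming two conv layers per block.
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.LeakyReLU(0.2),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )
        # 1x1 projection so the skip connection matches the output channel count.
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x) + self.skip(x)

class EncoderBlock(nn.Module):
    """Residual convolution followed by 2x2 average pooling, as described for the Encoder module."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.res = ResidualConvBlock(in_ch, out_ch)
        self.pool = nn.AvgPool2d(kernel_size=2)

    def forward(self, x: torch.Tensor):
        features = self.res(x)
        return self.pool(features), features  # pre-pooled features feed the U-shaped skip connection
```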
S112, obtaining the amplitude spectrum Xmix and the original phase of the mixed signal from the complex spectrum.
In practice, the complex spectrum of a signal (complex spectrum for short) comprises two parts, namely a magnitude spectrum and a phase spectrum. The amplitude spectrum is understood as a relation curve between amplitude and frequency obtained after amplitude is taken from the complex spectrum, and the phase spectrum is understood as a relation curve between phase and frequency obtained after phase is taken from the complex spectrum.
S113, inputting the magnitude spectrum Xmix into a separator in a separator model.
S114, the separator outputs a mask matrix M.
S121, multiplying the mask matrix M element-wise (point multiplication) with the amplitude spectrum Xmix of the original signal to obtain the output amplitude spectrum, namely X1 = Xmix ⊙ M.
S122, the complex spectrum composed of the output amplitude spectrum X1 and the original phase is used as the output complex spectrum of the separator.
S123, performing an inverse short-time Fourier transform on the output complex spectrum to obtain the first output signal x1 of the separator.
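The S111-S123 pipeline above can be summarized in a short sketch. The FFT size, hop length, window choice, and the tensor shape expected by the separator network are assumptions for illustration; only the STFT, magnitude/phase split, masking, point multiplication, original-phase recombination, and inverse STFT flow follows the description.

```python
# Sketch of the separator pipeline: STFT -> magnitude/phase -> mask -> masked magnitude
# recombined with the original phase -> inverse STFT.
import torch

def separate(separator, x_mix: torch.Tensor, n_fft: int = 1024, hop: int = 256) -> torch.Tensor:
    window = torch.hann_window(n_fft, device=x_mix.device)
    spec = torch.stft(x_mix, n_fft, hop_length=hop, window=window, return_complex=True)
    magnitude = spec.abs()                               # amplitude spectrum X_mix
    phase = torch.angle(spec)                            # original phase
    mask = separator(magnitude.unsqueeze(1)).squeeze(1)  # mask matrix M from the separator network
    out_mag = magnitude * mask                           # X_1 = X_mix ⊙ M (point multiplication)
    out_spec = torch.polar(out_mag, phase)               # recombine with the original phase
    x1 = torch.istft(out_spec, n_fft, hop_length=hop, window=window, length=x_mix.shape[-1])
    return x1                                            # first output / extracted signal

# The remaining signal (e.g., accompaniment) can then be obtained as x2 = x_mix - x1.
```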
S201, the output signal x1 of the separator and the target signal x0 are input to the discriminator.
In one embodiment, the discriminator model may specifically employ the HiFi-GAN discriminator model. The HiFi-GAN discriminator model may include two discriminators: a Multi-Period Discriminator (MPD), which can be used to identify signals of different periods, and a Multi-Scale Discriminator (MSD), which can be used to cope with very long data.
In practice, the multi-period discriminator may include a plurality of sub-discriminators, each with a different period p. Each sub-discriminator converts the audio into 2-D (two-dimensional) data according to the value of the period p, and then processes the data using a CNN (Convolutional Neural Network) with a convolution kernel of k×1, that is, a CNN whose kernel height is k and kernel width is 1. By using different periods p, the patterns of signals with different periods can be captured. In addition, the multi-scale discriminator may also comprise a plurality of sub-discriminators, each of which applies average pooling of a different length before the CNN is used to discriminate the signal input into the discriminator model. An illustrative sketch of the period-based reshaping is given below.
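Here is a hedged sketch of how a period-p sub-discriminator might reshape 1-D audio into 2-D data before applying k×1 convolutions. The reflection padding strategy and the input shape are assumptions for the example.

```python
# Illustrative reshaping of 1-D audio into 2-D data for a period-p sub-discriminator.
import torch
import torch.nn.functional as F

def reshape_by_period(audio: torch.Tensor, p: int) -> torch.Tensor:
    """audio: (batch, 1, samples) -> (batch, 1, samples // p, p)."""
    b, c, t = audio.shape
    if t % p:                                     # pad so the length is divisible by the period p
        audio = F.pad(audio, (0, p - t % p), mode="reflect")
        t = audio.shape[-1]
    # Each row now holds one period of samples; k x 1 convolutions then look across periods.
    return audio.view(b, c, t // p, p)
```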
S202, the discriminator discriminates the input signal.
In practice, the discriminator model may discriminate the output signals of the separator from two different angles, based on the multi-period discriminator and the multi-scale discriminator respectively. A pure human voice signal can be judged as true, and an unclean human voice signal can be judged as false.
S300, based on the current separator model loss and the current discriminator model loss, alternating adversarial training is carried out on the separator model to be trained and the discriminator model to be trained.
In practice, the loss of the separator model may be composed of three parts: a generative adversarial loss (GAN Loss), a mel-spectrogram loss (Mel-Spectrogram Loss), and a feature matching loss (Feature Matching Loss). Specifically, the above losses can be calculated using the following loss functions, respectively:
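The original formula images do not survive in this text version; the following LaTeX reconstruction is a sketch consistent with the symbol definitions given below and with the usual HiFi-GAN-style formulation (least-squares adversarial loss and L1 mel-spectrogram and feature-matching losses), and should be read as an assumption rather than the patent's exact notation:

$$\mathcal{L}_{GAN}(S;D)=\mathbb{E}_{x_{mix}}\Big[\big(D(S(x_{mix}))-1\big)^{2}\Big]$$

$$\mathcal{L}_{mel}(S)=\mathbb{E}_{(x_{0},\,x_{mix})}\Big[\big\lVert \chi(x_{0})-\chi\big(S(x_{mix})\big)\big\rVert_{1}\Big]$$

$$\mathcal{L}_{FM}(S;D)=\mathbb{E}_{(x_{0},\,x_{mix})}\Big[\sum_{i=1}^{T}\frac{1}{N_{i}}\big\lVert D^{i}(x_{0})-D^{i}\big(S(x_{mix})\big)\big\rVert_{1}\Big]$$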
where S denotes the separator, D denotes the discriminator, x0 denotes the target signal of the separator, xmix denotes the input signal of the separator, χ(·) denotes the operation of converting a time-domain signal into a mel-spectrogram, T denotes the number of feature layers extracted in the discriminator, D^i denotes the features of the i-th layer of the discriminator, and Ni denotes the number of features in the i-th layer of the discriminator.
In practice, the loss of the discriminator model may be calculated using the following loss function:
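Again, the formula image is missing here; a least-squares GAN discriminator loss consistent with the surrounding description is assumed:

$$\mathcal{L}_{Adv}(D;S)=\mathbb{E}_{(x_{0},\,x_{mix})}\Big[\big(D(x_{0})-1\big)^{2}+\big(D(S(x_{mix}))\big)^{2}\Big]$$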
Based on the above loss functions, with the discriminator further decomposed into sub-discriminators, the overall losses of the discriminator model and the separator model can be expressed as:
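The following totals are a reconstruction under the same assumption, summing over the sub-discriminators and weighting the mel-spectrogram and feature-matching terms with the adjustable hyperparameters mentioned below:

$$\mathcal{L}_{D}=\sum_{j}\mathcal{L}_{Adv}(D_{j};S)$$

$$\mathcal{L}_{S}=\sum_{j}\Big[\mathcal{L}_{GAN}(S;D_{j})+\lambda_{fm}\,\mathcal{L}_{FM}(S;D_{j})\Big]+\lambda_{mel}\,\mathcal{L}_{mel}(S)$$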
where Dj represents the j-th sub-discriminator of the multi-period discriminator and the multi-scale discriminator, and λmel and λfm represent adjustable hyperparameters.
In practice, for ease of understanding, the process of alternating adversarial training of the separator model and the discriminator model to be trained may be specifically as follows:
When the mixed music signal obtained from the training data set is input into the separator model for the first time and a human voice signal is output, the human voice signal output by the separator model and the target signal (i.e., the real pure human voice signal) can be directly input into the discriminator model to train the initial discriminator model. The discriminator model is required to discriminate the human voice signal as false and the target signal as true.
Then, a new mixed signal is obtained from the training data set and input into the separator model; the human voice signal output by the separator model and the target signal corresponding to the mixed signal are discriminated by the discriminator model, and the separator model loss is calculated. A back-propagation algorithm can then be used to propagate this loss from the output layer to the input layer of the separator model and update the parameters of the separator model, so that the human voice signal output by the separator model can be judged as true by the discriminator model.
At the next training step, the discriminator model loss can be calculated; then, using a back-propagation algorithm, this loss is propagated from the output layer to the input layer of the discriminator model to update the parameters of the discriminator model, so that it can distinguish the human voice signal output by the separator model from the target signal.
In this way, the separator model and the discriminator model undergo repeated alternating iterative training and continuously update their respective model parameters, so that their loss functions converge and an accurate separator model structure is constructed. Then, the actual music signal xmix to be separated is input into the trained separator model, and the final human voice signal x1 can be extracted. The final human voice signal x1 may then be subtracted from the mixed music signal xmix to obtain the final accompaniment signal x2.
Of course, in another embodiment, the first output signal may be an accompaniment signal, and the target signal corresponding to the first output signal may be a pure accompaniment signal obtained from the training database. Alternatively, the first output signal may be the signal of a certain instrument in the accompaniment, and the corresponding target signal may be a clean signal of that instrument obtained from the training database. That is, the kind of the first output signal to be extracted can be determined by the target signal; after the separator model and the discriminator model undergo alternating adversarial training, the mixed signal actually to be separated is input into the trained separator model, and the finally required first extracted signal can be extracted. Then, the first extracted signal extracted by the separator model is subtracted from the mixed music signal to obtain the other finally required signal. For the specific implementation process, reference may be made to the above embodiments, which are not repeated here.
It should be noted that, in addition to the specific choice of a U-shaped residual network model for the separator model and a HiFi-GAN discriminator model for the discriminator model, the separator model and the discriminator model may also use other network models to form the generative adversarial network, which is not limited by the present application.
In one embodiment, step S410 after separator model training may include steps S411-S414, and step S420 may include steps S421-S423.
S411, inputting the mixed music signal xmix into the separator model, and performing short-time Fourier transform analysis on the mixed music signal xmix to obtain a complex spectrogram of the mixed signal.
S412, obtaining the amplitude spectrum Xmix and the original phase of the mixed signal from the complex spectrum.
S413, inputting the magnitude spectrum Xmix into a separator in the separator model.
S414, the separator outputs a mask matrix M.
S421, multiplying the mask matrix M element-wise (point multiplication) with the amplitude spectrum Xmix of the original signal to obtain the output amplitude spectrum, namely X1 = Xmix ⊙ M.
S422, the complex spectrum composed of the output amplitude spectrum X1 and the original phase is used as the output complex spectrum of the separator.
S423, performing an inverse short-time Fourier transform on the output complex spectrum to obtain the first extracted signal x1 of the separator.
It should be noted that steps S411-S414 are similar to steps S111-S114, and steps S421-S423 are similar to steps S121-S123, which are not described herein.
The application combines the mask network model with the GAN network, i.e., a discriminator model is added on the basis of the separator model, and the loss function can be used to perform adversarial training on the separator model and the discriminator model, so that the two models promote each other and the separation performance of the separator model is improved, thereby extracting the desired output signal based on the trained separator model. It can be understood that when the present application is employed to separate the vocals and accompaniment of a mixed music signal, based on the alternating adversarial training of the separator model and the discriminator model, the two models promote and compete with each other, and the separator model after adversarial training can output a signal as close as possible to the target signal (e.g., a pure vocal signal), making it difficult for the discriminator model to judge the signal output by the separator model as false. Therefore, after the mixed music signal is input into the separator model that has undergone adversarial training, a purer vocal signal can be output, and the separation effect of the signal extraction model on vocals and accompaniment is improved.
Based on the same technical concept, the embodiment of the present application further provides a signal extraction device, referring to fig. 5, the signal extraction device may include:
a separation module, configured to input a mixed signal to be separated into a separator model to extract a first extracted signal, and further configured to input a first mixed signal into the separator model to obtain a first output signal;
The loss calculation module is used for inputting the first output signal and the target signal into the discriminator model for discrimination and calculating the current separator model loss and the current discriminator model loss;
and an adversarial training module, configured to perform alternating adversarial training on the separator model and the discriminator model based on the current separator model loss and the current discriminator model loss.
In one embodiment, the separator model is a U-shaped residual network model and the discriminator model is a HiFi-GAN discriminator model.
In one embodiment, the adversarial training module is specifically configured to:
when the current discriminator model loss is calculated, the model parameters of the separator model are fixed and the current discriminator model loss is back-propagated to the discriminator model to adjust the model parameters of the discriminator model;
when the current separator model loss is calculated, model parameters of the discriminator model are fixed and the current separator model loss is back-propagated to the separator model to adjust the model parameters of the separator model.
In one embodiment, the separation module is further to:
The first extracted signal is subtracted from the mixed signal to be separated, and the second extracted signal is extracted.
In one embodiment, the separation module is specifically configured to:
performing data preprocessing on the mixed signal to be separated and then inputting it into the separator model to obtain a mask matrix;
and transforming the mask matrix corresponding to the data preprocessing to obtain a first extracted signal.
In one embodiment, the separation module is specifically configured to:
performing short-time Fourier transform on the mixed signal to be separated to obtain a complex spectrogram of the mixed signal;
Calculating an amplitude spectrum and an original phase of the mixed signal to be separated based on a complex spectrum diagram of the mixed signal to be separated;
And inputting the amplitude spectrum of the mixed signal to be separated into a separator model to obtain a mask matrix.
In one embodiment, the separation module is further specifically configured to:
performing point multiplication on the magnitude spectrum of the mixed signal to be separated and the mask matrix to obtain an output magnitude spectrum;
combining the output amplitude spectrum with the original phase to obtain an output complex spectrum;
and performing an inverse short-time Fourier transform on the output complex spectrum to obtain the first extracted signal.
In one embodiment, the loss calculation module is further to:
Calculating a loss difference value between the current separator model loss and the last calculated historical separator model loss;
and when the loss difference is not higher than the preset difference, determining that the separator model has completed the adversarial training.
It should be noted that, in the signal extraction apparatus provided in the foregoing embodiment, when the signal extraction method is executed, only the division of the foregoing functional modules is used as an example, in practical application, the foregoing functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the signal extraction device and the signal extraction method provided in the foregoing embodiments belong to the same concept, and the implementation principle and the technical effects to be achieved may refer to the method embodiments, which are not described herein.
Based on the same technical concept, the embodiment of the application also provides an electronic device, referring to fig. 6, which comprises a processor and a memory, wherein the memory stores a computer program, and the computer program is suitable for being loaded by the processor and executing the signal extraction method according to any one of the embodiments. The implementation principle of the electronic device for signal extraction and the technical effects to be achieved can refer to the method embodiment, and the application is not described herein.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or may be implemented by hardware. Based on such understanding, the above technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, and the software product of the method for extracting the human voice and accompaniment may be stored in a computer readable storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including storing several instructions for causing an electronic device to perform the method for extracting the human voice and accompaniment described in the various embodiments or some parts of the embodiments.
The foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the application are intended to be included within the scope of the application.

Claims (11)

Translated from Chinese
1. A signal extraction method, characterized in that the signal extraction method comprises:
inputting a mixed signal to be separated into a separator model to extract a first extracted signal;
wherein the separator model is obtained based on the following training steps:
inputting a first mixed signal into the separator model to obtain a first output signal;
inputting the first output signal and a target signal into a discriminator model for discrimination, and calculating a current separator model loss and a current discriminator model loss;
performing alternating adversarial training on the separator model and the discriminator model based on the current separator model loss and the current discriminator model loss.
2. The signal extraction method according to claim 1, characterized in that the separator model is a U-shaped residual network model and the discriminator model is a HiFi-GAN discriminator model.
3. The signal extraction method according to claim 1, characterized in that performing alternating adversarial training on the separator model and the discriminator model based on the current separator model loss and the current discriminator model loss comprises:
when the current separator model loss is calculated, fixing the model parameters of the discriminator model, and back-propagating the current separator model loss to the separator model to adjust the model parameters of the separator model;
when the current discriminator model loss is calculated, fixing the model parameters of the separator model, and back-propagating the current discriminator model loss to the discriminator model to adjust the model parameters of the discriminator model.
4. The signal extraction method according to claim 1, characterized in that after inputting the mixed signal to be separated into the separator model to extract the first extracted signal, the signal extraction method further comprises:
subtracting the first extracted signal from the mixed signal to be separated to extract a second extracted signal.
5. The signal extraction method according to claim 1, characterized in that inputting the mixed signal to be separated into the separator model to extract the first extracted signal comprises:
performing data preprocessing on the mixed signal to be separated and then inputting it into the separator model to obtain a mask matrix;
performing a transformation corresponding to the data preprocessing on the mask matrix to obtain the first extracted signal.
6. The signal extraction method according to claim 5, characterized in that performing data preprocessing on the mixed signal to be separated and then inputting it into the separator model to obtain the mask matrix comprises:
performing a short-time Fourier transform on the mixed signal to be separated to obtain a complex spectrogram of the mixed signal to be separated;
calculating an amplitude spectrum and an original phase of the mixed signal to be separated based on the complex spectrogram of the mixed signal to be separated;
inputting the amplitude spectrum of the mixed signal to be separated into the separator model to obtain the mask matrix.
7. The signal extraction method according to claim 6, characterized in that performing a transformation corresponding to the data preprocessing on the mask matrix to obtain the first extracted signal comprises:
performing point multiplication of the amplitude spectrum of the mixed signal to be separated with the mask matrix to obtain an output amplitude spectrum;
combining the output amplitude spectrum with the original phase to obtain an output complex spectrum;
performing an inverse short-time Fourier transform on the output complex spectrum to obtain the first extracted signal.
8. The signal extraction method according to claim 1, characterized in that the training steps of the separator model further comprise:
calculating a loss difference between the current separator model loss and the last calculated historical separator model loss;
when the loss difference is not higher than a preset difference, determining that the separator model has completed the adversarial training.
9. A signal extraction apparatus, characterized in that the signal extraction apparatus comprises:
a separation module, configured to input a mixed signal to be separated into a separator model to extract a first extracted signal, and further configured to input a first mixed signal into the separator model to obtain a first output signal;
a loss calculation module, configured to input the first output signal and a target signal into a discriminator model for discrimination, and to calculate a current separator model loss and a current discriminator model loss;
an adversarial training module, configured to perform alternating adversarial training on the separator model and the discriminator model based on the current separator model loss and the current discriminator model loss.
10. An electronic device, characterized in that the electronic device comprises a processor and a memory, wherein the memory stores a computer program, and the computer program is adapted to be loaded by the processor to execute the signal extraction method according to any one of claims 1 to 8.
11. A computer-readable storage medium, characterized in that the computer storage medium stores a plurality of instructions, and the instructions are adapted to be loaded by a processor to execute the signal extraction method according to any one of claims 1 to 8.
CN202311078260.6A | Priority date: 2023-08-24 | Filing date: 2023-08-24 | Signal extraction method, device, electronic device and storage medium | Status: Pending | Published as CN119517069A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311078260.6A (CN119517069A (en)) | 2023-08-24 | 2023-08-24 | Signal extraction method, device, electronic device and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202311078260.6A (CN119517069A (en)) | 2023-08-24 | 2023-08-24 | Signal extraction method, device, electronic device and storage medium

Publications (1)

Publication Number | Publication Date
CN119517069A | 2025-02-25

Family

ID=94660213

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202311078260.6A (CN119517069A (en), Pending) | Signal extraction method, device, electronic device and storage medium | 2023-08-24 | 2023-08-24

Country Status (1)

Country | Link
CN (1) | CN119517069A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110544488A (en)* | 2018-08-09 | 2019-12-06 | Tencent Technology (Shenzhen) Co Ltd | Method and device for separating multiple voices
CN111968669A (en)* | 2020-07-28 | 2020-11-20 | Anhui University | Multi-element mixed sound signal separation method and device
CN112331218A (en)* | 2020-09-29 | 2021-02-05 | Beijing Qingwei Intelligent Technology Co Ltd | Single-channel voice separation method and device for multiple speakers
CN116230001A (en)* | 2023-03-10 | 2023-06-06 | Agricultural Bank of China | Mixed voice separation method, device, equipment and storage medium
CN116580690A (en)* | 2023-05-26 | 2023-08-11 | Ping An Technology (Shenzhen) Co Ltd | Speech synthesis method, device, computer equipment and medium based on artificial intelligence


Similar Documents

Publication | Title
Stöter et al. | CountNet: Estimating the number of concurrent speakers using supervised learning
Massoudi et al. | Urban sound classification using CNN
CN114373476B (en) | Sound scene classification method based on multi-scale residual error attention network
Fan et al. | SVSGAN: singing voice separation via generative adversarial network
AU2020102038A4 (en) | A speaker identification method based on deep learning
CN112562741B (en) | Singing voice detection method based on dot product self-attention convolution neural network
Chien et al. | Bayesian factorization and learning for monaural source separation
Zhang et al. | A pairwise algorithm using the deep stacking network for speech separation and pitch estimation
Fan et al. | Singing voice separation and pitch extraction from monaural polyphonic audio music via DNN and adaptive pitch tracking
Prachi et al. | Deep learning based speaker recognition system with CNN and LSTM techniques
CN117789699B (en) | Speech recognition method, device, electronic device and computer-readable storage medium
CN114299918B (en) | Acoustic model training and speech synthesis method, device and system and storage medium
CN111048097A (en) | A Siamese Network Voiceprint Recognition Method Based on 3D Convolution
CN111653267A (en) | Rapid language identification method based on time delay neural network
Liu et al. | A Speech Emotion Recognition Framework for Better Discrimination of Confusions
Leelavathi et al. | Speech emotion recognition using LSTM
CN117275510A (en) | A small-sample underwater acoustic target recognition method and system based on multi-gradient flow network
CN113284513A (en) | Method and device for detecting false voice based on phoneme duration characteristics
Bavu et al. | TimeScaleNet: A multiresolution approach for raw audio recognition using learnable biquadratic IIR filters and residual networks of depthwise-separable one-dimensional atrous convolutions
Jaleel et al. | Gender identification from speech recognition using machine learning techniques and convolutional neural networks
Wu et al. | The DKU-LENOVO Systems for the INTERSPEECH 2019 Computational Paralinguistic Challenge
CN119559956A (en) | A sound conversion method, device, equipment and medium based on artificial intelligence
CN113870896A (en) | Motion sound false judgment method and device based on time-frequency graph and convolutional neural network
CN118230722A (en) | Intelligent voice recognition method and system based on AI
Huilian et al. | Speech emotion recognition based on BLSTM and CNN feature fusion

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
