Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the following detailed description of the embodiments of the present application will be given with reference to the accompanying drawings.
It should be understood that the described embodiments are merely some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application as detailed in the accompanying claims.
In the description of the present application, it should be understood that the terms "first" and "second," etc. are used merely to distinguish similar objects from each other; they do not necessarily describe a particular order or sequence, nor should they be construed to indicate or imply relative importance. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art according to the specific circumstances. Furthermore, in the description of the present application, unless otherwise indicated, "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate three cases: A alone, both A and B, and B alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
At present, mixed-signal separation methods based on deep neural networks suffer from a poor separation effect. For example, when separating the voice and the accompaniment in a song, the voice and the accompaniment may not be separated cleanly. In particular, when a certain passage of a song contains only accompaniment, after separation processing is performed on the song, the audio amplitude (i.e., the amplitude of the sound wave) of the obtained voice signal is not zero, and some accompaniment leaks into it. That is, when an existing deep neural network is used to separate the voice and accompaniment of a song, the separated voice signal is still mixed with part of the accompaniment signal, or the separated accompaniment signal is still mixed with part of the voice signal.
Based on the above, the application provides a signal extraction method, a signal extraction device, an electronic device, and a storage medium. A separator model and a discriminator model are combined, and loss functions are used to conduct countermeasure training on the separator model and the discriminator model, so that the two models promote each other; an ideal output signal is then extracted based on the trained separator model, and the signal separation effect of the separator model is improved.
It should be noted that the signal extraction method provided by the present application can be applied to any scenario in which multiple signals in a mixed signal are separated. For example, the above-mentioned signal separation scene may be that a human voice signal and/or an accompaniment signal is extracted from a song, a personal voice signal is extracted from a multi-person mixed voice, a clean voice signal is extracted from a voice signal containing noise, or the like. Hereinafter, the present application will be described in detail by taking a voice signal and an accompaniment signal extracted from a song as an example.
In practice, the signal extraction method may include inputting the mixed signal into a separator model, through which the desired kind of signal is extracted. The separator model is obtained based on a training step. For convenience of description, the actual mixed signal processed by the separator model in the inference stage may be referred to as a mixed signal to be separated, and the signal output by the separator model for the mixed signal to be separated may be referred to as a first extracted signal. In addition, any one of the mixed signals from the training data set processed by the separator model in the training phase may be referred to as a first mixed signal, and the signal output by the separator model for the first mixed signal may be referred to as a first output signal.
In one embodiment, referring to fig. 1, a signal extraction method provided by the present application may include the following steps:
S100, inputting the first mixed signal into a separator model to be trained, and obtaining a first output signal.
In practice, when a mixed music signal obtained from the training data set is first input into the separator model to be trained, the separator model may produce an essentially random output signal. The separator model is trained several times based on several mixed signals in the training data set, so that a trained separator model can be constructed. The first extracted signal of the desired kind may then be extracted from the mixed signal to be separated based on the trained separator model. For ease of description, the separator model prior to completion of training may be referred to as the separator model to be trained.
In one embodiment, the first mixed signal may be a mixed music signal including a human voice signal and an accompaniment signal, and the first output signal may be a human voice signal.
In one embodiment, step S100 may specifically include steps S110 and S120 (not shown in the figure).
S110, performing data preprocessing on the mixed signal and inputting it into the separator model to obtain a mask matrix.
S120, performing a transformation corresponding to the data preprocessing on the mask matrix to obtain the first output signal.
In implementation, the mixed signal may be input into the separator model after a data conversion process; correspondingly, when the separator model produces its output, the output may be subjected to the corresponding inverse process to obtain the required first output signal.
S200, inputting the first output signal and the target signal into a discriminator model for discrimination, and calculating the current separator model loss and the current discriminator model loss.
The pre-prepared training data set may contain a large number of mixed signals and a clean target signal corresponding to each mixed signal. The type of the target signal determines the type of signal output by the separator model. That is, the target signal may be prepared in advance based on the kind of signal that ultimately needs to be extracted. For example, if a clean human voice signal needs to be extracted from a mixed music signal, the target signal prepared in advance in the training data set is a human voice signal.
In practice, the first output signal may be input to the discriminator model together with the target signal corresponding to the first output signal, to determine whether each signal is true or false. For example, the first mixed signal may be any mixed music signal obtained from a training database, and the first output signal may be a human voice signal output by the separator model to be trained based on the first mixed signal, where the human voice signal output by the separator model needs to be as close to a pure human voice signal as possible. Thus, the target signal corresponding to the first output signal may be a clean human voice signal obtained from the training database. In the model training phase, a certain gap exists between the first output signal and the target signal. Thus, after the first output signal and the target signal are input to the discriminator, the discrimination result of the discriminator for the first output signal may be false, and the discrimination result for the target signal may be true. The discriminator network may use labels to mark the discrimination results; for example, a signal discriminated as true is labeled 1, and a signal discriminated as false is labeled 0.
It will be appreciated that the human voice signal output by the separator model may be inconsistent to some extent with the true clean human voice signal. The degree of inconsistency between the first output signal of the separator model and the target signal may be measured by the discrimination result of the discriminator. That is, the loss of the separator model with respect to the current signal extraction process (which may be referred to as the current separator model loss) is calculated via the discriminator, and the current separator model loss is then used to train the separator model to improve its signal separation effect. Likewise, the loss of the discriminator model in the current true/false discrimination process (which may be referred to as the current discriminator model loss) can also be calculated, and the current discriminator model loss is used to train the discriminator model to improve its discrimination effect.
It is worth mentioning that the training goal of the separator model is for its output signal to be discriminated as true by the discriminator model. The training goal of the discriminator model is to discriminate the true target signal as true and the signal output by the separator model as false. Therefore, when the current separator model loss is calculated, the human voice signal output by the current separator model can be labeled as true; when the current discriminator model loss is calculated, the target signal can be labeled as true and the human voice signal output by the current separator model can be labeled as false.
S300, based on the current separator model loss and the current discriminator model loss, alternating countermeasure training is performed on the separator model and the discriminator model.
In practice, the initial separator model does not perform well at separating mixed signals. After a mixed music signal obtained from the training data set is input into the initial separator model, the difference between the extracted human voice signal and the real pure human voice signal is large, and the human voice signal is easily discriminated as false by the discriminator model. Thus, at the beginning, the first output signal produced by the separator model and the target signal can be directly input into the discriminator model for training. Then, by performing alternating countermeasure training on the separator model and the discriminator model, a separator model with a better separation effect can be obtained quickly.
It should be noted that, in the process of alternating countermeasure training between the separator model and the discriminator model, if the training samples in the training data set are sufficient, after either model updates its parameters, a new mixed signal is obtained to train the other model next time. If there are fewer training samples in the training data set, the training data set may be reused for the alternating countermeasure training of the separator model and the discriminator model after each training sample has been used once.
In one embodiment, step S300 may specifically include: when the current discriminator model loss is calculated, fixing the model parameters of the separator model and back-propagating the current discriminator model loss to the discriminator model to adjust the model parameters of the discriminator model; and when the current separator model loss is calculated, fixing the model parameters of the discriminator model and back-propagating the current separator model loss to the separator model to adjust the model parameters of the separator model.
In implementation, after the value of the loss function (the loss) of the separator model or the discriminator model is calculated, the corresponding model may back-propagate the loss to update its weight coefficients. Specifically, back-propagating the loss of the separator model updates the weight coefficients of the separator model once, and back-propagating the loss of the discriminator model then updates the weight coefficients of the discriminator model once. The separator model and the discriminator model are thus trained alternately and cyclically until the models converge.
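The alternating "fix one model, update the other" schedule described above can be sketched as follows. `ToyModel`, its single scalar weight, and the manual gradient step are illustrative stand-ins (not the application's networks); the point is only the freeze/update alternation.

```python
class ToyModel:
    """Stand-in for a network: one scalar weight, a manual gradient step."""
    def __init__(self, w):
        self.w = w
        self.frozen = False

    def step(self, grad, lr=0.1):
        # A frozen model keeps its parameters fixed during the other
        # model's update, mirroring the alternating scheme in the text.
        if not self.frozen:
            self.w -= lr * grad


def alternate_round(separator, discriminator, s_grad, d_grad):
    # 1) Discriminator update: separator parameters are fixed while the
    #    discriminator loss is back-propagated to the discriminator.
    separator.frozen, discriminator.frozen = True, False
    discriminator.step(d_grad)
    # 2) Separator update: discriminator parameters are fixed while the
    #    separator loss is back-propagated to the separator.
    separator.frozen, discriminator.frozen = False, True
    separator.step(s_grad)
```

In a real implementation, each `step` would be a full optimizer update driven by the corresponding loss; here the gradients are passed in directly to keep the sketch self-contained.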
S400, inputting the mixed signal to be separated into a trained separator model, and extracting a first extracted signal.
It will be appreciated that, through the countermeasure training described above, the separator model learns to output a human voice signal as close as possible to a clean human voice signal. Thus, after training of the separator model is finished, when a mixed music signal is input into it, the separator model can output a first extracted signal realistic enough to pass for the genuine signal.
In one embodiment, step S400 may specifically include steps S410 and S420 (not shown in the figures).
S410, the mixed signal is input into a separator model after data preprocessing, and a mask matrix is obtained.
S420, performing transformation corresponding to data preprocessing on the mask matrix to obtain a first extracted signal.
In one embodiment, whether the separator model has completed training may be determined based on the loss function value of the separator. Correspondingly, before step S400, the signal extraction method can further comprise: calculating a loss difference value between the current separator model loss and the last calculated historical separator model loss, and, when the loss difference value is not higher than a preset difference value, determining that the separator model to be trained is a fully trained separator model.
In practice, when the loss of the separator model tends to be stable, i.e., the loss of the separator model essentially no longer decreases, the separator model has a good separation effect; training of the separator can be stopped at this point, and the separator model is used as the trained separator model to extract the finally required output signal from the mixed signal to be separated. Of course, it is also possible to determine when the separator model has completed training based on other iteration termination strategies. For example, training may be terminated when the number of training iterations of the separator model reaches a preset number, or when the training duration of the separator reaches a preset duration, etc., which is not limited by the present application.
It is worth mentioning that, since the separator model and the discriminator model are continuously subjected to alternating countermeasure training, they promote and constrain each other. As a result, the loss of the separator model may decrease and then increase during the countermeasure training, and may fluctuate over a range. Thus, the statement that the loss of the separator model essentially no longer decreases should be understood to mean that the decrease is small, or that the loss fluctuates within a small range after decreasing to a certain value.
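A minimal sketch of the stopping rule discussed above (loss difference against a preset threshold, with an iteration budget as a fallback); the values of `eps` and `max_steps` are illustrative assumptions, not taken from the application.

```python
def training_finished(loss_history, eps=1e-3, max_steps=10000):
    """Decide whether separator training can stop.

    Stops when the loss difference between the current separator model
    loss and the previously calculated loss is not higher than a preset
    difference (eps), or when an iteration budget is exhausted.
    """
    if len(loss_history) >= max_steps:
        return True   # fallback: preset number of training iterations reached
    if len(loss_history) < 2:
        return False  # no previous loss to compare against yet
    return abs(loss_history[-1] - loss_history[-2]) <= eps
```

Because the loss can fluctuate under countermeasure training, a practical variant might compare moving averages of the loss rather than two consecutive values.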
In one embodiment, in order to obtain other signals (which may be referred to as second extracted signals) in the mixed signal to be separated, the first extracted signal may be subtracted from the mixed signal to be separated to extract the second extracted signal, after step S400.
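The subtraction step can be illustrated with synthetic signals (random arrays standing in for real audio):

```python
import numpy as np

rng = np.random.default_rng(0)
vocals = rng.standard_normal(1000)         # stands in for the first extracted signal
accompaniment = rng.standard_normal(1000)  # the remaining component
mixture = vocals + accompaniment           # the mixed signal to be separated

# Second extracted signal = mixed signal minus first extracted signal.
second_extracted = mixture - vocals
```

In practice the separator's estimate of the first signal is imperfect, so the residual obtained this way also carries the estimation error.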
Referring to fig. 2, the present embodiment provides specific steps for countermeasure training of the separator model and the discriminator model. The separator model may be a masking network model, and the discriminator model may be that of a GAN (Generative Adversarial Network). Steps S111 to S114 are the specific processing of step S110, steps S121 to S123 are the specific processing of step S120, and steps S201 to S202 are the specific processing of step S200.
S111, inputting the mixed music signal xmix into the separator model, and performing short-time Fourier transform analysis on the mixed music signal xmix to obtain a complex spectrogram of the mixed signal.
In one embodiment, the separator model may specifically employ a U-shaped residual network model. Referring to fig. 3, the separator model may include several encoding modules (Encoder Block), several decoding modules (Decoder Block), and a middle layer for connection, which may employ a convolution module (Conv Block). For example, the embodiment shown in fig. 3 employs 4 Encoder modules and 4 Decoder modules. Each Encoder module consists of a residual convolution and an average pooling layer, each Decoder module consists of a transposed convolution and a residual convolution, and the middle Conv module consists of a residual convolution. In fig. 3, the symbol "->" indicates the channel counts of each layer module. For example, "1->32" indicates that the number of input channels is 1 and the number of output channels is 32, and "32->1" indicates that the number of input channels is 32 and the number of output channels is 1.
In this embodiment, the residual convolution structure used in the Encoder, Decoder, and intermediate Conv modules can be seen in fig. 4; it includes a normalization layer (Batch Norm), an activation function layer (leaky_relu), and a convolution layer (Conv2d). The convolution kernels of the residual convolutions all have a size of 3×3. In addition, the average pooling layer of the Encoder modules uses a kernel of size 2×2, and the transposed convolution layer of the Decoder modules uses a kernel of size 3×3.
It should be noted that the respective modules may also use other kernel sizes, such as 1×1, 5×5, 7×7, etc., which is not limited by the present application.
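The channel/resolution bookkeeping of the U-shaped network in fig. 3 can be sketched as below. Note that the full channel progression (1->32->64->128->256) is an assumption extrapolated from the "1->32" and "32->1" examples in the text, and the middle Conv module (which keeps shapes unchanged) is omitted.

```python
def unet_shape_flow(h, w, channels=(1, 32, 64, 128, 256)):
    """Trace (channels, height, width) through 4 Encoder and 4 Decoder modules."""
    shapes = []
    # Encoder path: residual convolution changes the channel count,
    # then 2x2 average pooling halves the spatial resolution.
    for c_in, c_out in zip(channels, channels[1:]):
        shapes.append((c_out, h, w))
        h, w = h // 2, w // 2
    # Decoder path: transposed convolution doubles the resolution,
    # and the channel count shrinks back toward the input's.
    for c_out in reversed(channels[:-1]):
        h, w = h * 2, w * 2
        shapes.append((c_out, h, w))
    return shapes
```

Tracing a 64×64 spectrogram patch through this flow ends at a single-channel map of the original resolution, matching the "32->1" output example.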
S112, obtaining the amplitude spectrum Xmix and the original phase φmix of the mixed signal from the complex spectrum.
In practice, the complex spectrum of a signal (complex spectrum for short) comprises two parts, namely a magnitude spectrum and a phase spectrum. The amplitude spectrum is understood as a relation curve between amplitude and frequency obtained after amplitude is taken from the complex spectrum, and the phase spectrum is understood as a relation curve between phase and frequency obtained after phase is taken from the complex spectrum.
S113, inputting the magnitude spectrum Xmix into a separator in a separator model.
S114, the separator outputs a mask matrix M.
S121, multiplying the mask matrix M element-wise with the amplitude spectrum Xmix of the original signal to obtain an output amplitude spectrum, namely X1 = Xmix ⊙ M.
S122, using the complex spectrum composed of the output amplitude spectrum X1 and the original phase φmix as the output complex spectrum of the separator.
S123, performing short-time Fourier inverse transformation on the output complex spectrum to obtain a first output signal x1 of the separator.
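Steps S111–S123 can be sketched as a single masking pass. As a simplification, a plain FFT stands in for the short-time Fourier transform (no framing, windowing, or overlap-add), so this illustrates only the magnitude-mask-phase recombination, not the application's actual front end.

```python
import numpy as np

def masked_reconstruction(x_mix, mask):
    spec = np.fft.rfft(x_mix)                    # complex spectrum (cf. S111)
    magnitude = np.abs(spec)                     # amplitude spectrum Xmix (cf. S112)
    phase = np.angle(spec)                       # original phase (cf. S112)
    out_mag = magnitude * mask                   # X1 = Xmix ⊙ M (cf. S121)
    out_spec = out_mag * np.exp(1j * phase)      # recombine with original phase (cf. S122)
    return np.fft.irfft(out_spec, n=len(x_mix))  # inverse transform (cf. S123)

# An all-ones mask is all-pass: the signal is reconstructed unchanged.
t = np.arange(256)
x = np.sin(2 * np.pi * 5 * t / 256)
y = masked_reconstruction(x, np.ones(129))   # rfft of 256 samples has 129 bins
```

In the application's pipeline, the mask would be produced by the separator network from the amplitude spectrum rather than supplied by hand.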
S201, the output signal x1 of the separator and the target signal x0 are input to the discriminator.
In one embodiment, the discriminator model may specifically employ the HiFi-GAN discriminator model. The HiFi-GAN discriminator model may include two discriminators: a multi-period discriminator (Multi-Period Discriminator, MPD), which can be used to identify signals of different periods, and a multi-scale discriminator (Multi-Scale Discriminator, MSD), which can be used to cope with very long data.
In practice, the multi-period discriminator may include a plurality of sub-discriminators, each having a different period p. Each sub-discriminator converts the audio into 2-D (two-dimensional) data according to the value of the period p, and then processes the data using a CNN (Convolutional Neural Network) with a convolution kernel of k×1, that is, a CNN whose convolution kernel has height k and width 1. By using different periods p, patterns of signals with different periodicities can be captured. In addition, the multi-scale discriminator may also include a plurality of sub-discriminators; each sub-discriminator processes the signal with average pooling of a different length, and then discriminates the signal input into the discriminator model using a CNN.
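The period-p input transform of the multi-period discriminator can be sketched as follows; the zero-padding choice is our assumption (implementations may use reflection padding), and the CNN itself is omitted.

```python
import numpy as np

def reshape_for_period(audio, p):
    """Fold 1-D audio into 2-D data of shape (n_frames, p), so that a
    CNN with a k×1 kernel spans samples that are exactly p steps apart."""
    pad = (-len(audio)) % p          # pad up to a multiple of the period p
    padded = np.pad(audio, (0, pad))
    return padded.reshape(-1, p)
```

Down a single column of the folded array, consecutive entries are p samples apart in the original signal, which is what lets a k×1 kernel pick up period-p structure.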
S202, the discriminator discriminates the input signal.
In practice, the discriminator model may discriminate the output signals of the separator from two different angles, based on the multi-period discriminator and the multi-scale discriminator respectively. A pure human voice signal can be discriminated as true, and an unclean human voice signal can be discriminated as false.
S300, based on the current separator model loss and the current discriminator model loss, alternating countermeasure training is carried out on the separator model to be trained and the discriminator model to be trained.
In practice, the loss of the separator model may be composed of three parts: a generative adversarial loss (GAN Loss), a mel-spectrogram loss (Mel-Spectrogram Loss), and a feature matching loss (Feature Match Loss). Specifically, the above losses can be calculated using the following loss functions, respectively:

$$\mathcal{L}_{GAN}(S;D) = \mathbb{E}\left[\left(D(S(x_{mix})) - 1\right)^2\right]$$

$$\mathcal{L}_{mel}(S) = \mathbb{E}\left[\left\| \chi(x_0) - \chi(S(x_{mix})) \right\|_1\right]$$

$$\mathcal{L}_{fm}(S;D) = \mathbb{E}\left[\sum_{i=1}^{T} \frac{1}{N_i} \left\| D^i(x_0) - D^i(S(x_{mix})) \right\|_1\right]$$

where S denotes the separator, D denotes the discriminator, x_0 denotes the target signal of the separator, x_mix denotes the input signal of the separator, χ(·) denotes the operation of converting a time-domain signal into a mel-spectrogram, T denotes the number of layers from which features are extracted in the discriminator, D^i denotes the features in the i-th layer of the discriminator, and N_i denotes the number of features in the i-th layer of the discriminator.
In practice, the loss of the discriminator model may be calculated using the following loss function:

$$\mathcal{L}(D) = \mathbb{E}\left[\left(D(x_0) - 1\right)^2 + \left(D(S(x_{mix}))\right)^2\right]$$
Based on the above loss functions, with the discriminator further decomposed into sub-discriminators, the overall losses of the discriminator model and the separator model can be expressed as:

$$\mathcal{L}(D) = \sum_{j} \mathbb{E}\left[\left(D_j(x_0) - 1\right)^2 + \left(D_j(S(x_{mix}))\right)^2\right]$$

$$\mathcal{L}(S) = \sum_{j} \left[\mathcal{L}_{GAN}(S;D_j) + \lambda_{fm}\,\mathcal{L}_{fm}(S;D_j)\right] + \lambda_{mel}\,\mathcal{L}_{mel}(S)$$

where D_j denotes the j-th sub-discriminator among the multi-period discriminator and the multi-scale discriminator, and λ_mel and λ_fm denote adjustable hyper-parameters.
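The per-term losses named above (GAN loss, mel-spectrogram loss, feature matching loss, and the discriminator loss) can be sketched numerically. The least-squares GAN form used by HiFi-GAN is assumed here, consistent with the symbols defined in the text; discriminator scores and features are passed in as plain arrays, so no actual networks are involved.

```python
import numpy as np

def gan_loss_separator(d_fake):
    # L_GAN(S): push discriminator scores on separator output toward 1 ("true")
    return np.mean((d_fake - 1.0) ** 2)

def mel_loss(mel_target, mel_output):
    # L_mel(S): L1 distance between mel-spectrograms χ(x0) and χ(S(xmix))
    return np.mean(np.abs(mel_target - mel_output))

def feature_match_loss(feats_real, feats_fake):
    # L_fm(S): per-layer L1 distance between discriminator features;
    # np.mean over layer i divides by N_i, the number of features in layer i.
    return sum(np.mean(np.abs(r - f)) for r, f in zip(feats_real, feats_fake))

def discriminator_loss(d_real, d_fake):
    # L(D): real target scored toward 1, separator output scored toward 0
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)
```

Each term is zero exactly when the corresponding goal is met: the separator's GAN loss vanishes when the discriminator scores its output as 1, and the discriminator loss vanishes when it scores the target 1 and the separator output 0.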
In practice, for ease of understanding, the process of alternating countermeasure training of the separator model and the discriminator model to be trained may specifically be as follows:
When a mixed music signal obtained from the training data set is input into the separator model for the first time and a human voice signal is output, the human voice signal output by the separator model and the target signal (namely the real pure human voice signal) can be directly input into the discriminator model to train the initial discriminator model. The discriminator model is required to discriminate the human voice signal as false and the target signal as true.
Then, a new mixed signal is obtained from the training data set and input into the separator model; the human voice signal output by the separator model and the target signal corresponding to the mixed signal are discriminated by the discriminator model, and the loss $\mathcal{L}(S)$ of the separator model is calculated. A back-propagation algorithm can then be used to propagate $\mathcal{L}(S)$ from the output layer to the input layer of the separator model and update the parameters of the separator model, so that the human voice signal output by the separator model can be discriminated as true by the discriminator model.
At the next training step, the loss $\mathcal{L}(D)$ of the discriminator model can be calculated; a back-propagation algorithm is then used to propagate $\mathcal{L}(D)$ from the output layer to the input layer of the discriminator model and update the parameters of the discriminator model, so that it can distinguish the human voice signal output by the separator model from the target signal.
In this way, the separator model and the discriminator model can be subjected to repeated alternating iterative training, continuously updating their respective model parameters so that their loss functions converge, and an accurate separator model structure is constructed. Then, the actual music signal xmix to be separated is input into the trained separator model, and the final human voice signal x1 can be extracted. The final human voice signal x1 may then be subtracted from the mixed music signal xmix to obtain the final accompaniment signal x2.
Of course, in another embodiment, the first output signal may be an accompaniment signal, and the target signal corresponding to the first output signal may be a pure accompaniment signal obtained from the training database. Alternatively, the first output signal may be the signal of a certain musical instrument among the accompaniment signals, and the target signal corresponding to the first output signal may be a clean instrument signal obtained from the training database. That is, the kind of the first output signal to be extracted is determined by the target signal; after the separator model and the discriminator model are subjected to the alternating countermeasure training, the mixed signal actually to be separated is input into the trained separator model, and the finally required first extracted signal can be extracted. Then, the first extracted signal extracted by the separator model is subtracted from the mixed music signal to obtain the other signal that is finally required. For the specific implementation process, reference may be made to the above embodiments, which are not repeated here.
It should be noted that, in addition to the specific case where the separator model uses the U-shaped residual network model and the discriminator model uses the HiFi-GAN discriminator model, the separator model and the discriminator model may also use other network models to form the generative adversarial network, which is not limited by the present application.
In one embodiment, step S410 after separator model training may include steps S411-S414, and step S420 may include steps S421-S423.
S411, inputting the mixed music signal xmix into the separator model, and performing short-time Fourier transform analysis on the mixed music signal xmix to obtain a complex spectrogram of the mixed signal.
S412, obtaining the amplitude spectrum Xmix and the original phase φmix of the mixed signal from the complex spectrum.
S413, inputting the magnitude spectrum Xmix into a separator in the separator model.
S414, the separator outputs a mask matrix M.
S421, multiplying the mask matrix M element-wise with the amplitude spectrum Xmix of the original signal to obtain an output amplitude spectrum, namely X1 = Xmix ⊙ M.
S422, using the complex spectrum composed of the output amplitude spectrum X1 and the original phase φmix as the output complex spectrum of the separator.
S423, the first extracted signal x1 of the separator is obtained by performing short-time Fourier inverse transformation on the output complex spectrum.
It should be noted that steps S411-S414 are similar to steps S111-S114, and steps S421-S423 are similar to steps S121-S123, which are not described herein.
The application combines the masking network model with a GAN: a discriminator model is added on the basis of the separator model, and loss functions are used to conduct countermeasure training on the separator model and the discriminator model, so that the two models promote each other and the separation performance of the separator model is improved, allowing the desired output signal to be extracted based on the trained separator model. It can be understood that, when the present application is employed to separate the voice and accompaniment of a mixed music signal, based on the alternating countermeasure training of the separator model and the discriminator model, the two models promote each other through their game, and the separator model after completion of the countermeasure training can output a signal as close as possible to the target signal (e.g., a pure human voice signal), making it difficult for the discriminator model to determine the signal output by the separator model as false. Therefore, after a mixed music signal is input into the separator model that has undergone countermeasure training, a purer human voice signal can be output, and the separation effect of the signal extraction model on voice and accompaniment is improved.
Based on the same technical concept, the embodiment of the present application further provides a signal extraction device, referring to fig. 5, the signal extraction device may include:
a separation module, configured to input a mixed signal to be separated into a separator model to extract a first extracted signal;
The loss calculation module is used for inputting the first output signal and the target signal into the discriminator model for discrimination and calculating the current separator model loss and the current discriminator model loss;
And the countermeasure training module is used for alternately performing countermeasure training on the separator model and the discriminator model based on the current separator model loss and the current discriminator model loss.
In one embodiment, the separator model is a U-shaped residual network model, and the discriminator model is a HiFi-GAN discriminator model.
In one embodiment, the countermeasure training module is specifically configured to:
When the loss of the current discriminator model is calculated, fixing the model parameters of the separator model, and reversely transmitting the loss of the current discriminator model to the discriminator model so as to adjust the model parameters of the discriminator model;
when the current separator model loss is calculated, model parameters of the discriminator model are fixed and the current separator model loss is back-propagated to the separator model to adjust the model parameters of the separator model.
In one embodiment, the separation module is further to:
The first extracted signal is subtracted from the mixed signal to be separated, and the second extracted signal is extracted.
In one embodiment, the separation module is specifically configured to:
performing data preprocessing on the mixed signal to be separated and inputting it into the separator model to obtain a mask matrix;
and transforming the mask matrix corresponding to the data preprocessing to obtain a first extracted signal.
In one embodiment, the separation module is specifically configured to:
performing short-time Fourier transform on the mixed signal to be separated to obtain a complex spectrogram of the mixed signal;
Calculating an amplitude spectrum and an original phase of the mixed signal to be separated based on a complex spectrum diagram of the mixed signal to be separated;
And inputting the amplitude spectrum of the mixed signal to be separated into a separator model to obtain a mask matrix.
In one embodiment, the separation module is further specifically configured to:
performing point multiplication on the magnitude spectrum of the mixed signal to be separated and the mask matrix to obtain an output magnitude spectrum;
combining the output amplitude spectrum with the original phase to obtain an output complex spectrum;
And carrying out short-time Fourier inverse transformation on the output complex spectrum to obtain a first extracted signal.
In one embodiment, the loss calculation module is further to:
Calculating a loss difference value between the current separator model loss and the last calculated historical separator model loss;
and when the loss difference value is not higher than the preset difference value, determining that the separator model finishes the countermeasure training.
It should be noted that, in the signal extraction apparatus provided in the foregoing embodiment, when the signal extraction method is executed, only the division of the foregoing functional modules is used as an example, in practical application, the foregoing functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the signal extraction device and the signal extraction method provided in the foregoing embodiments belong to the same concept, and the implementation principle and the technical effects to be achieved may refer to the method embodiments, which are not described herein.
Based on the same technical concept, the embodiment of the application also provides an electronic device, referring to fig. 6, which comprises a processor and a memory, wherein the memory stores a computer program, and the computer program is suitable for being loaded by the processor and executing the signal extraction method according to any one of the embodiments. The implementation principle of the electronic device for signal extraction and the technical effects to be achieved can refer to the method embodiment, and the application is not described herein.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or may be implemented by hardware. Based on such understanding, the above technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, and the software product of the method for extracting the human voice and accompaniment may be stored in a computer readable storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including storing several instructions for causing an electronic device to perform the method for extracting the human voice and accompaniment described in the various embodiments or some parts of the embodiments.
The foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the application are intended to be included within the scope of the application.