CN111667805B - Accompaniment music extraction method, accompaniment music extraction device, accompaniment music extraction equipment and accompaniment music extraction medium - Google Patents


Info

Publication number
CN111667805B
CN111667805B · CN201910165261.1A · CN201910165261A
Authority
CN
China
Prior art keywords
accompaniment
amplitude spectrum
music
right channel
left channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910165261.1A
Other languages
Chinese (zh)
Other versions
CN111667805A (en)
Inventor
柯川
朱明清
彭艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910165261.1A
Publication of CN111667805A
Application granted
Publication of CN111667805B
Legal status: Active
Anticipated expiration

Abstract

The application belongs to the technical field of music data processing and discloses an accompaniment music extraction method, apparatus, device, and medium. The disclosed method comprises: transforming audio music to obtain a left channel phase spectrum, a right channel phase spectrum, a left channel amplitude spectrum, and a right channel amplitude spectrum; inputting the left channel amplitude spectrum and the right channel amplitude spectrum into an accompaniment extraction model to obtain a left channel accompaniment amplitude spectrum mask and a right channel accompaniment amplitude spectrum mask, respectively; obtaining a left channel accompaniment amplitude spectrum based on the left channel amplitude spectrum and the left channel accompaniment amplitude spectrum mask, and a right channel accompaniment amplitude spectrum based on the right channel amplitude spectrum and the right channel accompaniment amplitude spectrum mask; and determining stereo accompaniment music of the audio music based on the left channel accompaniment amplitude spectrum, the right channel accompaniment amplitude spectrum, the left channel phase spectrum, and the right channel phase spectrum. In this way, high-quality stereo accompaniment music can be obtained.

Description

Accompaniment music extraction method, accompaniment music extraction device, accompaniment music extraction equipment and accompaniment music extraction medium
Technical Field
The present application relates to the field of music data processing technologies, and in particular, to a method, an apparatus, a device, and a medium for extracting accompaniment music.
Background
With rising living standards, music has become part of people's daily lives and an important form of leisure and entertainment. Public entertainment venues (e.g., KTV) and private audiovisual systems often require accompaniment music for a large number of songs.
In the prior art, accompaniment music is generally extracted in one of two ways:
The first way: the accompaniment is extracted with a conventional method that mainly exploits the fact that, in most songs, the human voice has similar intensity in the left and right channels. However, accompaniment extracted in this way tends to contain residual vocals and has poor sound quality.
The second way: the accompaniment is extracted based on a deep neural network. However, such methods mainly extract monaural accompaniment from a song, cannot produce stereo accompaniment, and still leave strong vocal residue.
In view of the foregoing, there is a need for a technical solution that can extract accompaniment music with high quality.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a medium for extracting accompaniment music, which are used for improving the quality of the extracted accompaniment music when the accompaniment music of audio music is extracted.
In one aspect, there is provided a method of extracting accompaniment music, including:
obtaining a left channel phase spectrum, a right channel phase spectrum, a left channel amplitude spectrum and a right channel amplitude spectrum of the audio music;
obtaining, by using a pre-trained accompaniment extraction model, a left channel accompaniment amplitude spectrum mask of the left channel amplitude spectrum and a right channel accompaniment amplitude spectrum mask of the right channel amplitude spectrum, respectively, wherein the accompaniment extraction model is a deep neural network based on an attention mechanism and is trained on music sample pair data, each music sample pair comprising an audio music sample and its accompaniment music sample;
obtaining a left channel accompaniment magnitude spectrum based on the left channel magnitude spectrum and the left channel accompaniment magnitude spectrum mask, and obtaining a right channel accompaniment magnitude spectrum based on the right channel magnitude spectrum and the right channel accompaniment magnitude spectrum mask;
stereo accompaniment music of the audio music is determined based on the left channel accompaniment amplitude spectrum, the right channel accompaniment amplitude spectrum, the left channel phase spectrum, and the right channel phase spectrum.
In one aspect, there is provided an accompaniment music extracting apparatus comprising:
a first obtaining unit configured to obtain a left channel phase spectrum, a right channel phase spectrum, a left channel magnitude spectrum, and a right channel magnitude spectrum of audio music;
an extraction unit, configured to obtain, by using a pre-trained accompaniment extraction model, a left channel accompaniment amplitude spectrum mask of the left channel amplitude spectrum and a right channel accompaniment amplitude spectrum mask of the right channel amplitude spectrum, respectively, wherein the accompaniment extraction model is a deep neural network based on an attention mechanism and is trained on music sample pair data, each music sample pair comprising an audio music sample and its accompaniment music sample;
a second obtaining unit for obtaining a left channel accompaniment amplitude spectrum based on the left channel amplitude spectrum and the left channel accompaniment amplitude spectrum mask, and obtaining a right channel accompaniment amplitude spectrum based on the right channel amplitude spectrum and the right channel accompaniment amplitude spectrum mask;
a determining unit for determining stereo accompaniment music of the audio music based on the left channel accompaniment amplitude spectrum, the right channel accompaniment amplitude spectrum, the left channel phase spectrum and the right channel phase spectrum.
In one aspect, there is provided a control apparatus including:
at least one memory for storing program instructions;
at least one processor for calling the program instructions stored in the memory and executing the steps of any of the above-mentioned accompaniment music extraction methods according to the obtained program instructions.
In one aspect, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any one of the accompaniment music extraction methods described above.
In the accompaniment music extraction method, apparatus, device and medium provided by the embodiments of the present application, the audio music is transformed to obtain a left channel phase spectrum, a right channel phase spectrum, a left channel amplitude spectrum and a right channel amplitude spectrum; the left channel amplitude spectrum and the right channel amplitude spectrum are input into the accompaniment extraction model to obtain a left channel accompaniment amplitude spectrum mask and a right channel accompaniment amplitude spectrum mask, respectively; a left channel accompaniment amplitude spectrum is obtained based on the left channel amplitude spectrum and the left channel accompaniment amplitude spectrum mask, a right channel accompaniment amplitude spectrum is obtained based on the right channel amplitude spectrum and the right channel accompaniment amplitude spectrum mask, and the stereo accompaniment music of the audio music is determined based on the left channel accompaniment amplitude spectrum, the right channel accompaniment amplitude spectrum, the left channel phase spectrum and the right channel phase spectrum. In this way, a music processing model built on the attention mechanism and a deep neural network is used to extract the accompaniment, which improves the quality of the extracted accompaniment; and because the accompaniment is extracted from the left channel amplitude spectrum and the right channel amplitude spectrum separately, stereo accompaniment music can be obtained.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
fig. 1 is a schematic diagram of an accompaniment music extraction system according to an embodiment of the present application;
FIG. 2a is a flowchart of a training method of accompaniment extraction models according to an embodiment of the present application;
FIG. 2b is a flow chart of an embodiment of data screening according to the present application;
FIG. 3 is a schematic diagram of an accompaniment extraction model structure according to an embodiment of the present application;
fig. 4a is a schematic flow chart of extracting accompaniment music according to an embodiment of the present application;
fig. 4b is a flowchart showing a detailed implementation of a method for extracting accompaniment music according to an embodiment of the present application;
fig. 5 is a flowchart showing an embodiment of evaluation of an accompaniment music extraction method according to an embodiment of the present application;
fig. 6 is a schematic diagram of an accompaniment music extraction structure according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a control device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantageous effects of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
First, some terms related to the embodiments of the present application will be described so as to be easily understood by those skilled in the art.
1. Amplitude spectrum and phase spectrum: a frequency-domain signal is obtained by performing spectrum analysis on a signal through the Fourier transform. The frequency-domain signal is complex-valued; the amplitude spectrum is obtained from the magnitude of the frequency-domain signal, and the phase spectrum from its phase (a code sketch follows this list of terms).
2. Accompaniment amplitude spectrum mask: representing the ratio between the amplitude spectrum of accompaniment in audio music and the amplitude spectrum of audio music.
3. Upsampling (Upsampling): the original image is enlarged. An interpolation method is mainly adopted, namely, new elements are inserted between pixel points by adopting a proper interpolation algorithm on the basis of pixels of the original image.
4. Deep neural network: a neural network with at least one hidden layer. Like a shallow neural network, a deep neural network can model complex nonlinear systems, but the extra layers provide higher levels of abstraction and thus increase the model's capability. Deep neural networks are usually feed-forward networks, but research on language modeling and other tasks has extended them to recurrent neural networks.
5. Attention mechanism (Attention): an attention model used in the field of artificial neural networks. Its basic assumption is that humans do not process an entire signal at once, but selectively focus on the salient parts of the signal when recognizing different signals. The attention mechanism assigns weights to multiple sources and combines them according to those weights.
6. Accompaniment extraction model: a deep neural network based on the attention mechanism, trained on music sample pair data, where each music sample pair includes an audio music sample and its accompaniment music sample.
7. Speech spectrogram: the method is a spectrogram obtained by processing a received time domain signal, wherein the abscissa of the spectrogram is time, the ordinate is frequency, and the coordinate point value is voice data energy.
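To make terms 1 and 2 concrete, the following is a minimal Python sketch, assuming the librosa library is available; the file name ("song.wav") and the STFT parameters are illustrative assumptions rather than values from the patent.

```python
# STFT analysis of a stereo file into magnitude and phase spectra (term 1),
# and the meaning of an accompaniment magnitude-spectrum mask (term 2).
import numpy as np
import librosa

audio, sr = librosa.load("song.wav", sr=None, mono=False)   # stereo waveform, shape (2, num_samples)

def analyze(channel, n_fft=4096, hop_length=1024):
    """Return (magnitude spectrum, phase spectrum) of one channel."""
    spec = librosa.stft(channel, n_fft=n_fft, hop_length=hop_length)   # complex spectrogram
    return np.abs(spec), np.angle(spec)

left_mag, left_phase = analyze(audio[0])
right_mag, right_phase = analyze(audio[1])

# Term 2: the accompaniment magnitude-spectrum mask is the element-wise ratio
#   mask = accompaniment magnitude / mixture magnitude,
# so multiplying the mixture magnitude spectrum by the mask recovers the
# accompaniment magnitude spectrum.
```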
In addition, the term "and/or" herein merely describes an association between objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" herein generally indicates an "or" relationship between the associated objects unless otherwise specified.
The following describes the design concept of the embodiment of the present application.
Conventional accompaniment music extraction methods include the azimuth discrimination and resynthesis (ADRess) method and accompaniment extraction methods based on deep neural networks.
The ADRess method exploits the fact that, in most songs, the human voice has similar intensity in the left and right channels: it extracts the sound located in the middle of the stereo field to separate the singing voice and thereby obtain the song's accompaniment. However, because the vocals in some songs are not perfectly aligned between the left and right channels, the accompaniment extracted in this way often contains vocal residue, and its sound quality is low.
Accompaniment extraction methods based on deep neural networks mainly extract accompaniment from single-channel songs; they cannot produce stereo accompaniment and still leave strong vocal residue.
The applicant has analyzed the conventional art and found that it offers no technical solution for extracting high-quality stereo accompaniment music; a technical solution is therefore needed that can obtain high-quality stereo accompaniment when extracting the accompaniment of audio music.
In view of this, the applicant considered that an accompaniment extraction model can be built on an attention-based deep neural network and, using the trained model, the accompaniment can be extracted for the left and right channels separately, so that high-quality stereo accompaniment music can be obtained.
In view of the above analysis and consideration, in the embodiment of the present application, an extraction scheme of accompaniment music is provided, in the scheme, audio analysis is performed on an audio signal of audio music to obtain a left channel phase spectrum, a right channel phase spectrum, a left channel amplitude spectrum and a right channel amplitude spectrum, and then a left channel accompaniment amplitude spectrum mask of the left channel amplitude spectrum and a right channel accompaniment amplitude spectrum mask of the right channel amplitude spectrum are respectively obtained by adopting an accompaniment extraction model; and further, according to the left channel accompaniment amplitude spectrum mask, the right channel accompaniment amplitude spectrum mask, the left channel phase spectrum, the right channel phase spectrum, the left channel amplitude spectrum and the right channel amplitude spectrum, stereo accompaniment music is obtained. In this way, the music processing model established based on the attention mechanism and the deep neural network is adopted to extract the accompaniment music, so that the quality of the extracted accompaniment music is improved, and further, the accompaniment music is respectively extracted according to the left channel amplitude spectrum and the right channel amplitude spectrum, so that the stereo accompaniment music can be obtained.
The accompaniment music extraction technology provided by the embodiments of the present application can be applied to any scenario that requires separating accompaniment from vocals. For example, the accompaniment libraries of KTV and karaoke applications can be supplemented with extracted accompaniment tracks, giving users more accompaniments to choose from. For another example, by extracting the accompaniment, the vocals of a song can be obtained in reverse, and a singer can then be identified from the vocals.
In order to further explain the technical solution provided by the embodiments of the present application, details are described below with reference to the accompanying drawings and the detailed description. Although the embodiments of the present application provide the method steps shown in the following embodiments or figures, the methods may include more or fewer steps based on routine or non-inventive work. For steps that have no logically necessary causal relationship, the execution order is not limited to that provided by the embodiments of the present application. When executed in an actual process or by an apparatus, the methods may be performed sequentially or in parallel as shown in the embodiments or figures.
Referring to fig. 1, a schematic diagram of an accompaniment music extraction system is shown. In the embodiment of the application, the accompaniment music extraction system mainly comprises two parts, namely offline model training and online model application. The offline model training further comprises two parts of data screening and accompaniment extraction model training. The online model is applied to extract accompaniment music in the audio music through the trained accompaniment extraction model. In practical applications, the data screening step may not be performed.
Referring to fig. 2a, a flowchart of an implementation of a training method of an accompaniment extraction model according to the present application is shown. The method comprises the following specific processes:
step 200: the control device obtains the pair of music samples data.
Specifically, the control device acquires initial music sample pair data, each of which includes an audio music sample and a corresponding accompaniment sample. The audio music samples are typically songs that contain human voice and accompaniment.
Alternatively, the initial pair of music samples may be KTV accompaniment song data obtained through an accompaniment music library of KTV.
Optionally, the control device may directly use the initial music sample pair data as the music sample pair data, and may also screen the initial music sample pair data, so as to obtain screened music sample pair data.
When screening the initial music sample pair data, the control device may proceed as follows:
For each piece of initial music sample pair data, determine the length difference and the cosine similarity between the audio music sample and the accompaniment sample it contains, and then select, from all initial music sample pair data, the pairs whose length difference is zero and whose cosine similarity is below a preset similarity threshold as the music sample pair data.
Here, the length difference is the difference in time length between the audio music sample and the accompaniment sample, and the cosine similarity is determined from the similarity between the data of the audio music sample and the data of the accompaniment sample.
The reason is that if the time lengths of the audio music sample and the accompaniment sample differ, the two do not match exactly and cannot be used as training data; and if the cosine similarity between them is too high, the audio music sample is probably pure accompaniment containing no vocals and cannot be used as training data for an accompaniment extraction model.
Referring to FIG. 2b, a flowchart of the data screening implementation is shown. In one embodiment, the control device may screen the initial music sample pair data in the manner shown in FIG. 2b. Specifically, the control device performs the following steps for each initial sample pair:
S2000: Determine the length difference between the time length of the audio music sample and the time length of the accompaniment sample in the initial music sample pair data.
S2001: Judge whether the two lengths are consistent (i.e., whether the length difference is zero); if so, execute S2002, otherwise execute S2005.
S2002: cosine similarity between the audio music sample and the accompaniment sample is determined.
S2003: and judging whether the acquired cosine similarity is lower than a preset similarity threshold value, if so, executing S2004, otherwise, executing S2005.
S2004: the initial music sample pair data is determined as music sample pair data.
S2005: the initial pair of music samples is discarded.
In this way, massive initial music sample pair data can be obtained through channels such as KTV and screened according to its length difference and cosine similarity, yielding the music sample pair data usable for model training; a code sketch of this screening follows.
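The sketch below outlines the screening in S2000-S2005, assuming each sample pair is represented as a dictionary holding the song and accompaniment waveforms at the same sample rate; the similarity threshold value is an illustrative assumption, since the text does not give a concrete number.

```python
# Screening initial music sample pair data by length difference and cosine similarity.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two equal-length waveforms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def screen_pairs(initial_pairs, sim_threshold=0.999):
    kept = []
    for pair in initial_pairs:
        song, acc = pair["song"], pair["accompaniment"]
        # S2000/S2001: keep only pairs whose lengths match exactly (length difference of zero).
        if len(song) != len(acc):
            continue                       # S2005: discard
        # S2002/S2003: discard pairs that are too similar, i.e. the "song" is
        # likely pure accompaniment containing no vocals.
        if cosine_similarity(song, acc) >= sim_threshold:
            continue                       # S2005: discard
        kept.append(pair)                  # S2004: usable music sample pair data
    return kept
```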
Step 201: the control device obtains a left channel amplitude spectrum and a right channel amplitude spectrum of the audio frequency music sample in the data of each music sample pair respectively, and obtains an accompaniment amplitude spectrum sample of the corresponding accompaniment music sample.
Specifically, the control device performs the following steps for each of the music samples for the audio music sample and the accompaniment music sample in the data, respectively:
in one aspect, an audio signal of an audio music sample is extracted and subjected to audio analysis by Short-time fourier transform (Short-Time Fourier Transform, STFT) to obtain a left channel phase spectrum, a right channel phase spectrum, a left channel magnitude spectrum, and a right channel magnitude spectrum of the audio music sample.
On the other hand, an audio signal of an accompaniment music sample is extracted, and the audio signal is subjected to audio analysis by STFT to obtain an accompaniment amplitude spectrum sample of the accompaniment music sample.
Audio music is an important medium within multimedia. The audible frequency range of an audio signal is approximately 20 Hz-20 kHz; speech is roughly distributed within 300 Hz-4 kHz, while music and other natural sounds span the full range. Sound recorded or reproduced by analog devices is analog audio, which is then digitized into digital audio. Audio analysis takes the digital audio signal as its object and digital signal processing as its means, and extracts a series of characteristics of the signal in the time domain and the frequency domain.
The Fourier transform and sampling of the signal are the most basic techniques used in audio analysis. Audio analysis derives the amplitude, phase and other properties of the components of an audio signal from its frequency structure and distribution, and builds various "spectra" with frequency as the horizontal axis, such as the amplitude spectrum and the phase spectrum. A periodic audio signal corresponds to a discrete spectrum after Fourier series expansion; an aperiodic signal can be regarded as a periodic signal whose period T tends to infinity. As the period tends to infinity, the fundamental frequency and the spectral spacing (Δω = 2π/T) tend to zero, so the discrete spectrum becomes continuous; the spectrum of an aperiodic signal is therefore continuous. The STFT is a Fourier-related transform used to determine the frequency and phase of local sections of a time-varying signal.
Step 202: the control equipment establishes an accompaniment extraction model based on a deep neural network of an attention mechanism, trains the accompaniment extraction model on data by adopting each music sample, and obtains a trained accompaniment extraction model.
Specifically, when executing step 202, the following steps are executed for each piece of music sample pair data:
s2021: the depth neural network based on the attention mechanism takes the left channel amplitude spectrum and the right channel amplitude spectrum of the audio music sample as input to obtain a left channel accompaniment amplitude spectrum mask and a right channel accompaniment amplitude spectrum mask of the audio music sample.
Specifically, the Attention mechanism is combined with the deep neural network to establish an accompaniment extraction model, and the accompaniment extraction model is adopted to respectively encode and decode the left channel amplitude spectrum and the right channel amplitude spectrum of the audio music sample, so as to obtain a left channel accompaniment amplitude spectrum mask and a right channel accompaniment amplitude spectrum mask of the audio music sample.
The accompaniment extraction model includes an encoding portion and a decoding portion. The coding part carries out multi-stage convolution processing on the left channel amplitude spectrum and the right channel amplitude spectrum step by step to obtain coding features extracted by convolution of each stage, wherein the coding features comprise left channel coding features and right channel coding features.
The decoding process of the decoding section is as follows:
Using the attention mechanism, the following step is performed for the first-stage attention gate: the coding feature output by the final stage of convolution is used as gating information and applied to the coding feature passed in through a skip connection, yielding the corresponding significant coding feature. The following step is then performed for each of the other attention gates in turn: the feature extracted by the current convolution is used as gating information and applied to the coding feature passed in through a skip connection, yielding the corresponding significant coding feature, where the feature extracted by the current convolution is obtained by concatenating the significant coding feature output by the previous attention gate with the feature extracted by the upsampling stage and then convolving the result. Finally, the left channel accompaniment amplitude spectrum mask of the left channel amplitude spectrum and the right channel accompaniment amplitude spectrum mask of the right channel amplitude spectrum are output.
Optionally, the deep neural network may be a U-Net, in which case the accompaniment extraction model is an Attention U-Net. Referring to fig. 3, a schematic diagram of the accompaniment extraction model architecture is shown.
The encoding portion of the accompaniment extraction model may be configured as follows: it consists of 6 convolution layers. Each convolution uses a 3x3 kernel with a padding of 2 and a ReLU activation function, and each convolution is followed by max pooling with a 2x2 pool size and a stride of 2. The left channel amplitude spectrum and the right channel amplitude spectrum of the audio music sample are fed to the encoding portion as amplitude spectrum slices.
The encoding portion uses 6 convolution layers (Conv2D1-Conv2D6) and 5 max-pooling layers (Max Pooling 1-Max Pooling 5), applying multi-stage convolution and pooling to the left channel amplitude spectrum and the right channel amplitude spectrum stage by stage to obtain the left channel and right channel coding features of each stage. The specific stages are as follows (a code sketch follows this list):
Conv2D1 (convolution layer 1): After convolution, the picture dimension changes from 2048x128x2 to 2048x128x64, where 2048x128x2 contains the left channel magnitude spectrum and the right channel magnitude spectrum.
Max Pooling 1 (pooling layer 1): After pooling, the picture dimension changes from 2048x128x64 to 1024x64x64.
Conv2D 2: after convolution, the picture dimension changes from 1024x64x64 to 1024x64x128.
Max Pooling 2: After pooling, the picture dimension changes from 1024x64x128 to 512x32x128.
Conv2D 3: after convolution, the picture dimension changes from 512x32x128 to 512x32x256.
Max Pooling 3: After pooling, the picture dimension changes from 512x32x256 to 256x16x256.
Conv2D 4: after convolution, the picture dimension changes from 256x16x256 to 256x16x512.
Max Pooling 4: After pooling, the picture dimension changes from 256x16x512 to 128x8x512.
Conv2D 5: after convolution, the picture dimension changes from 128x8x512 to 128x8x512.
Max Pooling 5: After pooling, the picture dimension changes from 128x8x512 to 64x4x512.
Conv2D 6: after convolution, the picture dimension changes from 64x4x512 to 64x4x1024.
The decoding portion may be configured as follows: it comprises 5 attention gate (AG) modules.
For the first stage attention gate, the following steps are performed: the coding feature with the picture dimension of 64x4x1024 output by Conv2D 6 is used as gating information and acts on the coding feature with the picture dimension of 128x8x512 corresponding to Conv2D 5 connected through skip connections, and the significant coding feature with the picture dimension of 128x8x512 is output.
The following step is performed for each of the other attention gates in turn: the feature extracted by the current convolution is used as gating information and applied to the coding feature passed in through a skip connection to obtain the corresponding significant coding feature, where the feature extracted by the current convolution is obtained by concatenating (concatenation) the significant coding feature output by the previous attention gate with the feature extracted by the upsampling stage and then convolving the result.
Here, the convolution layers are used to extract features, max pooling is used to compress the input picture, and upsampling is used to enlarge the image; n denotes the level of a convolution layer and is an integer.
Finally, the feature of the picture dimension 2048x128x2 is determined as an accompaniment magnitude spectrum mask (including a left channel accompaniment magnitude spectrum mask and a right channel accompaniment magnitude spectrum mask).
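The text does not spell out the internal layers of an attention gate; the sketch below follows the standard additive formulation of Attention U-Net as an assumption, gating a skip-connected coding feature with a coarser feature (e.g., the Conv2D6 output gating the Conv2D5 output in the first-stage AG).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    def __init__(self, gate_channels, skip_channels, inter_channels):
        super().__init__()
        self.w_g = nn.Conv2d(gate_channels, inter_channels, kernel_size=1)   # gating-signal projection
        self.w_x = nn.Conv2d(skip_channels, inter_channels, kernel_size=1)   # skip-feature projection
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)               # attention coefficients

    def forward(self, gate, skip):
        # Bring the gating signal to the spatial size of the skip feature,
        # e.g. 64x4 (Conv2D6 output) gated against 128x8 (Conv2D5 output).
        g = F.interpolate(self.w_g(gate), size=skip.shape[2:],
                          mode="bilinear", align_corners=False)
        alpha = torch.sigmoid(self.psi(F.relu(g + self.w_x(skip))))          # per-pixel weights in [0, 1]
        return skip * alpha                                                  # "significant" coding features
```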
S2022: and obtaining the predicted accompaniment amplitude spectrum of the audio music sample according to the left channel accompaniment amplitude spectrum mask and the left channel amplitude spectrum of the audio music sample and the right channel accompaniment amplitude spectrum mask and the right channel amplitude spectrum of the audio music sample.
Specifically, determining the product of the left channel accompaniment amplitude spectrum mask and the left channel amplitude spectrum of the audio music sample as the left channel accompaniment amplitude spectrum; and determining the product of the right channel accompaniment amplitude spectrum mask and the right channel amplitude spectrum of the audio music sample as the right channel accompaniment amplitude spectrum.
Wherein the accompaniment magnitude spectrum mask represents a ratio between a magnitude spectrum of accompaniment in the audio music and a magnitude spectrum of the audio music.
S2023: and determining a loss function value according to the predicted accompaniment magnitude spectrum and the corresponding accompaniment magnitude spectrum sample.
Specifically, the difference between the accompaniment amplitude spectrum of the audio music sample and the corresponding accompaniment amplitude spectrum sample is determined as the loss function value.
Alternatively, the following formula may be used in determining the loss function value:
L(X, Y; θ) = ||f(X, θ) ⊙ X - Y||_{1,1}
where L(X, Y; θ) is the loss function value, X is the amplitude spectrum of the audio music sample, Y is the accompaniment amplitude spectrum sample, θ denotes the model parameters, f(X, θ) is the accompaniment amplitude spectrum mask output by the accompaniment extraction model (so f(X, θ) ⊙ X is the predicted accompaniment amplitude spectrum), ⊙ denotes element-wise multiplication, and ||·||_{1,1} is the entrywise L1 norm (the sum of absolute values).
Further, during model training, the weights of the accompaniment extraction model are iteratively updated on the training data using a gradient descent algorithm. Optionally, the Adam algorithm may be used, with an initial learning rate of 0.0001 and a batch size (batch_size) of 8.
S2024: and adjusting parameters of the accompaniment extraction model according to the loss function value to obtain an adjusted accompaniment extraction model.
Thus, each parameter in the accompaniment extraction model can be adjusted according to the loss function value, and the adjusted accompaniment extraction model is obtained.
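The sketch below shows one training step covering S2023 and S2024, under the assumption that `model` outputs an accompaniment amplitude spectrum mask with the same shape as its input magnitude spectrum; the loss is the entrywise L1 norm from the formula above, and the optimizer settings follow the Adam configuration just mentioned.

```python
import torch

def training_step(model, optimizer, mixture_mag, accompaniment_mag):
    """One parameter update for L(X, Y; θ) = ||f(X, θ) ⊙ X - Y||_{1,1}.

    mixture_mag, accompaniment_mag: tensors of shape (batch, channels, freq, time)."""
    mask = model(mixture_mag)                               # f(X, θ), expected in [0, 1]
    predicted_acc = mask * mixture_mag                      # f(X, θ) ⊙ X
    loss = (predicted_acc - accompaniment_mag).abs().sum()  # entrywise L1 norm (S2023)
    optimizer.zero_grad()
    loss.backward()                                         # gradient descent on the model weights (S2024)
    optimizer.step()
    return loss.item()

# Optimizer settings mentioned above; batches of size 8 would be drawn from the
# screened music sample pair data:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```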
In the embodiment of the present application, model training is described only with this music sample pair data as an example; based on the same principle, other audio music sample pair data may also be used for model training, which is not described again here.
In the embodiment of the present application, the accompaniment extraction model for extracting stereo accompaniment is obtained by training only on music sample pair data comprising audio music samples and their corresponding accompaniment samples. Based on the same principle, a vocal extraction model for extracting the vocals in audio music can be obtained by training on music sample pair data comprising audio music samples and corresponding vocal samples, and an extraction model for extracting both vocals and accompaniment can be obtained by training on music sample data comprising audio music samples together with their corresponding accompaniment samples and vocal samples. These are not described in detail here.
Referring to fig. 4a, a flow chart of extracting accompaniment music is shown. After the accompaniment extraction model is trained, the control apparatus may extract accompaniment music from the audio music through the trained accompaniment extraction model.
The control device receives input audio music and performs STFT transformation on the audio music to obtain a left channel phase spectrum, a right channel phase spectrum, a left channel amplitude spectrum and a right channel amplitude spectrum of the audio music. Then, the control device obtains a left channel accompaniment amplitude spectrum mask of the left channel amplitude spectrum and a right channel accompaniment amplitude spectrum mask of the right channel amplitude spectrum through the accompaniment extraction model, respectively, and obtains a spectrogram of the accompaniment music based on the left channel accompaniment amplitude spectrum, the right channel accompaniment amplitude spectrum, the left channel phase spectrum and the right channel phase spectrum. Finally, the control device performs inverse short-time Fourier transform on the obtained spectrogram to obtain stereo accompaniment music.
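A sketch of this inference flow follows, under the same assumptions as the earlier STFT sketch (librosa available, illustrative STFT parameters); `model` stands for the trained accompaniment extraction model, assumed here to map one channel's magnitude spectrum to a same-shaped mask, and slicing the spectrogram into the fixed-size slices the architecture expects is omitted for brevity.

```python
import numpy as np
import librosa
import torch

def extract_accompaniment(audio, model, n_fft=4096, hop_length=1024):
    """audio: stereo waveform of shape (2, num_samples); returns stereo accompaniment."""
    channels = []
    for ch in range(2):                                      # left channel, then right channel
        spec = librosa.stft(audio[ch], n_fft=n_fft, hop_length=hop_length)
        mag, phase = np.abs(spec), np.angle(spec)
        with torch.no_grad():                                # accompaniment amplitude-spectrum mask
            mask = model(torch.from_numpy(mag).float()[None, None]).squeeze().numpy()
        acc_mag = mask * mag                                 # channel accompaniment amplitude spectrum
        acc_spec = acc_mag * np.exp(1j * phase)              # recombine with the channel phase spectrum
        channels.append(librosa.istft(acc_spec, hop_length=hop_length))   # inverse STFT
    return np.stack(channels)                                # stereo accompaniment music
```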
The above embodiments are described in further detail below with reference to a specific application scenario. Referring to fig. 4b, a detailed implementation flowchart of an accompaniment music extraction method according to the present application is shown. The method comprises the following specific processes:
Step 400: the control device obtains a left channel phase spectrum, a right channel phase spectrum, a left channel magnitude spectrum, and a right channel magnitude spectrum of the audio music.
Specifically, the control device extracts an audio signal of the audio music, and performs audio analysis on the audio signal through the STFT to obtain a left channel phase spectrum, a right channel phase spectrum, a left channel amplitude spectrum and a right channel amplitude spectrum of the audio music.
Step 401: the control device adopts a pre-trained accompaniment extraction model to respectively obtain a left channel accompaniment amplitude spectrum mask of the left channel amplitude spectrum and a right channel accompaniment amplitude spectrum mask of the right channel amplitude spectrum.
Specifically, the control device inputs the left channel amplitude spectrum to the accompaniment extraction model to obtain a left channel accompaniment amplitude spectrum mask of the left channel amplitude spectrum, and inputs the right channel amplitude spectrum to the accompaniment extraction model to obtain a right channel accompaniment amplitude spectrum mask of the right channel amplitude spectrum.
In practical applications, in order to reduce vocal residue, a left channel amplitude spectrum and a right channel amplitude spectrum with an amplitude spectrum range of 2048 Hz are mainly used.
Wherein, when the control device adopts a pre-trained accompaniment extraction model to respectively obtain a left channel accompaniment amplitude spectrum mask of the left channel amplitude spectrum and a right channel accompaniment amplitude spectrum mask of the right channel amplitude spectrum, the following steps can be adopted:
S4010: and respectively carrying out multi-stage convolution processing on the left channel amplitude spectrum and the right channel amplitude spectrum step by step to obtain left channel coding characteristics and right channel coding characteristics of each stage.
S4011: with the attention mechanism, the following steps are performed for the first stage attention gate: and using the coding feature output by the final stage of convolution as gating information and acting on the coding feature connected through skip connections to obtain corresponding obvious coding features.
S4012: the following steps are performed for each of the other stages of attention gates in turn: and using the characteristic extracted by the current convolution as gating information and acting on the coding characteristic connected through skip connections to obtain a corresponding obvious coding characteristic.
The characteristic extracted by the current convolution is obtained by splicing and convolving the significant coding characteristic output by the attention gate of the upper stage and the characteristic extracted by the up-sampling stage.
S4013: a left channel accompaniment amplitude spectrum mask of the left channel amplitude spectrum and a right channel accompaniment amplitude spectrum mask of the right channel amplitude spectrum are output.
Step 402: the control device obtains a left channel accompaniment magnitude spectrum based on the left channel magnitude spectrum and the left channel accompaniment magnitude spectrum mask, and obtains a right channel accompaniment magnitude spectrum based on the right channel magnitude spectrum and the right channel accompaniment magnitude spectrum mask.
Specifically, the control device determines the product of the left channel accompaniment amplitude spectrum mask and the left channel amplitude spectrum of the audio music as the left channel accompaniment amplitude spectrum, and determines the product of the right channel accompaniment amplitude spectrum mask and the right channel amplitude spectrum of the audio music as the right channel accompaniment amplitude spectrum.
Step 403: the control device obtains stereo accompaniment music of the audio music based on the left channel accompaniment amplitude spectrum, the right channel accompaniment amplitude spectrum, the left channel phase spectrum, and the right channel phase spectrum.
Specifically, the control apparatus obtains a spectrogram of the accompaniment music based on the left channel accompaniment amplitude spectrum, the right channel accompaniment amplitude spectrum, the left channel phase spectrum, and the right channel phase spectrum, and performs inverse short-time fourier transform on the obtained spectrogram to obtain the stereophonic accompaniment music.
The spectrogram is obtained by processing the received time domain signal, the abscissa of the spectrogram is time, the ordinate is frequency, and the coordinate point value is voice data energy.
In this way, high-quality accompaniment music can be extracted with the accompaniment extraction model, and the vocals mixed into the accompaniment are reduced; moreover, during accompaniment extraction, the left channel amplitude spectrum and the right channel amplitude spectrum of the audio music are each input into the accompaniment extraction model so that a left channel accompaniment amplitude spectrum mask and a right channel accompaniment amplitude spectrum mask are output, and stereo accompaniment music is thereby obtained.
Further, the control device may further obtain the vocal audio in the audio music according to the left channel accompaniment amplitude spectrum mask and the right channel accompaniment amplitude spectrum mask, and the specific flow is as follows:
s4030: a left channel human voice amplitude spectrum mask is obtained according to the left channel accompaniment amplitude spectrum mask, and a right channel human voice amplitude spectrum mask is obtained according to the right channel accompaniment amplitude spectrum mask.
Specifically, since the accompaniment amplitude spectrum mask is the ratio between the accompaniment amplitude spectrum in the audio music and the amplitude spectrum of the audio music, the accompaniment amplitude spectrum mask can be directly subtracted from 1 to obtain the corresponding human voice amplitude spectrum mask.
The accompaniment amplitude spectrum mask is a left channel accompaniment amplitude spectrum mask and/or a right channel accompaniment amplitude spectrum mask, and the human voice amplitude spectrum mask is a left channel human voice amplitude spectrum mask and/or a right channel human voice amplitude spectrum mask.
S4031: a left channel human voice magnitude spectrum is obtained based on the left channel human voice magnitude spectrum mask and the left channel magnitude spectrum, and a right channel human voice magnitude spectrum is obtained based on the right channel human voice magnitude spectrum mask and the right channel magnitude spectrum.
Specifically, a left channel human voice amplitude spectrum is obtained according to the product between the left channel human voice amplitude spectrum mask and the left channel amplitude spectrum, and a right channel human voice amplitude spectrum is obtained according to the product between the right channel human voice amplitude spectrum mask and the right channel amplitude spectrum.
S4032: and obtaining the voice audio based on the left channel voice amplitude spectrum, the right channel voice amplitude spectrum, the left channel phase spectrum and the right channel phase spectrum.
Specifically, a spectrogram of the vocal audio is obtained based on the left channel human voice amplitude spectrum, the right channel human voice amplitude spectrum, the left channel phase spectrum and the right channel phase spectrum, and the vocal audio is obtained by performing an inverse short-time Fourier transform on that spectrogram.
Thus, the human voice audio in the audio music can be extracted according to the left channel accompaniment amplitude spectrum mask and the right channel accompaniment amplitude spectrum mask. The human voice audio extraction can be applied to various scenes.
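A minimal sketch of S4030-S4032 for one channel follows, under the same assumptions as the sketches above.

```python
import numpy as np
import librosa

def extract_vocals(mag, phase, acc_mask, hop_length=1024):
    """mag, phase, acc_mask: one channel's amplitude spectrum, phase spectrum and
    accompaniment amplitude-spectrum mask (all the same shape)."""
    vocal_mask = 1.0 - acc_mask                  # S4030: human-voice amplitude-spectrum mask
    vocal_mag = vocal_mask * mag                 # S4031: human-voice amplitude spectrum
    vocal_spec = vocal_mag * np.exp(1j * phase)  # S4032: spectrogram of the vocal audio
    return librosa.istft(vocal_spec, hop_length=hop_length)
```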
For example, a singer can be identified from the vocal audio. For another example, songs can be classified based on the singer information identified from the vocal audio. For another example, retrieval, recommendation and the like can be performed based on the singer information identified from the vocal audio. For another example, a song can be identified from the vocal audio. For another example, karaoke singing can be scored based on the vocal audio.
Referring to fig. 5, a flowchart of an implementation of evaluation of an accompaniment music extraction method is shown. In the embodiment of the application, the accompaniment music extracted by the accompaniment music extraction method provided by the application is evaluated and compared with the accompaniment music extracted by other accompaniment extraction methods. The specific implementation procedure for evaluation comparison is as follows:
Step 501: the control device selects a specified number of audio music, and constructs a test set of audio music containing the specified data.
Specifically, the control device acquires a specified number of audio music according to preset music screening conditions, and builds a test set of the audio music containing the specified data.
Alternatively, the audio music may be available as a network (e.g., audio software) download, or may be available from a local music database (e.g., a KTV music library). In practical application, the music screening conditions can be set correspondingly according to practical requirements. Alternatively, the music filtering condition may be set according to the type of music language, the region to which the music belongs, the style of music, and the like. The audio music acquisition path and the music screening condition are not limited herein. The specified number may be set accordingly according to the actual application, and for example, the specified number may be 97.
Step 502: the control device adopts the method for extracting the accompaniment music of each audio music in the test set and adopts other specified accompaniment extraction methods to extract the accompaniment music of each audio music in the test set.
When the control device adopts the method for extracting the accompaniment music provided by the application to extract the accompaniment music of each audio music in the test set, the following steps are executed for the accompaniment music of each audio music in the test set respectively:
S5020: and extracting an audio signal of the audio music, and performing audio analysis on the audio signal through STFT to obtain a left channel phase spectrum, a right channel phase spectrum, a left channel amplitude spectrum and a right channel amplitude spectrum of the audio music.
S5021: and respectively obtaining a left channel accompaniment amplitude spectrum mask of the left channel amplitude spectrum and a right channel accompaniment amplitude spectrum mask of the right channel amplitude spectrum by adopting an accompaniment extraction model.
For the specific steps of S5021, refer to step 401 in the above embodiment.
S5022: a left channel accompaniment magnitude spectrum is obtained based on the left channel magnitude spectrum and the left channel accompaniment magnitude spectrum mask, and a right channel accompaniment magnitude spectrum is obtained based on the right channel magnitude spectrum and the right channel accompaniment magnitude spectrum mask.
For the specific steps of S5022, refer to step 402 in the above embodiment.
S5023: and obtaining stereo accompaniment music of the audio music, namely the accompaniment music of the audio music, based on the left channel accompaniment amplitude spectrum, the right channel accompaniment amplitude spectrum, the left channel phase spectrum and the right channel phase spectrum.
For the specific steps of S5023, refer to step 403 in the above embodiment.
In this way, the accompaniment music extraction method provided by the present application can be used to obtain the accompaniment music of each audio music track in the test set.
Optionally, the other accompaniment extraction methods may be conventional methods, i.e., methods that extract accompaniment by exploiting the fact that the human voice has similar intensity in the left and right channels of most songs, or methods that extract accompaniment based on a deep neural network; this is not limited here.
Step 503: and evaluating the acquired accompaniment music to obtain an evaluation result.
In the embodiment of the present application, professionals with a certain music background evaluate the accompaniment music obtained in the two ways and determine whether each piece of accompaniment music meets the requirements.
For example, each piece of accompaniment music may be scored by these professionals; if its score is above a preset score threshold, it is judged to meet the requirements, otherwise it is judged not to.
Referring to table 1, a table for comparing and evaluating accompaniment music is shown.
Table 1.
Source of accompaniment music | Meets the requirements | Does not meet the requirements
Accompaniment music extracted by the present application | 94 | 3
Accompaniment music extracted in other ways | 58 | 39
In the embodiment of the present application, the evaluation is illustrated with a test set containing 97 pieces of audio music. As can be seen from Table 1, of the accompaniment music extracted by the present application, 94 pieces meet the quality requirement and 3 do not, while of the accompaniment music extracted in other ways, only 58 pieces meet the quality requirement and 39 do not. Clearly, the accompaniment extraction scheme provided by the present application can accurately separate the accompaniment from the audio music and obtain high-quality accompaniment music that meets users' requirements.
In the embodiment of the present application, the attention mechanism is combined with a deep neural network to build the accompaniment extraction model, massive KTV song data is used as model training samples to train it, and the accompaniment is extracted from the left channel amplitude spectrum and the right channel amplitude spectrum separately; the resulting model therefore has strong generalization ability, leaves little vocal residue, and can produce high-fidelity stereo accompaniment music.
Based on the same inventive concept, the embodiment of the application also provides an accompaniment music extraction device, and because the principle of solving the problems of the device and equipment is similar to that of an accompaniment music extraction method, the implementation of the device can be referred to the implementation of the method, and the repetition is omitted.
Fig. 6 is a schematic diagram of an accompaniment music extraction structure according to an embodiment of the present application. An accompaniment music extraction apparatus includes:
a first obtaining unit 61 for obtaining a left channel phase spectrum, a right channel phase spectrum, a left channel amplitude spectrum, and a right channel amplitude spectrum of audio music;
an extracting unit 62, configured to obtain a left channel accompaniment amplitude spectrum mask of the left channel amplitude spectrum and a right channel accompaniment amplitude spectrum mask of the right channel amplitude spectrum respectively by using a pre-trained accompaniment extraction model, where the accompaniment extraction model is a deep neural network based on an attention mechanism and is trained on music sample pair data, each music sample pair comprising an audio music sample and its accompaniment music sample;
A second obtaining unit 63 for obtaining a left channel accompaniment amplitude spectrum based on the left channel amplitude spectrum and the left channel accompaniment amplitude spectrum mask, and obtaining a right channel accompaniment amplitude spectrum based on the right channel amplitude spectrum and the right channel accompaniment amplitude spectrum mask;
a determining unit 64 for determining stereo accompaniment music of the audio music based on the left channel accompaniment amplitude spectrum, the right channel accompaniment amplitude spectrum, the left channel phase spectrum and the right channel phase spectrum.
Preferably, the determining unit 64 is further configured to:
obtaining a left channel human voice amplitude spectrum mask according to the left channel accompaniment amplitude spectrum mask, and obtaining a right channel human voice amplitude spectrum mask according to the right channel accompaniment amplitude spectrum mask;
obtaining a left channel human voice magnitude spectrum based on the left channel human voice magnitude spectrum mask and the left channel magnitude spectrum, and obtaining a right channel human voice magnitude spectrum based on the right channel human voice magnitude spectrum mask and the right channel magnitude spectrum;
and obtaining the voice audio based on the left channel voice amplitude spectrum, the right channel voice amplitude spectrum, the left channel phase spectrum and the right channel phase spectrum.
Preferably, the extracting unit 62 is configured to:
performing multi-stage convolution processing on the left channel amplitude spectrum and the right channel amplitude spectrum step by step to obtain coding features extracted by convolution of each stage, wherein the coding features comprise left channel coding features and right channel coding features;
with the attention mechanism, perform the following step for the first-stage attention gate: use the coding feature output by the final stage of convolution as gating information and apply it to the coding feature passed in through a skip connection to obtain the corresponding significant coding feature;
perform the following step for each of the other attention gates in turn: use the feature extracted by the current convolution as gating information and apply it to the coding feature passed in through a skip connection to obtain the corresponding significant coding feature, where the feature extracted by the current convolution is obtained by concatenating the significant coding feature output by the previous attention gate with the feature extracted by the upsampling stage and then convolving the result;
a left channel accompaniment amplitude spectrum mask of the left channel amplitude spectrum and a right channel accompaniment amplitude spectrum mask of the right channel amplitude spectrum are output.
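To make the role of the attention gates more concrete, the following PyTorch sketch shows one common additive form of attention gating consistent with the description above: a gating signal from a coarser stage re-weights the skip-connection coding features to yield salient coding features. The class name, layer sizes, and the additive formulation are assumptions, not the patent's exact network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Additive attention gate: a coarser-stage gating signal re-weights the
    skip-connection coding features into 'salient' coding features."""
    def __init__(self, skip_channels, gate_channels, inter_channels):
        super().__init__()
        self.theta = nn.Conv2d(skip_channels, inter_channels, kernel_size=1)
        self.phi = nn.Conv2d(gate_channels, inter_channels, kernel_size=1)
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)

    def forward(self, skip_feat, gate_feat):
        # Project both inputs and resize the gating signal to the skip resolution.
        g = F.interpolate(self.phi(gate_feat), size=skip_feat.shape[-2:],
                          mode="bilinear", align_corners=False)
        attn = torch.sigmoid(self.psi(F.relu(self.theta(skip_feat) + g)))
        # The attention map (values in [0, 1]) re-weights the skip features.
        return skip_feat * attn
```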
Preferably, the extracting unit 62 is further configured to:
obtaining a left channel amplitude spectrum and a right channel amplitude spectrum of the audio music sample based on the audio music sample in the pair of music samples, and obtaining an accompaniment amplitude spectrum sample based on the accompaniment music sample in the pair of music samples;
the deep neural network based on the attention mechanism takes the left channel amplitude spectrum and the right channel amplitude spectrum of the audio music sample as input, to obtain a left channel accompaniment amplitude spectrum mask and a right channel accompaniment amplitude spectrum mask of the audio music sample;
obtaining an accompaniment amplitude spectrum of the audio music sample according to the left channel accompaniment amplitude spectrum mask and the left channel amplitude spectrum of the audio music sample, and the right channel accompaniment amplitude spectrum mask and the right channel amplitude spectrum of the audio music sample;
determining a loss function value according to the predicted accompaniment amplitude spectrum and the corresponding accompaniment amplitude spectrum sample;
and adjusting parameters of the accompaniment extraction model according to the loss function value to obtain an adjusted accompaniment extraction model.
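A hedged sketch of one such training step follows, assuming a hypothetical model interface that maps the two channel amplitude spectra to the two accompaniment masks; the use of an L1 loss and of a standard optimizer step are assumptions, since the patent does not fix the loss form.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, left_mag, right_mag, acc_mag_sample):
    """One illustrative parameter update for the accompaniment extraction model.

    Assumed (hypothetical) interface: `model` maps the two channel amplitude
    spectra to the two accompaniment masks; `acc_mag_sample` holds the
    left/right accompaniment amplitude spectrum samples stacked on dim 1.
    """
    left_mask, right_mask = model(left_mag, right_mag)
    # Predicted accompaniment amplitude spectrum = mask * input amplitude spectrum.
    pred_acc_mag = torch.cat([left_mag * left_mask,
                              right_mag * right_mask], dim=1)
    loss = F.l1_loss(pred_acc_mag, acc_mag_sample)   # L1 loss is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```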
Preferably, the music sample is obtained by screening the data according to the following steps:
determining, for each piece of initial music sample pair data, a length difference value and a cosine similarity, where the length difference value is the difference in duration between the audio music sample and the accompaniment music sample, and the cosine similarity is determined according to the similarity between the data of the audio music sample and the data of the accompaniment music sample;
and selecting, from the initial music sample pair data, the music sample pair data whose length difference value is zero and whose cosine similarity is lower than a preset similarity threshold.
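For illustration only, the screening rule could be realized as in the sketch below; computing the cosine similarity directly on the raw waveforms and the 0.99 threshold are assumptions introduced here, not values from the patent.

```python
import numpy as np

def screen_pairs(pairs, sim_threshold=0.99):
    """Keep (mixture, accompaniment) pairs with zero duration difference and
    cosine similarity below the threshold, i.e. pairs that really contain vocals."""
    kept = []
    for mix, acc in pairs:                 # 1-D waveforms at the same sample rate
        if len(mix) != len(acc):           # length difference must be zero
            continue
        cos = np.dot(mix, acc) / (np.linalg.norm(mix) * np.linalg.norm(acc) + 1e-8)
        if cos < sim_threshold:            # too similar => likely no vocal content
            kept.append((mix, acc))
    return kept
```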
Referring to Fig. 7, a schematic diagram of a control device is shown. Based on the same technical concept, an embodiment of the present application further provides a control device, which may include a memory 701 and a processor 702.
The memory 701 is configured to store the computer program executed by the processor 702. The memory 701 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required for at least one function, and the like, and the data storage area may store data created from the use of blockchain nodes, and the like. The processor 702 may be a central processing unit (CPU), a digital processing unit, or the like. The specific connection medium between the memory 701 and the processor 702 is not limited in the embodiments of the present application. In the embodiment of the present application, the memory 701 and the processor 702 are connected through a bus 703 in Fig. 7; the bus 703 is represented by a thick line in Fig. 7, and this manner of connection between components is merely illustrative rather than limiting. The bus 703 may be classified into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in Fig. 7, but this does not mean that there is only one bus or only one type of bus.
The memory 701 may be a volatile memory, such as a random-access memory (RAM); the memory 701 may also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD), or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 701 may also be a combination of the above.
The processor 702 is configured to execute the accompaniment music extraction method provided by the embodiment shown in Fig. 4b when invoking the computer program stored in the memory 701.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the accompaniment music extraction method in any of the above method embodiments is implemented.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware. Based on such understanding, the technical solutions above, in essence or in the part contributing to the related art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and which includes several instructions for causing a control device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (12)

CN201910165261.1A | 2019-03-05 | 2019-03-05 | Accompaniment music extraction method, accompaniment music extraction device, accompaniment music extraction equipment and accompaniment music extraction medium | Active | CN111667805B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910165261.1A | CN111667805B (en) | 2019-03-05 | 2019-03-05 | Accompaniment music extraction method, accompaniment music extraction device, accompaniment music extraction equipment and accompaniment music extraction medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201910165261.1A | CN111667805B (en) | 2019-03-05 | 2019-03-05 | Accompaniment music extraction method, accompaniment music extraction device, accompaniment music extraction equipment and accompaniment music extraction medium

Publications (2)

Publication Number | Publication Date
CN111667805A (en) | 2020-09-15
CN111667805B (en) | 2019-03-05 | 2023-10-13

Family

ID=72381722

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201910165261.1A | Active | CN111667805B (en) | 2019-03-05 | 2019-03-05 | Accompaniment music extraction method, accompaniment music extraction device, accompaniment music extraction equipment and accompaniment music extraction medium

Country Status (1)

Country | Link
CN (1) | CN111667805B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2022082607A1 (en) * | 2020-10-22 | 2022-04-28 | Harman International Industries, Incorporated | Vocal track removal by convolutional neural network embedded voice finger printing on standard arm embedded platform
CN113055809B (en) * | 2021-03-12 | 2023-02-28 | 腾讯音乐娱乐科技(深圳)有限公司 | 5.1 sound channel signal generation method, equipment and medium
CN113836344A (en) * | 2021-09-30 | 2021-12-24 | 广州艾美网络科技有限公司 | Personalized song file generation method and device and music singing equipment
CN114171051B (en) * | 2021-11-30 | 2025-05-13 | 北京达佳互联信息技术有限公司 | Audio separation method, device, electronic device and storage medium
CN115116466A (en) * | 2022-06-17 | 2022-09-27 | 广州小鹏汽车科技有限公司 | Audio data separation method, device, vehicle, server and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US8168877B1 (en) * | 2006-10-02 | 2012-05-01 | Harman International Industries Canada Limited | Musical harmony generation from polyphonic audio signals
US10014002B2 (en) * | 2016-02-16 | 2018-07-03 | Red Pill VR, Inc. | Real-time audio source separation using deep neural networks
US10460727B2 (en) * | 2017-03-03 | 2019-10-29 | Microsoft Technology Licensing, Llc | Multi-talker speech recognizer

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102138341A (en) * | 2009-07-07 | 2011-07-27 | 索尼公司 | Acoustic signal processing device, processing method thereof, and program
CN102402977A (en) * | 2010-09-14 | 2012-04-04 | 无锡中星微电子有限公司 | Method and device for extracting accompaniment and human voice from stereo music
JP2012118234A (en) | 2010-11-30 | 2012-06-21 | Brother Ind Ltd | Signal processing device and program
CN103943113A (en) * | 2014-04-15 | 2014-07-23 | 福建星网视易信息系统有限公司 | Method and device for removing accompaniment from song
CN105282679A (en) * | 2014-06-18 | 2016-01-27 | 中兴通讯股份有限公司 | Stereo sound quality improving method and device and terminal
JP2016156938A (en) | 2015-02-24 | 2016-09-01 | 国立大学法人京都大学 | Singing signal separation method and system
CN106024005A (en) * | 2016-07-01 | 2016-10-12 | 腾讯科技(深圳)有限公司 | Processing method and apparatus for audio data
KR101840015B1 (en) * | 2016-12-21 | 2018-04-26 | 서강대학교산학협력단 | Music Accompaniment Extraction Method for Stereophonic Songs
CN107749302A (en) * | 2017-10-27 | 2018-03-02 | 广州酷狗计算机科技有限公司 | Audio-frequency processing method, device, storage medium and terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SIMPSON A J R. "Deep karaoke: Extracting vocals from musical mixtures using a convolutional deep neural network." International Conference on Latent Variable Analysis and Signal Separation, pp. 429-436. *

Also Published As

Publication number | Publication date
CN111667805A (en) | 2020-09-15

Similar Documents

Publication | Publication Date | Title
CN111667805B (en) | Accompaniment music extraction method, accompaniment music extraction device, accompaniment music extraction equipment and accompaniment music extraction medium
Valle et al. | Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis
Marafioti et al. | GACELA: A generative adversarial context encoder for long audio inpainting of music
Schulze-Forster et al. | Unsupervised music source separation using differentiable parametric source models
US10430154B2 (en) | Tonal/transient structural separation for audio effects
US20240395278A1 (en) | Universal speech enhancement using generative neural networks
Greshler et al. | Catch-a-waveform: Learning to generate audio from a single short example
Moliner et al. | BEHM-GAN: Bandwidth extension of historical music using generative adversarial networks
Airaksinen et al. | Data augmentation strategies for neural network F0 estimation
Sarroff et al. | Blind arbitrary reverb matching
CN116438599A (en) | Human voice track removal by convolutional neural network embedded voice fingerprint on standard ARM embedded platform
Akesbi | Audio denoising for robust audio fingerprinting
Gul et al. | Single-channel speech enhancement using colored spectrograms
Ananthabhotla et al. | Using a neural network codec approximation loss to improve source separation performance in limited capacity networks
Chen et al. | Synthesis and Restoration of Traditional Ethnic Musical Instrument Timbres Based on Time-Frequency Analysis
CN113870836A (en) | Audio generation method, device and equipment based on deep learning and storage medium
Basnet et al. | Deep learning based voice conversion network
Colonel | Autoencoding neural networks as musical audio synthesizers
Shrestha | Chord classification of an audio signal using artificial neural network
Bous | A neural voice transformation framework for modification of pitch and intensity
Alinoori | Music-STAR: a Style Translation system for Audio-based Rearrangement
Walczyński et al. | Comparison of selected acoustic signal parameterization methods in the problem of machine recognition of classical music styles
Landini | Synthetic speech detection through convolutional neural networks in noisy environments
Ligtenberg | The Effect of Deep Learning-based Source Separation on Predominant Instrument Classification in Polyphonic Music
Kumar et al. | Machine learning for audio processing: from feature extraction to model selection

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
