Source of accompaniment music	Meets the requirements	Is not in compliance with the requirements
			Accompaniment music extracted by the application	94	3
Accompaniment music extracted in other ways	58	39

In the embodiment of the application, the Attention mechanism is combined with the deep neural network to establish the accompaniment extraction model, massive KTV song data is used as a model training sample to train the accompaniment extraction model, and the left channel amplitude spectrum and the right channel amplitude spectrum are respectively used for extracting the accompaniment music, so that the model has strong generalization capability and less human voice residues, and high-fidelity stereo accompaniment music can be obtained.

Based on the same inventive concept, the embodiment of the application also provides an accompaniment music extraction device, and because the principle of solving the problems of the device and equipment is similar to that of an accompaniment music extraction method, the implementation of the device can be referred to the implementation of the method, and the repetition is omitted.

Fig. 6 is a schematic diagram of an accompaniment music extraction structure according to an embodiment of the present application. An accompaniment music extraction apparatus includes:

a first obtaining unit 61 for obtaining a left channel phase spectrum, a right channel phase spectrum, a left channel amplitude spectrum, and a right channel amplitude spectrum of audio music;

an extracting unit 62, configured to obtain a left channel accompaniment amplitude spectrum mask of a left channel amplitude spectrum and a right channel accompaniment amplitude spectrum mask of a right channel amplitude spectrum respectively by using a pre-trained accompaniment extraction model, where the accompaniment extraction model is a deep neural network based on an attention mechanism, and is obtained by training data by using music samples, and each music sample includes an audio music sample and an accompaniment music sample thereof;

A second obtaining unit 63 for obtaining a left channel accompaniment amplitude spectrum based on the left channel amplitude spectrum and the left channel accompaniment amplitude spectrum mask, and obtaining a right channel accompaniment amplitude spectrum based on the right channel amplitude spectrum and the right channel accompaniment amplitude spectrum mask;

a determining unit 64 for determining stereo accompaniment music of the audio music based on the left channel accompaniment amplitude spectrum, the right channel accompaniment amplitude spectrum, the left channel phase spectrum and the right channel phase spectrum.

Preferably, the determining unit 64 is further configured to:

obtaining a left channel human voice amplitude spectrum mask according to the left channel accompaniment amplitude spectrum mask, and obtaining a right channel human voice amplitude spectrum mask according to the right channel accompaniment amplitude spectrum mask;

obtaining a left channel human voice magnitude spectrum based on the left channel human voice magnitude spectrum mask and the left channel magnitude spectrum, and obtaining a right channel human voice magnitude spectrum based on the right channel human voice magnitude spectrum mask and the right channel magnitude spectrum;

and obtaining the voice audio based on the left channel voice amplitude spectrum, the right channel voice amplitude spectrum, the left channel phase spectrum and the right channel phase spectrum.

Preferably, the extracting unit 62 is configured to:

performing multi-stage convolution processing on the left channel amplitude spectrum and the right channel amplitude spectrum step by step to obtain coding features extracted by convolution of each stage, wherein the coding features comprise left channel coding features and right channel coding features;

With the attention mechanism, the following steps are performed for the first stage attention gate: the coding feature output by the final stage of convolution is used as gating information and acts on the coding feature connected through skip connections to obtain corresponding obvious coding features;

the following steps are performed for each of the other stages of attention gates in turn: the method comprises the steps of using a characteristic extracted by current convolution as gating information and acting on coding characteristics connected through skip connections to obtain corresponding significant coding characteristics, wherein the characteristic extracted by the current convolution is obtained by splicing and convolving the significant coding characteristics output by a previous level attention gate and the characteristics extracted by an up-sampling stage;

a left channel accompaniment amplitude spectrum mask of the left channel amplitude spectrum and a right channel accompaniment amplitude spectrum mask of the right channel amplitude spectrum are output.

Preferably, the extracting unit 62 is further configured to:

obtaining a left channel amplitude spectrum and a right channel amplitude spectrum of the audio music sample based on the audio music sample in the pair of music samples, and obtaining an accompaniment amplitude spectrum sample based on the accompaniment music sample in the pair of music samples;

the depth neural network based on the attention mechanism takes a left channel amplitude spectrum and a right channel amplitude spectrum of the audio music sample as input to obtain a left channel accompaniment amplitude spectrum mask and a right channel accompaniment amplitude spectrum mask of the audio music sample;

Obtaining an accompaniment amplitude spectrum of the audio music sample according to the left channel accompaniment amplitude spectrum mask and the left channel amplitude spectrum of the audio music sample and the right channel accompaniment amplitude spectrum mask and the right channel amplitude spectrum of the audio music sample;

determining a loss function value according to the predicted accompaniment magnitude spectrum and the corresponding accompaniment magnitude spectrum sample;

and adjusting parameters of the accompaniment extraction model according to the loss function value to obtain an adjusted accompaniment extraction model.

Preferably, the music sample is obtained by screening the data according to the following steps:

respectively determining a length difference value and a cosine similarity of each initial music sample pair data, wherein the length difference value is a difference value of time length between the audio music sample and the accompaniment sample, and the cosine similarity is determined according to the similarity between the data of the audio music sample and the data of the accompaniment sample;

and screening out the music sample pair data with the length difference value of zero and cosine similarity lower than a preset similarity threshold value from the initial music sample pair data.

Referring to fig. 7, a schematic diagram of a control device is shown. Based on the same technical concept, the embodiment of the present application also provides a control device, which may include a memory 701 and a processor 702.

The memory 701 is used for storing a computer program executed by the processor 702. The memory 701 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like. The processor 702 may be a central processing unit (central processing unit, CPU), or a digital processing unit, etc. The specific connection medium between the memory 701 and the processor 702 is not limited in the embodiments of the present application. In the embodiment of the present application, the memory 701 and the processor 702 are connected through the bus 703 in fig. 7, the bus 703 is shown by a thick line in fig. 7, and the connection manner between other components is only schematically illustrated, but not limited to. The bus 703 may be classified into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 7, but not only one bus or one type of bus.

The memory 701 may be a volatile memory (RAM), such as a random-access memory (RAM); the memory 701 may also be a non-volatile memory (non-volatile memory), such as a read-only memory, a flash memory (flash memory), a Hard Disk Drive (HDD) or a Solid State Drive (SSD), or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. Memory 701 may be a combination of the above.

A processor 702 for executing the method of extracting accompaniment music provided by the embodiment shown in fig. 4b when calling the computer program stored in the memory 701.

The embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the method for extracting accompaniment music in any of the above method embodiments.

From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general purpose hardware platform, or may be implemented by hardware. Based on such understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the related art in the form of a software product, which may be stored in a computer readable storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a control device (which may be a personal computer, a server, or a network device, etc.) to perform the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

1. A method of extracting accompaniment music, comprising:

a pre-trained accompaniment extraction model is adopted to respectively obtain a left channel accompaniment amplitude spectrum mask of the left channel amplitude spectrum and a right channel accompaniment amplitude spectrum mask of the right channel amplitude spectrum, the accompaniment extraction model is a deep neural network based on an attention mechanism, data are trained by adopting music samples, and each music sample comprises an audio music sample and an accompaniment music sample thereof; wherein the accompaniment magnitude spectrum mask represents: a ratio between an amplitude spectrum of accompaniment in the audio music and an amplitude spectrum of the audio music;

determining the product of the left channel amplitude spectrum and the left channel accompaniment amplitude spectrum mask as a left channel accompaniment amplitude spectrum, and determining the product of the right channel amplitude spectrum and the right channel accompaniment amplitude spectrum mask as a right channel accompaniment amplitude spectrum;

and determining stereo accompaniment music of the audio music based on the left channel accompaniment magnitude spectrum, the right channel accompaniment magnitude spectrum, the left channel phase spectrum and the right channel phase spectrum.

2. The method as recited in claim 1, further comprising:

obtaining a left channel voice amplitude spectrum mask according to the left channel accompaniment amplitude spectrum mask, and obtaining a right channel voice amplitude spectrum mask according to the right channel accompaniment amplitude spectrum mask;

obtaining a left channel human voice amplitude spectrum based on the left channel human voice amplitude spectrum mask and the left channel amplitude spectrum, and obtaining a right channel human voice amplitude spectrum based on the right channel human voice amplitude spectrum mask and the right channel amplitude spectrum;

3. The method of claim 1, wherein the obtaining a left channel accompaniment amplitude spectrum mask of the left channel amplitude spectrum and a right channel accompaniment amplitude spectrum mask of the right channel amplitude spectrum using a pre-trained accompaniment extraction model, respectively, comprises:

and outputting a left channel accompaniment amplitude spectrum mask of the left channel amplitude spectrum and a right channel accompaniment amplitude spectrum mask of the right channel amplitude spectrum.

4. A method according to any one of claims 1-3, wherein the accompaniment extraction model is based on a deep neural network of an attention mechanism and is trained on data using music samples, wherein the accompaniment extraction model is trained on data using music samples, comprising in particular:

obtaining a predicted accompaniment amplitude spectrum of the audio music sample according to the left channel accompaniment amplitude spectrum mask and the left channel amplitude spectrum of the audio music sample and the right channel accompaniment amplitude spectrum mask and the right channel amplitude spectrum of the audio music sample;

5. The method of claim 4, wherein the music sample pair data is obtained by screening the following steps:

6. An accompaniment music extracting apparatus, comprising:

the extraction unit is used for respectively obtaining a left channel accompaniment amplitude spectrum mask of the left channel amplitude spectrum and a right channel accompaniment amplitude spectrum mask of the right channel amplitude spectrum by adopting a pre-trained accompaniment extraction model, the accompaniment extraction model is a deep neural network based on an attention mechanism, the accompaniment extraction model is obtained by training data by adopting music samples, and each music sample comprises an audio music sample and an accompaniment music sample thereof; wherein the accompaniment magnitude spectrum mask represents: a ratio between an amplitude spectrum of accompaniment in the audio music and an amplitude spectrum of the audio music;

a second obtaining unit, configured to determine a product of the left channel amplitude spectrum and a left channel accompaniment amplitude spectrum mask as a left channel accompaniment amplitude spectrum, and determine a product of the right channel amplitude spectrum and a right channel accompaniment amplitude spectrum mask as a right channel accompaniment amplitude spectrum;

A determining unit configured to determine stereo accompaniment music of the audio music based on the left channel accompaniment amplitude spectrum, the right channel accompaniment amplitude spectrum, the left channel phase spectrum, and the right channel phase spectrum.

7. The apparatus of claim 6, wherein the determining unit is further to:

8. The apparatus of claim 6, wherein the extraction unit is to:

9. The apparatus of any one of claims 6-8, wherein the extraction unit is further configured to:

10. The apparatus of claim 9, wherein the music sample pair data is obtained by screening the following steps:

11. A control apparatus, characterized by comprising:

at least one memory for storing program instructions;

at least one processor for invoking program instructions stored in said memory and for performing the steps of the method according to any of the preceding claims 1-5 according to the obtained program instructions.

12. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1-5.