Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
1. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of sensing, reasoning and decision-making.
The artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing and machine learning/deep learning.
2. Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include technologies such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration. In the present application, a trained machine learning model is used to process the audio to be matched to obtain feature information and tag information of the audio to be matched, and the target candidate audio of the audio to be matched is determined through the feature information and the tag information.
With the research and progress of artificial intelligence technology, it has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, autonomous driving, smart healthcare and smart customer service. It is believed that, with the development of technology, artificial intelligence technology will be applied in more audio processing fields and play an increasingly important role.
3. Mel (Mel) frequency refers to a non-linear frequency scale determined based on the human ear's perception of equidistant pitch (pitch) changes. When processing a signal, the Mel frequency scale is set artificially so as to fit the change of the auditory perception threshold of the human ear. In the field of audio processing, the frequency features of audio can be calculated on the Mel scale.
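As an illustration of the Mel scale described above, the following sketch (not part of the claimed method) shows the commonly used Hz-to-Mel conversion formula and the computation of log-Mel features; the use of the librosa library, the synthetic tone and the parameter values are assumptions made only for illustration.

```python
import numpy as np
import librosa

def hz_to_mel(f_hz):
    # Common HTK-style Hz-to-Mel conversion formula.
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

sr = 32000
t = np.arange(sr) / sr
waveform = np.sin(2 * np.pi * 440.0 * t)            # 1 second of a 440 Hz tone (illustrative)
mel_spec = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel_spec)              # (n_mels, n_frames) log-Mel features
print(hz_to_mel(440.0), log_mel.shape)
```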
4. The convolutional neural network (Convolutional Neural Network, CNN) is a feed-forward neural network. A convolutional neural network contains artificial neurons that respond to surrounding units within a local receptive field, which improves the correlation between information at different positions. A convolutional neural network consists of at least one convolutional layer and a fully connected layer (corresponding to a classical neural network), and also includes associated weights and pooling layers (pooling layer).
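The following is a minimal, illustrative PyTorch sketch of the structure just described (a convolutional layer, a pooling layer and a fully connected layer); it is not the network used in the embodiments, and all layer sizes and the number of classes are arbitrary example values.

```python
import torch
import torch.nn as nn

class SimpleAudioCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional layer (local receptive field)
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling layer
        )
        self.fc = nn.Linear(16, num_classes)             # fully connected layer

    def forward(self, x):                                # x: (batch, 1, mels, frames)
        h = self.conv(x)
        h = h.mean(dim=(2, 3))                           # global average pooling -> (batch, 16)
        return self.fc(h)

x = torch.randn(2, 1, 64, 100)    # batch of 2 log-Mel spectrograms (illustrative)
print(SimpleAudioCNN()(x).shape)  # torch.Size([2, 10])
```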
5. The pre-trained audio neural network (Pretrained Audio Neural Networks, PANN) is an audio neural network trained on a large-scale audio dataset. The pre-trained audio neural network may be used for audio pattern recognition or for quantizing frame-level audio embeddings (embedding).
The pre-trained audio neural network can be used as a front-end encoding network of a model, extracting waveform pattern features learned from the waveform, for example in an audio feature extraction scheme combining waveform patterns with a log-Mel CNN. The pre-trained audio neural network achieves state-of-the-art performance in audio set tagging. It can be used to handle a wide range of audio pattern recognition tasks with good results, and fine-tuning the pre-trained audio neural network with a small amount of data corresponding to a new task can also achieve a good processing effect.
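The fine-tuning idea can be sketched as follows. The encoder below is a small stand-in defined inline rather than an actual PANN checkpoint, and the data, dimensions, checkpoint path and optimizer settings are illustrative assumptions, not details from this application.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained audio encoder; a real PANN checkpoint would be loaded instead.
class AudioEncoder(nn.Module):
    def __init__(self, embedding_dim=128):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.net = nn.Sequential(nn.Conv1d(1, 16, 9, stride=4), nn.ReLU(),
                                 nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                 nn.Linear(16, embedding_dim))
    def forward(self, x):                    # x: (batch, 1, samples)
        return self.net(x)

encoder = AudioEncoder()
# encoder.load_state_dict(torch.load("pann_pretrained.pt"))  # illustrative checkpoint path

num_new_classes = 12
head = nn.Linear(encoder.embedding_dim, num_new_classes)     # new task-specific head
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()           # multi-label confidences via sigmoid

# One illustrative fine-tuning step on a small synthetic batch of new-task data.
waveforms = torch.randn(4, 1, 32000)
labels = torch.randint(0, 2, (4, num_new_classes)).float()
optimizer.zero_grad()
loss = criterion(head(encoder(waveforms)), labels)
loss.backward()
optimizer.step()
```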
6. Spatial similarity measures the similarity between two vectors by the cosine of the angle between them, and is also known as spatial distance. The cosine of a 0-degree angle is 1, the cosine of any other angle is less than 1, and the minimum cosine value is -1; therefore the spatial similarity between two vectors lies in the range [-1, 1], and in the positive space it lies in [0, 1].
Determining the similarity of two vectors in space according to the cosine of the angle between them is equivalent to determining the spatial included angle between the two vectors, that is, how well their directions coincide. When the two vectors point in the same direction in space, the cosine similarity value is 1; when the spatial included angle between the two vectors is 90 degrees, the cosine similarity value is 0, indicating lower similarity; when the two vectors point in exactly opposite directions in space, the two vectors are not similar at all, and the cosine similarity value is -1. The cosine value between two vectors is related only to the directions of the vectors and is independent of their lengths. Cosine similarity is usually used in the positive space, where the cosine value lies in the range [0, 1].
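A minimal sketch of the cosine (spatial) similarity described above, using NumPy; the vectors are illustrative and the three calls reproduce the same-direction, 90-degree and opposite-direction cases.

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Value depends only on the angle between the vectors, not on their lengths.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 0], [2, 0]))    # same direction -> 1.0
print(cosine_similarity([1, 0], [0, 3]))    # 90 degrees     -> 0.0
print(cosine_similarity([1, 0], [-1, 0]))   # opposite       -> -1.0
```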
7. Musical emotion is defined by a person's subjective perception. After the human ear perceives the audio signal of a piece of music, the audio signal can resonate with the listener's subjective emotions. Music itself does not carry any emotion; emotion is a person's subjective perception of the music. Because of differences between individuals, the same piece of music may evoke different musical emotions in different listeners. Optionally, musical emotion is also referred to as audio emotion.
In the related art, there are the following audio matching methods:
1. In the process of audio matching, a person responsible for product operation manually searches and matches the entire copyrighted music library (corresponding to the audio library in claim 1), evaluates the music in the library through subjective listening, finds the target candidate audio corresponding to the audio to be matched, and replaces the audio to be matched with the target candidate audio.
However, manual audio matching has a number of drawbacks. On the one hand, manual audio matching is inefficient, and when the amount of audio stored in the copyrighted music library is large, a great deal of time is consumed.
On the other hand, a person's hearing is subjective, and the musical emotion that the same song brings to different staff members is not identical. Relying on manual work alone for music matching easily leads to individual differences. That is, when different staff members perform audio matching on the same audio to be matched, affected by the differences in hearing among them, the target candidate audio they respectively determine may differ greatly, and a standardized audio matching process cannot be achieved.
2. Audio matching is performed through a music classification network. The tag corresponding to each audio in the copyrighted music library is determined through the music classification network, the tag corresponding to the audio to be matched is determined through the music classification network, a candidate audio set is screened from the copyrighted music library according to the tag corresponding to the audio to be matched, and then fine matching is performed manually.
However, the tags in the related art generally classify an audio into only a single tag, whereas in practice one audio often corresponds to a plurality of tags. Categorizing the audio into a single tag in the above approach limits the description of the audio.
Meanwhile, in this method, although the audio in the copyrighted music library is initially screened using the audio tag and the range in which the target candidate audio is determined is narrowed, the fine matching process still needs to be performed manually, and problems such as low audio matching efficiency and low audio matching accuracy remain unsolved.
3. Extracting the feature vector of the audio to be matched by using a traditional network model, and then carrying out similarity matching with the feature vector of the audio in the copyright music library to determine target candidate audio.
However, this method consumes a large amount of computing resources in the audio matching process, and because no other information assists the audio matching process, the determined target candidate audio easily has low similarity with the audio to be matched in at least one classification mode.
In view of the problems of low accuracy, low efficiency and the like in the audio matching methods of the related art, the present application provides a new audio matching method that can improve the accuracy of audio matching without consuming a large amount of computing resources.
FIG. 1 is a schematic diagram of an implementation environment for an approach provided by an exemplary embodiment of the present application. The implementation environment of the scheme can comprise: a computer device 10, a terminal device 20, and a server 30.
The computer device 10 includes, but is not limited to, a personal computer (Personal Computer, PC), a cell phone, a tablet computer, and the like. The computer device 10 is capable of providing audio matching services. In some embodiments, the audio matching is accomplished by a machine learning model, which is run on the computer device 10 to enable the audio matching.
The terminal device 20 may be an electronic device such as a personal computer, a tablet computer, a cell phone, a wearable device, a smart home appliance, a vehicle-mounted terminal, etc. A client of a target application runs on the terminal device 20. The target application is capable of playing multimedia files (e.g., audio, video). For example, the target application is an application for processing audio services, where the audio services include, but are not limited to, at least one of: audio playback, audio recognition, and the like. As another example, the target application is an application with a video playback function.
In addition, the target application may be a news application, a shopping application, a social application, an interactive entertainment application, a browser application, a content sharing application, a virtual reality application, an augmented reality application, and the like, which is not limited in the embodiments of the present application. In addition, the types of audio processed by different applications and the corresponding functions may differ, and may be configured in advance according to actual requirements, which is also not limited in the embodiments of the present application.
The terminal device 20 has at least a data receiving function and a storage function. Optionally, the terminal device 20 obtains, through its data receiving function, the audio to be matched input by the user, and provides the audio to be matched to the computer device 10 for audio matching.
The server 30 is used to provide background services for the client of the target application in the terminal device 20. For example, the server 30 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data and artificial intelligence platforms, but is not limited thereto.
The server 30 has at least a data receiving function, a storage function and a computing function. The server 30 corresponds to a video database and an audio database. In the event that audio to be matched in a certain video needs to be replaced, the server 30 requests the computer device 10 to perform audio matching for the audio to be matched.
In some embodiments, a machine learning model for audio matching may be run on the terminal device 20 or the server 30. That is, the computer device 10 may be implemented as the terminal device 20 or the server 30. Of course, the computer device 10 may also be a device other than the terminal device 20 or the server 30.
In one example, the audio matching method provided herein may be used for audio replacement. For example, the audio matching method performs audio replacement for background music used in a movie or television drama whose copyright is about to expire. The copyrights of the drama and of the background music it uses may be independent of each other; to avoid continuing to use the background music in the drama beyond the term of its copyright, audio matching is required when the copyright of the background music is about to expire, and replacement music is selected to be used in the drama in place of the background music.
The background music mentioned in this example corresponds to the audio to be matched in the claims; the mentioned replacement music corresponds to the target candidate audio in the claims.
In another example, the audio matching method provided herein may be used for audio recommendation. For example, the audio matching method provided by the present application is used for creation assistance on a short-video secondary-creation platform. In the process of creating a short video clip, a creator may refer to the works of other creators on the short-video platform and select background music from those works. The creator indicates the selected background music to the terminal device 10, expecting to obtain a set of background music candidates similar to the selected background music.
Through the audio matching method, the creator can be helped to find a background music candidate set similar to the selected background music, which facilitates rapid creation by the creator.
The selected background music mentioned in this example corresponds to the audio to be matched in the claims; the mentioned background music candidate set includes the target candidate audio in the claims.
In another example, the audio matching method provided by the present application may be used for song recognition. For example, the audio matching method provided by the application is used for audio recognition: in the case that a user desires to obtain information about a certain song, the user uses the terminal device 20 to input a piece of audio to be recognized. The server 30 receives the audio to be recognized and requests the computer device 10 to perform audio matching to obtain the target candidate audio corresponding to the audio to be recognized. Optionally, the server 30 obtains the target candidate audio and transmits the target candidate audio to the terminal device 20. The audio to be recognized mentioned in this example corresponds to the audio to be matched in the claims.
Of course, the above several application scenarios are merely exemplary, and the audio matching method provided in the present application may also be applicable to other scenarios, which is not limited in this application.
Fig. 2 is a flowchart of an audio matching method provided in an exemplary embodiment of the present application. By way of example, the subject of execution of the method may be the computer device 10 of Fig. 1. The method may include the following steps (210-240):
step 210, obtaining feature information of the audio to be matched, wherein the feature information of the audio to be matched is used for representing semantic features of the audio to be matched.
The audio to be matched refers to any piece of audio. In some embodiments, the audio to be matched may be from background music, soundtrack in a piece of video. For example, the audio to be matched may be a title, a tail, background music, etc. in the video. In other embodiments, the audio to be matched may be a musical composition, song, or the like that appears independently.
Because the tempo and rhythm of different audio to be matched are not identical, and the instrumentation of the audio to be matched is not identical, different audio to be matched differ in aspects such as audio style and audio emotion. The audio style is used to characterize the kind of music to which the audio belongs, and the audio emotion refers to the subjective emotion generated by the listener while listening to the audio, as mentioned above. In some embodiments, the audio to be matched may also be referred to as the audio to be recognized.
In some embodiments, the audio to be matched belongs to at least one audio style. For example, a certain audio a to be matched belongs to the rock and roll style. For another example, a certain audio B to be matched belongs to both dance music style and classical style. It should be noted that, the audio styles of different audio to be matched are not identical, and the audio styles of the audio to be matched are determined according to the actual situation of the audio to be matched.
In some embodiments, the audio to be matched can cause at least one audio emotion. For example, a certain audio C to be matched can make people feel happy and healed subjectively.
The feature information of the audio to be matched is used to represent semantic features of the audio to be matched. In some embodiments, the semantic features include feature information in the audio to be matched. Optionally, the semantic features relate to characteristics of the audio to be matched in the time domain, the frequency domain and the like, as well as of audio segments within the playing duration. In general, the feature information corresponding to different audio is not identical, and the correlation between different audio can be determined through their feature information.
Alternatively, the feature information of the audio to be matched is represented using a vector form, in which case the feature information may be referred to as a feature vector. In some embodiments, the computer device obtains feature information of the audio to be matched by processing the audio to be matched. For specific steps of obtaining the feature information of the audio to be matched, please refer to the following embodiments.
In some embodiments, a computer device obtains audio to be matched and determines feature information for the audio to be matched. Optionally, the audio to be matched is from a video file.
In one example, the audio to be matched belongs to a long video (e.g., a video whose playing duration exceeds 5 minutes and whose content is strongly connected). For example, the audio to be matched is background music in a movie or television drama. Since the copyright of the drama and the copyright of the audio to be matched are independent of each other, the drama stops using the background music after the copyright of the audio to be matched expires. In this case, the computer device uses the background music as the audio to be matched, performs audio matching on it, and determines at least one target candidate audio; optionally, the target candidate audio replaces the background music whose copyright has expired in the drama.
For another example, the audio to be matched belongs to background music in a short video. In the process of editing the short video by the user, the computer equipment performs audio matching according to the audio to be matched specified by the user, and determines at least one target candidate audio for the user to select and use.
In another example, the audio to be matched refers to a separate audio file. For example, in the case where the user uses the audio recognition function of the target application to listen to songs and recognize songs, the computer device obtains audio to be matched recorded by the user, and performs audio matching on the audio to be matched to obtain at least one target candidate audio.
Step 220, determining the label information of the audio to be matched according to the characteristic information of the audio to be matched, wherein the label information of the audio to be matched comprises the confidence degrees respectively corresponding to the audio to be matched under a plurality of categories, and the confidence degrees are used for representing the correlation degree between the audio to be matched and the categories.
In some embodiments, the tag information of the audio to be matched is used to characterize the confidence that the audio to be matched corresponds to under multiple categories, respectively. Optionally, the tag information of the audio to be matched is represented in a one-dimensional vector form, and values of different positions in the vector are used for representing the correlation degree of the audio to be matched and a certain category.
Alternatively, the tag information of the audio to be matched is referred to as multi-classification confidence of the audio to be matched.
In some embodiments, the classification scheme includes at least one of: audio style, audio emotion, audio tempo, instrument composition in audio, and language to which the audio belongs, etc.
For any one of the plurality of classification schemes, the classification scheme corresponds to at least one category. A category may be understood as one of the categories into which this classification scheme divides the audio. For an explanation of this process, please refer to the following examples.
In some embodiments, the confidence level of the audio to be matched corresponding to a certain category is used to characterize the correlation degree between the audio to be matched and the category. Optionally, for a classification scheme, the audio to be matched is related to multiple categories in the classification scheme at the same time. The label information of the audio to be matched comprises the correlation degree between the audio to be matched and a plurality of categories in at least one classification mode.
The tag information includes the confidences of the audio to be matched corresponding respectively to a plurality of categories, so that the information carried in the tag information of the audio to be matched is richer and the ability of the tag information to describe the categories to which the audio to be matched belongs is improved. Therefore, more information is available in the audio matching process, which improves the accuracy of audio matching.
Alternatively, the tag information of the audio to be matched includes the correlation degrees corresponding to the first t categories with the highest correlation with the audio to be matched under a certain classification mode, t being a positive integer. In this case, the correlation between the audio to be matched and the other categories of that classification mode is not considered in the audio matching process, which helps to reduce the amount of computation when performing audio matching.
In some embodiments, the tag information of the audio to be matched includes the confidences of the audio to be matched corresponding respectively to the plurality of categories. In order to improve the accuracy of audio matching, the tag information of the audio to be matched may include the confidences corresponding respectively to all categories, where all categories are understood as the union of the categories included in each classification scheme.
In some embodiments, the tag information of the audio to be matched consists of confidence levels of the audio to be matched corresponding to the plurality of categories, respectively. Optionally, the computer device arranges the confidence degrees corresponding to the audio to be matched under a plurality of categories according to the arrangement sequence to obtain the label information of the audio to be matched.
The arrangement order can be preset, that is, the same position in the tag information of different audio represents the correlation between the audio and the same category, which helps to simplify the calculation of correlation according to the tag information of the audio.
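The following sketch illustrates, under assumed category names and an assumed fixed order, how confidences under multiple categories might be arranged into tag information so that the same position always refers to the same category; it is an illustration rather than the implementation of the embodiments.

```python
import numpy as np

# Assumed fixed category order; the names are illustrative only.
CATEGORY_ORDER = ["rock", "pop", "classical", "happy", "sad", "healing"]

def build_tag_info(confidences_by_category):
    """confidences_by_category: dict mapping category name -> confidence in [0, 1]."""
    # Missing categories default to 0 so every tag vector has the same layout.
    return np.array([confidences_by_category.get(c, 0.0) for c in CATEGORY_ORDER])

tag_info = build_tag_info({"rock": 0.9, "happy": 0.7, "healing": 0.4})
print(tag_info)   # [0.9 0.  0.  0.7 0.  0.4]
```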
For a method of determining tag information about audio to be matched, please refer to the following embodiments.
The confidence degrees under the multiple categories are included in the tag information of the audio to be matched, so that the description capability of the tag information on the category of the audio to be matched is improved, and the accuracy of subsequent audio matching is improved.
Step 230, selecting at least one candidate audio from the audio library according to the tag information of the audio to be matched and the tag information of each audio contained in the audio library.
After determining the tag information of the audio to be matched, the computer equipment screens a plurality of audios in the audio library according to the tag information of the audio to be matched to obtain at least one candidate audio.
The audio library includes at least one audio. Optionally, the audio in the audio library is within a copyright-allowed lifetime. The subjective feelings brought by a plurality of audios of the audio library to a listener are not identical, and in the process of audio matching, at least one target candidate audio with higher similarity with the audio to be matched needs to be determined from the audio library.
In some embodiments, the computer device filters the audio in the audio library according to the tag information of the audio to be matched and the tag information of each audio contained in the audio library to obtain at least one candidate audio.
The candidate audio is audio with high similarity with the audio to be matched in at least two categories.
In some embodiments, the computer device determines at least one candidate audio by the audio screening conditions, tag information of the audio to be matched, and tag information of each audio in the audio library; the audio screening conditions are used for screening at least one candidate audio from the audio library.
In some embodiments, the computer device calculates a tag similarity between tag information of the audio to be matched and tag information of the audio in the audio library; the computer device treats at least one audio whose tag similarity satisfies the audio screening condition as at least one candidate audio.
For example, the computer device determines the tag similarity between the tag information of the audio to be matched and the tag information of the audio in the audio library by calculating the spatial similarity between the tag information of the audio to be matched and the tag information of the audio in the audio library, and the like.
Optionally, the audio filtering condition is used for indicating that the first n audio frequencies with the highest label similarity in the audio library are used as candidate audio frequencies, and n is a positive integer.
Optionally, the audio screening condition is used to indicate that, in the audio library, the audio whose tag similarity is greater than or equal to a similarity threshold k, k ∈ (0, 1), is taken as the at least one candidate audio. For example, k is equal to 0.75. The audio screening condition is set according to actual conditions, and the present application is not limited thereto.
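A possible sketch of the primary screen described above, combining the top-n condition and the threshold condition on tag similarity; the library contents, the values of n and k, and the tag vectors are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def screen_candidates(query_tag, library_tags, n=100, k=0.75):
    # Tag similarity between the audio to be matched and each audio in the library.
    sims = [(audio_id, cosine(query_tag, tag)) for audio_id, tag in library_tags.items()]
    sims = [(audio_id, s) for audio_id, s in sims if s >= k]   # threshold condition
    sims.sort(key=lambda item: item[1], reverse=True)
    return sims[:n]                                            # top-n condition

library_tags = {"song_a": np.array([0.9, 0.1, 0.8]),
                "song_b": np.array([0.1, 0.9, 0.2]),
                "song_c": np.array([0.8, 0.2, 0.7])}
print(screen_candidates(np.array([0.85, 0.15, 0.75]), library_tags, n=2, k=0.75))
```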
The computer device uses the tag information of the audio to be matched to perform a primary screen on the audio in the audio library; typically, the computer device determines a plurality of candidate audio from the audio library so as to further finely screen out the target candidate audio from the plurality of candidate audio.
On the one hand, the tag information (of the audio to be matched and of each audio in the audio library) includes the confidences of the audio under a plurality of categories, so that, compared with screening using a single tag, the degree of correlation between the screened at least one candidate audio and the audio to be matched under a plurality of categories is improved. Moreover, because the tag information of the audio includes confidences under a plurality of categories, the screening condition is more restrictive and the number of determined candidate audio is reduced.
On the other hand, the number of categories corresponding to the audio is limited, that is, the data amount of the audio tag is limited (for example, the data amount of the tag information of the audio is small compared with the data amount of the feature information of the audio); using the tag information of the audio for the primary screen in the audio library therefore reduces the computing resources consumed in determining the candidate audio from the audio library and increases the speed of determining the candidate audio.
Step 240, determining target candidate audio matching the audio to be matched from at least one candidate audio.
In some embodiments, the target candidate audio refers to audio that is similar to the audio to be matched. Optionally, subjective emotion brought by the target candidate audio to the audience is relatively close to subjective emotion brought by the audio to be matched to the audience; the audio style to which the target candidate audio belongs is similar to the audio style to which the audio to be matched belongs.
Optionally, the number of the determined target candidate audios is determined according to actual needs from at least one candidate audio. For example, in a scenario where audio substitution is performed, 1 target candidate audio may be determined. For example, in a scenario where audio recommendation or audio recognition is performed, at least one target candidate audio may be determined for further selection by the user.
In some embodiments, the computer device performs a fine screen on the at least one candidate audio to determine the target candidate audio that matches the audio to be matched. Optionally, the computer device determines the target candidate audio according to the feature information of the audio to be matched and the feature information respectively corresponding to the at least one candidate audio. In this case, the candidate audio whose feature information has the highest similarity with the feature information of the audio to be matched is the target candidate audio.
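A minimal sketch of this fine screening step, under the assumption that feature similarity is measured by cosine similarity; the candidate identifiers and feature vectors are illustrative.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_target(query_features, candidate_features):
    """candidate_features: dict mapping candidate audio id -> feature vector."""
    # The candidate whose features are most similar to the query's features wins.
    return max(candidate_features, key=lambda cid: cosine(query_features, candidate_features[cid]))

candidates = {"song_a": np.array([0.2, 0.9, 0.4]), "song_c": np.array([0.1, 0.3, 0.9])}
print(pick_target(np.array([0.15, 0.8, 0.5]), candidates))   # -> "song_a"
```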
Since the feature information of the audio to be matched has already been determined in the process of determining its tag information, determining the target candidate audio from the at least one candidate audio through the feature information of the audio to be matched does not require generating other features of the audio to be matched, which helps to streamline the audio matching process. During model training, this reduces the number of models to be selected and trained, and improves the convergence speed of the model training process.
The audio matching method provided by the application is generally described below by way of one embodiment.
Under the condition that audio matching is required, the computer equipment firstly determines the characteristic information of the audio to be matched, and determines the label information of the audio to be matched according to the characteristic information of the audio to be matched so as to grasp the confidence degrees respectively corresponding to the audio to be matched under a plurality of categories.
The computer device performs a primary screen in the audio library according to the tag information of the audio to be matched to obtain at least one candidate audio. Optionally, the correlation between the tag information of the candidate audio and the tag information of the audio to be matched is higher than the correlation between the tag information of the other audio in the audio library and the tag information of the audio to be matched.
Typically, the computer device will select a plurality of candidate audio from an audio library. After the initial screening is completed, a plurality of candidate audios which are similar to the audios to be matched in classification modes such as music styles, music moods and the like can be obtained. The computer device then performs a refinement screen on the plurality of candidate audio to obtain a target candidate audio.
In summary, in the audio matching method provided by the present application, in the process of determining the target candidate audio, tag information including multi-category confidence is first used to select at least one candidate audio from the audio library, and then the target candidate audio is selected from the at least one candidate audio. On the one hand, compared with the audio matching method in the related art, the method avoids the interference influence of subjective factors on the audio matching process, and is beneficial to improving the accuracy of audio matching. By generating the tag information comprising the multi-classification confidence coefficient, the candidate audio is screened from the audio library, so that the correlation degree of the candidate audio and the audio to be matched under a plurality of categories is improved, the number of the candidate audio determined from the audio library is limited, and the time consumption of the whole audio matching process is reduced.
On the other hand, the audio matching method provided by the present application automates audio matching and reduces the labor cost consumed in the audio matching process. Rapid audio matching can be achieved even when the audio library contains audio on the order of millions.
Next, description will be made of a method of acquiring tag information of candidate audio by several embodiments.
In some embodiments, the computer device determines tag information of the audio to be matched according to the feature information of the audio to be matched, including: the computer equipment determines the classification result of the audio to be matched according to the characteristic information of the audio to be matched through a plurality of different classification networks; wherein, different classification networks correspond to different classification modes, and the classification result determined by each classification network comprises: confidence degrees respectively corresponding to the audio to be matched under a plurality of categories of the classification mode corresponding to the classification network; and the computer equipment determines the label information of the audio to be matched according to the classification results respectively determined by the plurality of different classification networks.
In some embodiments, the classification approach refers to an approach that divides the domain to which the audio to be matched belongs. As described above, the classification means includes, but is not limited to, at least one of: audio style, audio emotion, audio tempo, instrument composition in audio, source of human voice in audio, and language to which the audio belongs.
Optionally, the number and the names of the categories respectively included in the different classification modes are not identical.
For example, for the classification mode of audio emotion, the corresponding emotion categories are shown in table 1:
TABLE 1
| Sad | Happy | Inspirational | Healing | Longing | Sweet |
| Cathartic | Sentimental | Tense | Epic | Battle | Funny |
For example, for the classification mode of audio styles, the corresponding style categories are shown in table 2:
TABLE 2
| Rock | Pop | Dance | Classical | Ballad | Electronic | Hip-hop/Rap |
| Blues | Latin | Light music | Country | Punk | Metal | Workout |
The classification method is set according to actual conditions, and the present application is not limited thereto.
In some embodiments, the classification network is configured to determine the classification result of the audio to be matched under the corresponding classification mode, where the classification result of the audio to be matched includes: the confidences of the audio to be matched corresponding respectively to the plurality of categories included in that classification mode. That is, the same classification network is used to determine the confidences of the audio to be matched corresponding to the plurality of categories included in a certain classification mode.
Optionally, the feature information of the audio to be matched is input into a classification network corresponding to a certain classification mode, so that the confidence degrees respectively corresponding to the audio to be matched and a plurality of categories included in the classification mode can be obtained.
In some embodiments, the classification network belongs to a machine learning network and is used to predict, according to the feature information of the audio to be matched, the classification result corresponding to the audio to be matched under a certain classification mode. In some embodiments, the classification network includes at least one pooling layer and an activation layer, which are used to integrate the feature information of the audio to be matched and predict the confidences of the audio to be matched under a plurality of categories.
Optionally, the activation layer in the classification network includes a sigmoid() activation function. Using the sigmoid() activation function makes the classification result determined by the classification network include an independent confidence for each category of the classification mode, which enriches the information carried by the tag information generated from the classification result.
In some embodiments, the classification results determined by any one of the classification networks include: the confidence degrees of the audio to be matched are respectively corresponding to all the categories of the classification modes corresponding to the classification network. For example, 3 categories of a certain classification scheme a are category 1, category 2, and category 3, respectively. The computer equipment uses the classification network 1 to determine the correlation degree 1 of the audio to be matched and the category 1, the correlation degree 2 of the audio to be matched and the category 2, and the correlation degree 3 of the audio to be matched and the category 3, and uses the correlation degree 1, the correlation degree 2 and the correlation degree 3 as classification results corresponding to the audio to be matched in the classification mode A.
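A hedged PyTorch sketch of such a classification network head (pooling followed by a sigmoid-activated layer producing an independent confidence per category); the feature dimension, the number of frames and the three-category scheme are illustrative assumptions, not the networks of the embodiments.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, feature_dim=128, num_categories=3):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_categories)

    def forward(self, features):               # features: (batch, frames, feature_dim)
        pooled = features.mean(dim=1)          # pooling layer: aggregate over frames
        return torch.sigmoid(self.fc(pooled))  # per-category confidences in [0, 1]

features = torch.randn(1, 20, 128)             # feature information of one audio (illustrative)
style_head = ClassificationHead(num_categories=3)   # e.g. scheme A with categories 1, 2, 3
print(style_head(features))                    # e.g. tensor([[0.61, 0.12, 0.78]]) (values vary)
```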
In some embodiments, the computer device uses a plurality of classification networks to respectively determine classification results of the audio to be matched respectively corresponding in different classification modes.
In some embodiments, the computer device determines a classification result corresponding to the audio to be matched using at least one classification network. Different classification networks are used for determining classification results corresponding to the audio to be matched under different classification modes. Optionally, the computer device determines the number of classification networks to be used in the process of generating the label information of the audio to be matched according to the audio matching requirement, and processes the audio to be matched by using the classification networks corresponding to the classification modes.
For example, if the audio matching requirement is high accuracy, more classification networks may be used in determining the tag information of the audio to be matched. For example, if the audio matching requirement is high accuracy, the computer device selects a classification network respectively corresponding to 3 classification modes of audio style, audio emotion and audio rhythm in the process of determining the tag information.
And determining classification results of the audio to be matched, which correspond to the classification modes, by using the classification networks, and determining label information of the audio to be matched according to the classification results. The method is beneficial to improving the description accuracy of the label information to the audio to be matched, and further limiting the number of the candidate audios screened from the audio library according to the label information of the audio to be matched (namely reducing the number of the candidate audios), thereby reducing the calculated amount in the process of finely screening the target candidate audios from at least one candidate audio and being beneficial to accelerating the audio matching speed.
For another example, if the audio matching requirement is high responsiveness, the number of classification networks used in the process of determining the tag information of the audio to be matched can be reduced, improving the speed of audio matching. For example, when the audio matching requirement is high responsiveness, only the classification network corresponding to the audio style is selected in the process of determining the tag information. In this way, the amount of computation consumed in the classification process of the audio to be matched is reduced, and the speed of determining the tag information of the audio to be matched is increased.
In some embodiments, the audio matching requirements may be indicated by the user. For example, if the user indicates to the target application that the target candidate audio and the audio to be matched have a high similarity in a certain classification mode, the computer device determines a classification result corresponding to the audio to be matched in the classification mode according to at least selecting a classification network corresponding to the classification mode indicated by the user.
If the user wants to obtain at least one target candidate audio similar to the audio to be matched under the audio rhythm classification mode, then in the process of determining the tag information of the audio to be matched according to the feature information of the audio to be matched, the computer device at least selects the classification network corresponding to the audio rhythm and uses that classification network to determine a classification result of the audio to be matched.
Optionally, in order to improve accuracy of the target candidate audio, the computer device may further select a classification network corresponding to at least one other classification mode, and determine a classification result of the audio to be matched in the classification mode. And then the computer equipment determines the label information of the audio to be matched according to the classification results corresponding to the audio to be matched under different classification modes.
In this way, in the process of audio matching, different audio matching requirements are customized in a personalized way, and the adaptability of the audio matching method to different audio matching scenes is improved.
In the following, a method for generating a classification result of audio to be matched is described by several embodiments.
In some embodiments, the plurality of different classification networks includes a first classification network and a second classification network, the first classification network corresponds to a classification based on audio style classification, and the second classification network corresponds to a classification based on audio emotion classification; the computer equipment respectively determines the classification result of the audio to be matched according to the characteristic information of the audio to be matched through a plurality of different classification networks, and comprises the following steps: the computer equipment determines a first classification result of the audio to be matched according to the characteristic information of the audio to be matched through a first classification network, wherein the first classification result comprises: confidence degrees of the audio to be matched, which correspond to the audio to be matched under a plurality of categories obtained based on the classification of the audio styles; the computer equipment determines a second classification result of the audio to be matched according to the characteristic information of the audio to be matched through a second classification network, wherein the second classification result comprises: the confidence degrees of the audio to be matched are respectively corresponding to a plurality of categories obtained based on the audio emotion classification.
In some embodiments, the computer device determines a classification result for the audio to be matched using at least two classification networks. Namely, the label information of the audio to be matched is generated according to at least two classification results.
In some embodiments, after obtaining the feature information of the audio to be matched, the computer device inputs the feature information of the audio to be matched into the first classification network and the second classification network respectively, and obtains the first classification result of the audio to be matched and the second classification result of the audio to be matched.
It should be noted that, the present application does not limit the order of occurrence of determining the first classification result using the first classification network and determining the second classification result using the second classification network.
For example, after determining the feature information of the audio to be matched, the computer device first uses the first classification network to determine the first classification result according to the feature information of the audio to be matched, and then uses the second classification network to determine the second classification result according to the feature information of the audio to be matched.
For another example, after determining the feature information of the audio to be matched, the computer device first uses the second classification network to determine the second classification result according to the feature information of the audio to be matched, and then uses the first classification network to determine the first classification result according to the feature information of the audio to be matched.
Because the first classification network and the second classification network are mutually independent, the characteristic information of the audio to be matched can be simultaneously input into the first classification network and the second classification network respectively, and the first classification result output by the first classification network and the second classification result output by the second classification network are obtained respectively.
The confidence degrees respectively corresponding to the audio to be matched under the multiple categories obtained based on the audio style classification and the confidence degrees respectively corresponding to the audio to be matched under the multiple categories obtained based on the audio emotion classification are determined, so that the description capability of the correlation degree between the audio to be matched and the multiple categories is improved.
In some embodiments, the computer device determines tag information of the audio to be matched according to classification results respectively determined by a plurality of different classification networks, including: for each classification network, the computer equipment selects at least one confidence coefficient meeting the result screening condition from the classification results determined by the classification network according to the result screening condition corresponding to the classification network, and obtains a screened classification result corresponding to the classification network; and integrating the filtered classification results corresponding to the classification networks by the computer equipment to obtain the label information of the audio to be matched.
The result screening condition is used to indicate that at least one confidence level is screened from the classification result.
In some embodiments, the confidence between the audio to be matched and a certain category is used to characterize the degree of correlation between the audio to be matched and that category. Optionally, the confidence is positively correlated with the degree of correlation: if the degree of correlation between the audio to be matched and the category is greater, the value of the corresponding confidence is greater; if the degree of correlation between the audio to be matched and the category is smaller, the value of the corresponding confidence is smaller.
In some embodiments, the result screening condition is used for indicating the computer device to select the first i confidence degrees with the largest value from the plurality of confidence degrees included in the classification result, so as to obtain a screened classification result, wherein i is a positive integer; the classified result after screening comprises the first i confidence degrees with the maximum confidence coefficient value.
For example, i is equal to 5 and a certain classification mode includes 20 categories, so the classification result of the audio to be matched under this classification mode includes the confidences of the audio to be matched corresponding respectively to the 20 categories. The computer device selects the 5 confidences with the largest values from the 20 confidences and obtains the screened classification result according to these 5 confidences.
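The result screening condition can be sketched as follows; the number of categories and the value of i follow the example above, and the confidences are randomly generated for illustration only.

```python
import numpy as np

def screen_result(confidences, i=5):
    """Return (category_index, confidence) pairs for the i largest confidences."""
    top = np.argsort(confidences)[::-1][:i]   # indices of the top-i confidences
    return [(int(idx), float(confidences[idx])) for idx in top]

result = np.random.rand(20)          # classification result over 20 categories (illustrative)
print(screen_result(result, i=5))    # the 5 categories most related to the audio
```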
The specific conditions of the result screening conditions are determined according to actual needs, and the present application is not limited thereto.
In this way, the screened classification result includes the confidences corresponding to at least one category strongly correlated with the audio to be matched, and the confidences corresponding to other categories with low correlation to the audio to be matched are removed. Generating the tag information of the audio to be matched according to the screened classification result helps to reduce the amount of invalid information carried in the tag information of the audio to be matched. On the one hand, this helps to reduce the amount of computation in the process of selecting at least one candidate audio according to the tag information of the audio to be matched and the tag information of each audio in the audio library. On the other hand, because the degrees of correlation between different audio and the same category are not completely the same, the screened classification result includes the confidences corresponding to the categories strongly correlated with the audio to be matched, which improves the degree of correlation between the audio to be matched and the candidate audio selected from the audio library according to the tag information of the audio to be matched.
In some embodiments, the method for determining the tag information corresponding to each audio in the audio library is similar to the method for determining the tag information corresponding to the audio to be matched.
In some embodiments, in order to shorten the time consumption of performing audio matching on the audio to be matched, the computer device determines in advance a classification result corresponding to each audio in the audio library in different classification modes, so as to quickly determine tag information of each audio in the audio library according to the classification result. Optionally, when storing the audio in the audio library, the computer device determines a classification result and tag information of the audio corresponding to the audio in at least one classification mode, respectively.
For any usable audio in the audio library, the computer equipment determines the characteristic information of the usable audio and determines the classification result of the usable audio corresponding to a plurality of classification modes respectively according to the characteristic information of the usable audio. Optionally, the computer device uses a plurality of different classification models to respectively determine classification results of the usable audio respectively corresponding under different classification modes.
In some embodiments, the computer device may determine the tag information of the usable audio based on the classification results of the usable audio corresponding respectively to the different classification modes. As described above for the audio to be matched, the tag information can be determined in at least one way. For example, the classification results corresponding to the audio under certain classification modes (such as audio emotion and audio style) are concatenated into the tag information. For another example, the classification results of the audio are processed through the result screening condition to obtain screened classification results, and the tag information is generated according to the screened classification results.
In some embodiments, the computer device determines the tag information corresponding to each audio in the audio library using the same method as that for determining the tag information of the audio to be matched.
Optionally, in order to increase the speed of determining the tag information of each audio in the audio library, the computer device may store the feature information of at least one audio in the audio library (i.e., the usable audio mentioned in the previous embodiments) and the classification results of the at least one audio corresponding respectively to the different classification modes. Then, the computer device determines the tag information corresponding to each audio in the audio library according to the method for the tag information of the audio to be matched.
Optionally, in the case that the classification result needs to be screened by the result screening condition to obtain the filtered classification result, the computer device first determines at least one filtered classification result of the audio to be matched according to the result screening condition. The computer device then determines at least one target category under a certain classification mode according to the filtered classification result of the audio to be matched. For a usable audio, the computer device selects the confidence degrees corresponding to the target categories from the classification result of the usable audio under the same classification mode, and generates a filtered classification result of the usable audio from these confidence degrees. The computer device uses the at least one filtered classification result of the usable audio to determine the tag information of the usable audio.
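As an illustration of the screening logic described above, the following minimal Python sketch assumes that a classification result is a mapping from category names to confidence degrees and that the result screening condition is a simple confidence threshold; both assumptions are illustrative only and not prescribed by the embodiments.

```python
# Illustrative sketch only: classification results as dicts of category -> confidence,
# and a confidence threshold as the assumed result screening condition.

def screen_result(classification_result: dict, threshold: float = 0.5) -> dict:
    """Keep only the categories whose confidence satisfies the screening condition."""
    return {c: conf for c, conf in classification_result.items() if conf >= threshold}

def screen_library_result(library_result: dict, target_categories: list) -> dict:
    """For a usable audio in the library, keep the confidences of the target
    categories determined from the filtered result of the audio to be matched."""
    return {c: library_result.get(c, 0.0) for c in target_categories}

# Example: screen the style classification result of the audio to be matched,
# then align a library audio's result to the same target categories.
matched_style = {"rock": 0.81, "jazz": 0.07, "electronic": 0.64, "folk": 0.12}
filtered = screen_result(matched_style)                          # {'rock': 0.81, 'electronic': 0.64}
library_style = {"rock": 0.15, "jazz": 0.70, "electronic": 0.55, "folk": 0.02}
aligned = screen_library_result(library_style, list(filtered))   # {'rock': 0.15, 'electronic': 0.55}
print(filtered, aligned)
```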
Thereafter, the computer device calculates the correlation degree between the tag information of the audio to be matched and the tag information of each audio in the audio library (e.g. calculates the spatial similarity between the tag information), and selects at least one candidate audio from the audio library. For details of this process, refer to the above embodiments, which are not repeated here.
In some embodiments, the computer device obtains the feature information of the audio to be matched as follows: the computer device obtains a multiband semantic feature sequence of the audio to be matched, where the multiband semantic feature sequence includes the semantic features corresponding to a plurality of audio frames (Chunks) obtained by framing the audio to be matched; and the computer device generates the feature information of the audio to be matched according to the multiband semantic feature sequence.
In some embodiments, in determining the feature information of the audio to be matched, the computer device needs to perform framing processing on the audio to be matched to obtain at least one audio frame.
Macroscopically, the audio to be matched is a non-stationary signal, while microscopically, short pieces of the audio to be matched are stationary. That is, the audio to be matched has short-time stationarity (typically the audio signal is approximately constant, or varies slowly, within 10-30 ms). Based on this characteristic, framing processing can be performed on the audio signal to be matched to obtain a plurality of audio segments with a short duration. Optionally, any one of these audio segments may be referred to as an audio frame.
In some embodiments, the computer device frames the audio to be matched according to a fixed frame length to obtain at least one audio frame.
Optionally, the computer device selects different framing methods for the audio to be matched according to whether the audio frames will be windowed. Windowing refers to processing an audio frame using a window function. Because windowing weakens (distorts) the signals at both edges of an audio frame, if the audio frames need to be windowed after framing, adjacent audio frames should partially overlap, so that the signal weakened by windowing in one audio frame is less affected by the windowing of the other audio frames, and the loss of audio signal is reduced.
For example, for two audio frames adjacent in time sequence, audio frame 1 and audio frame 2, audio frame 1 corresponds to a playing period of 60-90ms in the audio to be matched, and audio frame 2 corresponds to a playing period of 85-115ms in the audio to be matched.
If no windowing of the audio frames is required, there is no overlapping audio signal between two adjacent audio frames after framing. For example, for two audio frames adjacent in time sequence, audio frame 3 and audio frame 4, audio frame 3 corresponds to a play period of 60-90ms in the audio to be matched, and audio frame 4 corresponds to a play period of 90-120ms in the audio to be matched.
The frame length used for framing is set according to actual needs, which is not limited in the present application.
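A minimal framing sketch in Python is given below; the frame length, hop length and sampling rate are illustrative assumptions, not values prescribed by the embodiments.

```python
import numpy as np

def frame_audio(signal: np.ndarray, sample_rate: int,
                frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split an audio signal into fixed-length frames.

    If hop_ms < frame_ms, adjacent frames overlap (useful when the frames will be
    windowed afterwards); setting hop_ms equal to frame_ms gives non-overlapping frames.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len: i * hop_len + frame_len] for i in range(n_frames)])

# Example: 25 ms frames with a 10 ms hop (overlapping) on 1 second of toy audio at 16 kHz.
audio = np.random.randn(16000)
frames = frame_audio(audio, 16000)
print(frames.shape)   # (98, 400)
```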
The semantic features of an audio frame are used to characterize the feature information carried in the audio frame.
In some embodiments, the computer device sequentially determines semantic features of each audio frame, and arranges the semantic features of the audio frames according to a playing period corresponding to each audio frame in the audio to be matched, to obtain a multiband semantic feature sequence of the audio to be matched. In some embodiments, the multi-band semantic feature sequence is used to characterize semantic features of the complete audio to be matched.
For a specific method of determining semantic features of an audio frame, please refer to the following embodiments.
The playing durations of different audios to be matched are not necessarily the same. By framing the audio to be matched with a fixed frame length, a plurality of audio frames with the same time span are obtained, and the multiband semantic feature sequence is determined from the semantic features of these audio frames. This unifies the processing of audio to be matched with different playing durations, and improves the adaptability of the audio matching method to audio to be matched with different playing durations.
In some embodiments, the computer device obtains the multiband semantic feature sequence of the audio to be matched as follows: the computer device extracts time domain feature information and frequency domain feature information of the audio to be matched, where the time domain feature information is used to characterize the features of the audio to be matched in the time domain dimension, and the frequency domain feature information is used to characterize the features of the audio to be matched in the frequency domain dimension; the computer device performs fusion processing on at least one intermediate time domain feature produced in the time domain feature extraction process and at least one intermediate frequency domain feature produced in the frequency domain feature extraction process to obtain interaction feature information of the audio to be matched, where the interaction feature information is used to characterize the interaction features of the audio to be matched between the time domain dimension and the frequency domain dimension; and the computer device obtains the multiband semantic feature sequence of the audio to be matched according to the time domain feature information, the frequency domain feature information and the interaction feature information.
In some embodiments, the computer device processes the audio to be matched in the time domain dimension to obtain the time domain feature information, and processes the audio to be matched in the frequency domain dimension to obtain the frequency domain feature information. In order to make the time domain and frequency domain descriptions of the audio to be matched complement each other, the computer device also generates the interaction feature information from the intermediate time domain features and the intermediate frequency domain features.
Optionally, the time domain feature information, the frequency domain feature information and the interaction feature information of the audio to be matched are all determined in units of audio frames. After determining a plurality of audio frames of the audio to be matched, the computer device respectively determines the time domain feature information, frequency domain feature information and interaction feature information corresponding to each audio frame, generates the semantic feature of the audio frame from these three kinds of feature information, and obtains the multiband semantic feature sequence of the audio to be matched by splicing the semantic features of the audio frames according to the audio playing order.
Optionally, the time domain feature information is used to characterize how the audio loudness varies with time, and the frequency domain feature information is used to characterize how the amplitude varies with frequency. Either one can therefore represent the audio to be matched from only a single dimension. By determining the interaction feature information, the time domain feature information and the frequency domain feature information complement each other, which improves the ability of the multiband semantic feature sequence to express the audio to be matched.
In some embodiments, the multi-band semantic feature sequences are obtained through a feature sequence extraction network comprising: a time domain processing branch, a frequency domain processing branch and a characteristic interaction branch.
In some embodiments, the time domain processing branch includes a plurality of time domain convolution layers that are connected in sequence: the first time domain convolution layer processes each audio frame of the audio to be matched to obtain an intermediate time domain feature, and each higher time domain convolution layer receives the intermediate time domain feature output by the adjacent lower time domain convolution layer. Optionally, the high-level time domain feature output by the last time domain convolution layer of the time domain processing branch is used as the time domain feature information.
The frequency domain processing branch includes a plurality of frequency domain convolution layers. The frequency domain convolution layers are connected in sequence and convolve the audio to be matched multiple times to obtain high-level frequency domain features, and the high-level frequency domain feature output by the last frequency domain convolution layer of the frequency domain processing branch is used as the frequency domain feature information.
Optionally, the intermediate time domain feature refers to a time domain feature output by any one of the time domain convolutional layers except the last time domain convolutional layer in the time domain processing branch. The intermediate frequency domain features refer to the frequency domain features output by any frequency domain convolution layer except the last frequency domain convolution layer in the frequency domain processing branch.
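The following PyTorch sketch illustrates one possible structure of the two branches, each keeping its intermediate features for the feature interaction branch; the layer counts, kernel sizes and channel numbers are assumptions for illustration only, not the structure prescribed by the embodiments.

```python
import torch
import torch.nn as nn

class TimeDomainBranch(nn.Module):
    """Stacked 1-D convolutions over the raw waveform of one audio frame;
    intermediate features are kept for the feature interaction branch."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv1d(1, 16, 11, padding=5), nn.ReLU(), nn.MaxPool1d(4)),
            nn.Sequential(nn.Conv1d(16, 32, 11, padding=5), nn.ReLU(), nn.MaxPool1d(4)),
            nn.Sequential(nn.Conv1d(32, 64, 11, padding=5), nn.ReLU(), nn.MaxPool1d(4)),
        ])

    def forward(self, waveform):                 # waveform: (batch, 1, samples)
        intermediates, x = [], waveform
        for layer in self.layers:
            x = layer(x)
            intermediates.append(x)
        return x, intermediates[:-1]             # last output = time domain feature information

class FrequencyDomainBranch(nn.Module):
    """Stacked 2-D convolutions over the spectrum (e.g. log-mel) of one audio frame."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
            nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
            nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
        ])

    def forward(self, spectrum):                 # spectrum: (batch, 1, mel bins, time steps)
        intermediates, x = [], spectrum
        for layer in self.layers:
            x = layer(x)
            intermediates.append(x)
        return x, intermediates[:-1]             # last output = frequency domain feature information

wave = torch.randn(2, 1, 16000)                  # two toy audio frames in the time domain
spec = torch.randn(2, 1, 64, 128)                # their spectrum information
t_out, t_mid = TimeDomainBranch()(wave)
f_out, f_mid = FrequencyDomainBranch()(spec)
print(t_out.shape, f_out.shape, len(t_mid), len(f_mid))
```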
The feature interaction branch is used to interact at least one intermediate time domain feature with at least one intermediate frequency domain feature to obtain the interaction feature information. Optionally, the feature interaction branch includes a plurality of feature interaction layers; any one of the feature interaction layers is used to receive one intermediate time domain feature and one intermediate frequency domain feature, and perform fusion processing on them to obtain an intermediate interaction feature output by that feature interaction layer.
In this method, not only are the time domain feature information and the frequency domain feature information of the audio to be matched determined, but at least one intermediate time domain feature and at least one intermediate frequency domain feature are also interacted to obtain the interaction feature information. On the one hand, determining the interaction feature information establishes the relationship between the time domain and frequency domain features of the audio to be matched, so that the frequency domain information and the time domain information complement each other.
On the other hand, the interaction feature information is generated from at least one intermediate time domain feature and at least one intermediate frequency domain feature, so that the low-level features produced by the time domain processing branch and the frequency domain processing branch are retained in the interaction feature information. Generating the interaction feature information helps the high-level layers of the feature sequence extraction network perceive the features extracted by the low-level layers.
The following describes a method for generating interactive feature information through several embodiments.
In some embodiments, the at least one intermediate time domain feature and the at least one intermediate frequency domain feature form at least one feature group, and each feature group includes a corresponding pair of an intermediate time domain feature and an intermediate frequency domain feature. The computer device performs fusion processing on the at least one intermediate time domain feature in the time domain feature extraction process and the at least one intermediate frequency domain feature in the frequency domain feature extraction process to obtain the interaction feature information of the audio to be matched as follows: for each feature group, the computer device splices the corresponding intermediate time domain feature and intermediate frequency domain feature contained in the feature group to obtain the spliced feature corresponding to the feature group; and the computer device performs feature extraction processing on the spliced features corresponding to the feature groups in cascade order to obtain the interaction feature information of the audio to be matched.
In some embodiments, the intermediate time domain feature and the intermediate frequency domain feature included in the same feature group have the same degree of convolution. For example, for a certain feature group, the intermediate time domain feature included in the feature group is output by the second time domain convolution layer in the time domain processing branch, and the intermediate frequency domain feature included in the feature group is output by the second frequency domain convolution layer in the frequency domain processing branch.
In some embodiments, the intermediate time domain features and intermediate frequency domain features included in different feature groups are not the same. That is, any two feature groups differ in the degree of convolution of the intermediate time domain feature and the intermediate frequency domain feature they include.
In some embodiments, for each feature group, the computer device splices the corresponding intermediate time domain feature and intermediate frequency domain feature contained in the feature group as follows: the computer device performs vector deformation (for example, by a reshape function) on the intermediate time domain feature to obtain the deformed intermediate time domain feature; and the computer device performs fusion processing using the intermediate frequency domain feature and the deformed intermediate time domain feature to obtain the spliced feature corresponding to the feature group.
In some embodiments, the dimensions of the deformed intermediate time-domain feature are the same as the dimensions of the intermediate frequency-domain feature.
In some embodiments, the computer device performs fusion processing using the intermediate frequency domain feature and the deformed intermediate time domain feature to obtain the spliced feature corresponding to the feature group as follows: the computer device splices the deformed intermediate time domain feature and the intermediate frequency domain feature (for example, by a concat function) to obtain the spliced feature corresponding to the feature group; and the computer device performs feature extraction processing on the spliced features corresponding to the feature groups in cascade order to obtain the interaction feature information of the audio to be matched.
In some embodiments, the computer device obtains the interactive feature information of the audio to be matched by performing at least one convolution process on the spliced features corresponding to the feature groups according to the cascade order.
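A minimal sketch of one such interaction step is given below, assuming illustrative tensor shapes: it reshapes an intermediate time domain feature, concatenates it with an intermediate frequency domain feature along the channel axis, and applies a convolution to the spliced feature.

```python
import torch
import torch.nn as nn

# Illustrative shapes only; the actual channel and spatial dimensions depend on the network.
time_feat = torch.randn(8, 32, 4096)      # (batch, channels, time steps) from a time domain conv layer
freq_feat = torch.randn(8, 32, 32, 128)   # (batch, channels, mel bins, time steps) from a frequency domain conv layer

# Vector deformation (reshape) of the intermediate time domain feature so its
# dimensions match the intermediate frequency domain feature.
time_feat_2d = time_feat.reshape(8, 32, 32, 128)

# Splicing (concat) along the channel dimension, then feature extraction by convolution.
spliced = torch.cat([time_feat_2d, freq_feat], dim=1)          # (8, 64, 32, 128)
interaction_conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
intermediate_interaction = interaction_conv(spliced)
print(intermediate_interaction.shape)                          # torch.Size([8, 64, 32, 128])
```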
In this way, the intermediate time domain features and the intermediate frequency domain features are used to make the frequency domain features and the time domain features of the audio to be matched complement each other, so that the generated multiband semantic feature sequence describes the semantic information of the audio to be matched more accurately. This improves the ability of the feature information determined from the multiband semantic feature sequence to express the audio to be matched, and further improves the accuracy of audio matching.
In some embodiments, the computer device extracts the time domain feature information and the frequency domain feature information of the audio to be matched as follows: the computer device processes the audio to be matched through a time domain feature extraction network to obtain the time domain feature information of the audio to be matched; the computer device performs time-frequency transformation on the audio to be matched to obtain the spectrum information of the audio to be matched; and the computer device processes the spectrum information through a frequency domain feature extraction network to obtain the frequency domain feature information of the audio to be matched.
In some embodiments, the computer device sequentially inputs the plurality of audio frames of the audio to be matched into the feature sequence extraction network in playing order, determines the semantic features corresponding to the audio frames, and generates the multiband semantic feature sequence of the audio to be matched from the semantic features corresponding to the audio frames.
Optionally, each time domain convolution layer includes at least one convolution unit and at least one pooling unit. For any time domain convolution layer in the time domain processing branch, the convolution unit performs convolution processing on its input to obtain a first time domain feature, and the pooling unit performs pooling processing on the first time domain feature to obtain a pooled first time domain feature, which is transmitted to the adjacent next time domain convolution layer. The pooled time domain feature output by the last time domain convolution layer in the time domain processing branch is the time domain feature information of the audio frame.
In some embodiments, before inputting the audio to be matched into the frequency domain processing branch, time-frequency transformation needs to be performed on the audio to be matched to obtain frequency spectrum information of the audio to be matched, wherein the frequency spectrum information is used for representing the relation between the frequency and the amplitude of the audio to be matched.
In some embodiments, the computer device performs a time-frequency transform on each audio frame to obtain spectral information of the audio frame. And processing the frequency spectrum information of the audio frame through a frequency domain processing branch to determine the frequency domain characteristic information of the audio frame.
Optionally, the method of time-frequency conversion includes, but is not limited to, at least one of: fourier transform, short-time fourier transform, wavelet transform, etc.
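The following sketch computes spectrum information for a single audio frame with a short-time Fourier transform followed by a log-mel projection; the use of librosa and the chosen transform parameters are illustrative assumptions rather than part of the embodiments.

```python
import numpy as np
import librosa

sr = 16000
audio_frame = np.random.randn(sr // 2)                        # a toy 0.5 s audio frame

# Short-time Fourier transform -> magnitude spectrum.
stft = np.abs(librosa.stft(audio_frame, n_fft=512, hop_length=160))

# Mel filter bank + log compression -> LogMel spectrum (frequency vs. time).
mel = librosa.feature.melspectrogram(S=stft**2, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)                                          # (64, number of STFT frames)
```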
Optionally, each frequency domain convolution layer includes at least one convolution unit and at least one pooling unit. For any frequency domain convolution layer in the frequency domain processing branch, the convolution unit performs convolution processing on the spectrum information of the audio frame or on the frequency domain feature output by the previous convolution layer to obtain a first frequency domain feature, and the pooling unit performs pooling processing on the first frequency domain feature to obtain a pooled first frequency domain feature, which is transmitted to the adjacent next frequency domain convolution layer. The pooled frequency domain feature generated by the last frequency domain convolution layer in the frequency domain processing branch is the frequency domain feature information of the audio frame.
By determining the frequency domain feature information and the time domain feature information of the audio to be matched, the features of the audio frames are extracted from different dimensions, which improves the accuracy of audio matching.
In some embodiments, the computer device obtains a multiband semantic feature sequence of the audio to be matched according to the time domain feature information, the frequency domain feature information and the interaction feature information, including: the computer equipment splices the time domain feature information, the frequency domain feature information and the interaction feature information to obtain fusion feature information; the computer equipment adopts a plurality of different pooling modes to pool the fusion characteristic information to obtain a plurality of pooling results; the computer device generates a multi-band semantic feature sequence of the audio to be matched according to the plurality of pooling results.
In some embodiments, for any one audio frame, the computer device splices the time domain feature information of the audio frame, the frequency domain feature information of the audio frame, and the interaction feature information of the audio frame to obtain fusion feature information of the audio frame.
In some embodiments, vector deformation of the time domain feature information is required before splicing, so that the deformed time domain feature information has the same dimensions as the frequency domain feature information (or the interaction feature information); the computer device then splices the frequency domain feature information, the interaction feature information, and the deformed time domain feature information.
For example, the computer device uses a concat () function to splice the time domain feature information, the frequency domain feature information of the audio frame, and the interactive feature information of the audio frame in a splicing order, so as to obtain the fusion feature information of the audio frame. The splicing order is set according to actual needs, and is not limited herein.
The computer device pools the fusion feature information in a plurality of different pooling modes to obtain a plurality of pooling results, and generates the multiband semantic feature sequence of the audio to be matched according to the plurality of pooling results.
The pooling process includes, but is not limited to, at least one of: average pooling and maximum pooling.
For example, the computer device splices the time domain feature information, the frequency domain feature information and the interaction feature information of a certain audio frame to obtain the fusion feature information of the audio frame, and performs convolution processing on the fusion feature information to obtain the convolved fusion feature information. The computer device respectively performs average (mean) pooling and maximum (max) pooling on the convolved fusion feature information to obtain two pooling results, adds the two pooling results, and then activates the sum to obtain the activated feature information. The activated feature information is vectorized and classified to obtain the semantic feature corresponding to the audio frame. The vectorization processing converts the activated feature information into feature information in vector form, and the classification performs prediction on the activated feature information.
The computer device splices the semantic features corresponding to the audio frames according to the playing order of the audio to be matched to obtain the multiband semantic feature sequence of the audio to be matched.
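A minimal PyTorch sketch of turning the fused features of one audio frame into its semantic feature is given below; the tensor shapes, layer sizes and projection dimension are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed shapes: concat of time domain, frequency domain and interaction features of one frame batch.
fused = torch.randn(8, 192, 32, 128)

conv = nn.Conv2d(192, 256, kernel_size=3, padding=1)
proj = nn.Linear(256, 512)                    # vectorization / classification layer producing the semantic feature

x = conv(fused)                               # convolved fusion feature information, (8, 256, 32, 128)
x = x.mean(dim=2)                             # collapse the frequency axis -> (8, 256, 128)
mean_pool = x.mean(dim=-1)                    # average pooling over time -> (8, 256)
max_pool = x.max(dim=-1).values               # maximum pooling over time -> (8, 256)
x = F.relu(mean_pool + max_pool)              # add the two pooling results, then activate
semantic_feature = torch.softmax(proj(x), dim=-1)   # (8, 512) semantic feature of the audio frame
print(semantic_feature.shape)
```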
In some embodiments, the computer device generates the feature information of the audio to be matched from the multiband semantic feature sequence as follows: the computer device performs context semantic association processing on the multiband semantic feature sequence in the forward time direction to obtain a forward embedded vector, where the forward embedded vector is used to characterize the semantic features of the audio to be matched in the playing direction; the computer device performs context semantic association processing on the multiband semantic feature sequence in the reverse time direction to obtain a reverse embedded vector, where the reverse embedded vector is used to characterize the semantic features of the audio to be matched in the reverse playing direction; and the computer device splices the forward embedded vector and the reverse embedded vector to obtain the feature information of the audio to be matched.
In some embodiments, the computer device generates the feature information of the audio to be matched from the multiband semantic feature sequence through an embedding generation network.
In some embodiments, the embedding generation network is a temporal network. Optionally, the network structure of the embedding generation network includes, but is not limited to, at least one of: Bidirectional Long Short-Term Memory (BLSTM), Bidirectional Transformers, and the like.
In some embodiments, the embedding generation network includes a forward processing branch and a reverse processing branch; the forward processing branch is used to perform context semantic association processing on the multiband semantic feature sequence in the forward time direction to obtain the forward embedded vector, and the reverse processing branch is used to perform context semantic association processing on the multiband semantic feature sequence in the reverse time direction to obtain the reverse embedded vector.
Optionally, in performing the context semantic association processing, the semantic feature of each audio frame in the multiband semantic feature sequence is used as one context processing unit. That is, the context semantic association processing determines the relationship between the semantic features of the individual audio frames.
In some embodiments, the computer device splices the forward embedded vector and the reverse embedded vector to obtain the feature information of the audio to be matched. That is, the feature information is composed of the forward embedded vector and the reverse embedded vector.
In the feature information, the forward embedded vector may be arranged before the reverse embedded vector, or after it; the splicing order of the forward embedded vector and the reverse embedded vector is not limited in this application.
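The following PyTorch sketch illustrates how a bidirectional LSTM can produce the forward and reverse embedded vectors and splice them into fixed-size feature information; the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

feature_dim, hidden_dim = 512, 256
blstm = nn.LSTM(input_size=feature_dim, hidden_size=hidden_dim,
                batch_first=True, bidirectional=True)

sequence = torch.randn(1, 98, feature_dim)        # (batch, audio frames, semantic feature dim)
_, (h_n, _) = blstm(sequence)                      # h_n: (2, batch, hidden_dim)

forward_embedding = h_n[0]                         # context in the playing direction
reverse_embedding = h_n[1]                         # context in the reverse playing direction
feature_info = torch.cat([forward_embedding, reverse_embedding], dim=-1)   # (1, 512), fixed size
print(feature_info.shape)
```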
Optionally, a method for determining feature information of each audio in the audio library is similar to a method for determining feature information of an audio to be matched, which is not described herein in detail.
Because the playing durations of different audios are not necessarily the same, after audios with different playing durations are framed with the same frame length, the numbers of audio frames obtained may differ, so the data amounts of the multiband semantic feature sequences of different audios are not the same. By performing context association processing on the multiband semantic feature sequence, sequences of different lengths are mapped to feature information of the same data size, so that the similarity between audios can be calculated from their feature information.
The audio matching method provided by this solution is described below through a complete embodiment.
FIG. 3 is a schematic diagram of a multi-band semantic feature vector sequence generation process provided by an exemplary embodiment of the present application.
After obtaining the audio to be matched, the computer device frames the audio to be matched to obtain a plurality of audio frames, and inputs the audio frames into the feature sequence extraction network one by one in playing order.
The computer device generates the time domain feature information of the audio frame through the time domain processing branch in the feature sequence extraction network. Optionally, the time domain processing branch uses a plurality of one-dimensional convolution layers (Conv1D Blocks).
Using multiple one-dimensional convolution layers makes it possible to learn the time domain characteristics of the audio signal directly, including the relationship between audio loudness and amplitude. Optionally, a one-dimensional convolution layer in the time domain processing branch is the convolution unit in the time domain convolution layer mentioned above.
Optionally, a one-dimensional pooling layer (e.g. MaxPooling1D, s=4) is disposed between two adjacent one-dimensional convolution layers, and is used for pooling the convolved intermediate time domain features.
In some embodiments, the computer device reshapes the feature information output by the time domain processing branch to obtain the time domain feature information of the audio frame. Optionally, the time domain feature information is referred to as a two-dimensional map (wavegram).
The computer device determines the frequency domain feature information of the audio frame through the frequency domain processing branch in the feature sequence extraction network. Before this, the spectrum information of the audio frame needs to be determined. For example, the LogMel spectrum of the audio frame is determined using the Mel frequency, and the LogMel spectrum is used as the spectrum information of the audio frame.
The spectrum information of the audio frame is processed through the frequency domain processing branch in the feature sequence extraction network to obtain the frequency domain feature information of the audio frame. Optionally, the frequency domain feature information is referred to as feature maps (Feature Maps).
In some embodiments, the frequency domain processing branch includes a plurality of two-dimensional convolution layers (Conv 2D blocks), and frequency domain feature information of the audio frame is obtained by performing convolution processing on spectrum information of the audio frame. Alternatively, the two-dimensional convolution layer in the frequency domain processing branch is the convolution unit in the frequency domain convolution layer mentioned above.
Optionally, a two-dimensional pooling layer (e.g. MaxPooling2D, s=4) is disposed between two adjacent two-dimensional convolution layers, and is used for pooling the convolved intermediate frequency domain features.
Optionally, the frequency domain feature information of the audio frame has the same dimensions as the time domain feature information of the audio frame obtained after the reshaping described above.
As shown in FIG. 3, a feature interaction branch is provided between the time domain processing branch and the frequency domain processing branch. The feature interaction branch includes at least one feature interaction (Concat) layer.
Any one of the feature interaction layers corresponds to one intermediate time domain feature in the time domain processing branch and one intermediate frequency domain feature in the frequency domain processing branch, so the feature interaction branch can fuse at least one intermediate time domain feature and at least one intermediate frequency domain feature to obtain the interaction feature information.
Vector deformation (reshape) needs to be performed on the intermediate time domain feature before the intermediate time domain feature and the intermediate frequency domain feature are spliced, so that the deformed intermediate time domain feature has the same dimensions as the intermediate frequency domain feature.
Optionally, a two-dimensional convolution layer (Conv 2D Block) is included between two adjacent feature interaction layers, for performing convolution processing on the spliced features obtained by the feature interaction layers.
The time domain feature information, the frequency domain feature information and the interaction feature information of the audio frame are integrated through the feature integration layer of the feature sequence extraction network to obtain the semantic feature of the audio frame.
In some embodiments, the feature integration layer generates the fusion feature information of the audio frame (a set of two-dimensional frequency domain feature maps) from the time domain feature information, the frequency domain feature information and the interaction feature information of the audio frame. The feature integration layer inputs the two-dimensional frequency domain feature maps into a two-dimensional convolution layer to obtain a one-dimensional feature vector, and then performs average (mean) pooling and maximum (max) pooling on the one-dimensional feature vector respectively. The average pooling result and the maximum pooling result are added (sum) to obtain the pooled feature. The pooled feature is activated using a linear rectification (Rectified Linear Unit, ReLU) layer. The activated feature then passes through a vector processing layer to obtain the vectorized feature, and the vectorized feature is classified by a classification layer (for example, using a softmax function) to obtain the semantic feature of the audio frame.
The computer device splices the semantic features of the audio frames in playing order to obtain the multiband semantic feature sequence of the audio to be matched.
It should be noted that the structures of the time domain processing branch, the frequency domain processing branch and the feature interaction branch of the feature sequence extraction network in FIG. 3 are merely examples. The number of functional layers, such as convolution layers, in the feature sequence extraction network is determined according to actual needs, and is not limited herein.
Fig. 4 is a schematic diagram of a method for generating tag information according to an exemplary embodiment of the present application.
In some embodiments, the computer device processes the multiband semantic feature sequence of the audio to be matched using a bidirectional long short-term memory network to obtain a forward embedded vector and a reverse embedded vector, and obtains the feature information of the audio to be matched by splicing the forward embedded vector and the reverse embedded vector.
Then, the computer device inputs the feature information of the audio to be matched into a first classification network and a second classification network respectively, and determines a first classification result and a second classification result, where the first classification result is used to characterize the confidence degrees of the audio to be matched under a plurality of categories obtained based on audio style classification, and the second classification result is used to characterize the confidence degrees of the audio to be matched under a plurality of categories obtained based on audio emotion classification.
Optionally, the first classification network and the second classification network are both sigmoid classifiers. Using sigmoid classifiers, the confidence degrees of the audio to be matched under each audio style and each audio emotion can be obtained, so that the finally obtained tag information of the audio to be matched includes confidence degrees under a plurality of categories, which improves the accuracy of audio classification.
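A minimal sketch of two such sigmoid classification heads is given below; the feature dimension and the numbers of style and emotion categories are assumptions for illustration.

```python
import torch
import torch.nn as nn

feature_dim, n_styles, n_emotions = 512, 10, 6
style_head = nn.Sequential(nn.Linear(feature_dim, n_styles), nn.Sigmoid())      # first classification network
emotion_head = nn.Sequential(nn.Linear(feature_dim, n_emotions), nn.Sigmoid())  # second classification network

feature_info = torch.randn(1, feature_dim)          # feature information of the audio to be matched
style_confidences = style_head(feature_info)         # confidence per audio style category
emotion_confidences = emotion_head(feature_info)     # confidence per audio emotion category
print(style_confidences.shape, emotion_confidences.shape)   # torch.Size([1, 10]) torch.Size([1, 6])
```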
Fig. 5 is a flowchart of an audio matching method provided in an exemplary embodiment of the present application.
Step 510: the computer device determines the tag information of the audio to be matched, and performs similarity calculation between the tag information of the audio to be matched and the tag information of each audio contained in the audio library. The computer device selects at least one candidate audio from the audio library according to the correlation degree between the tag information of each audio and the tag information of the audio to be matched, completing the primary screening.
Step 530: the computer device determines the target candidate audio that matches the audio to be matched from the at least one candidate audio. In this process, the computer device performs correlation calculation between the feature information of the audio to be matched and the feature information of each of the at least one candidate audio. The computer device selects, from the at least one candidate audio, the candidate audio whose feature information has the highest correlation degree with the feature information of the audio to be matched as the target candidate audio, and the audio matching process ends.
In some embodiments, the characteristic information of the audio is represented in the form of a vector, i.e. the characteristic information of the audio is characterized by a characteristic vector. The degree of correlation (also referred to as similarity) between feature information can be measured by the spatial similarity (e.g., cosine distance) between vectors.
Fig. 6 is a schematic diagram of spatial similarity calculation provided in an exemplary embodiment of the present application.
In FIG. 6, the similarity of vector A and vector B in the vector space is measured by calculating the cosine cos θ of the angle between vector A and vector B.
The calculation formula of the cosine of the included angle is as follows:

$$T(x, y) = \cos\theta = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$

where T(x, y) denotes the cosine of the angle between vector x and vector y, x·y denotes the inner product of vector x and vector y, ‖x‖‖y‖ denotes the product of the moduli of vector x and vector y, x_i denotes the i-th element in vector x, y_i denotes the i-th element in vector y, i is a positive integer with i ∈ [1, n], and n denotes the number of elements contained in vector x (or vector y).
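The following NumPy sketch applies this cosine similarity to both matching stages; the tag vectors, feature vectors and the screening threshold are illustrative assumptions.

```python
import numpy as np

def cosine(x: np.ndarray, y: np.ndarray) -> float:
    """T(x, y): cosine of the angle between vector x and vector y."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Primary screening: keep library audios whose tag similarity is high enough.
tag_query = np.array([0.81, 0.64, 0.05, 0.90])
tag_library = {"audio_a": np.array([0.70, 0.55, 0.10, 0.88]),
               "audio_b": np.array([0.05, 0.02, 0.95, 0.10])}
candidates = [name for name, tag in tag_library.items() if cosine(tag_query, tag) > 0.8]

# Final selection: highest feature similarity among the candidate audios.
feat_query = np.random.randn(512)
feat_library = {name: np.random.randn(512) for name in candidates}
target = max(candidates, key=lambda name: cosine(feat_query, feat_library[name]))
print(candidates, target)
```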
According to the audio matching method described above, the feature information of the audio to be matched is used multiple times and no other features need to be determined, which reduces the number of models to be selected and trained. At the same time, the audio is matched using both its tag information and its feature information, so that the feature information generated by the trained model better fits the semantic expression of the original audio, and the similarity between the determined target candidate audio and the audio to be matched is improved.
In addition, in the process of determining the multiband semantic feature sequence, a multi-layer, multi-domain interaction mechanism is introduced, so that the generated interaction feature information retains both the time domain characteristics and the frequency domain characteristics of the audio to be matched. At the same time, the interaction mechanism enables the high-level network to learn the low-level features, which helps improve the semantic representation capability of the feature information and the accuracy of the determined tag information of the audio.
In some embodiments, the audio to be matched is audio in a target video; the audio matching method further includes: replacing the audio to be matched in the target video with the target candidate audio to obtain the processed target video.
In some embodiments, the target video includes the audio to be matched. In one example, the target video is a movie, and the audio to be matched is the audio used in the movie. Optionally, the copyright of the audio to be matched has expired, and the audio to be matched in the target video needs to be replaced so as not to infringe the copyright of the audio to be matched.
In some embodiments, after determining the target candidate audio, the computer device performs duration equalization processing on the target candidate audio according to the playing duration of the audio to be matched, so as to obtain the processed target candidate audio. Optionally, the computer device replaces the audio to be matched in the target video with the processed target candidate audio.
For example, in the case where the play time period of the audio to be matched is longer than the play time period of the target candidate audio, the computer device lengthens the play time period of the target candidate audio. For example, a certain period of time in the target candidate audio is repeated so that the play duration of the extended target candidate audio is equal to the play duration of the audio to be matched.
For another example, in the case where the playing time length of the audio to be matched is smaller than the playing time length of the target candidate audio, the computer device clips the playing time length of the target candidate audio, for example, clips a header or a tail of the target candidate audio, so that the playing time length of the clipped target candidate audio is equal to the playing time length of the audio to be matched.
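A minimal sketch of this duration equalization, operating directly on sample arrays (an assumption for illustration), is given below.

```python
import numpy as np

def equalize_duration(candidate: np.ndarray, target_len: int) -> np.ndarray:
    """Loop or clip the target candidate audio so its play duration equals target_len samples."""
    if len(candidate) < target_len:
        # Lengthen: repeat the audio until it is long enough.
        repeats = int(np.ceil(target_len / len(candidate)))
        candidate = np.tile(candidate, repeats)
    # Clip the tail so the durations are equal.
    return candidate[:target_len]

sr = 16000
audio_to_match = np.random.randn(sr * 10)         # 10 s of audio to be matched
target_candidate = np.random.randn(sr * 7)        # 7 s target candidate audio
processed = equalize_duration(target_candidate, len(audio_to_match))
print(len(processed) / sr)                        # 10.0
```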
This method improves the degree of automation of replacing the audio to be matched in the target video, reduces the labor cost of the replacement process, and increases the replacement speed. At the same time, it avoids the subjectivity of manually selecting the target candidate audio during audio replacement, ensures that the playing effect of the target video after audio replacement is close to that before the replacement, and reduces the impact of audio replacement on the look and feel of the target video.
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.
Fig. 7 shows a block diagram of an audio matching device according to an exemplary embodiment of the present application. The apparatus 700 may include: a feature acquisition module 710, a tag determination module 720, an audio screening module 730, and an audio determination module 740.
The feature acquisition module 710 is configured to obtain feature information of the audio to be matched, where the feature information of the audio to be matched is used to characterize semantic features of the audio to be matched.
The tag determination module 720 is configured to determine tag information of the audio to be matched according to the feature information of the audio to be matched, where the tag information of the audio to be matched includes confidence degrees corresponding to the audio to be matched under multiple categories, and the confidence degrees are used for characterizing the correlation degrees between the audio to be matched and the categories.
The audio screening module 730 is configured to select at least one candidate audio from the audio library according to the tag information of the audio to be matched and the tag information of each audio contained in the audio library.
The audio determination module 740 is configured to determine a target candidate audio that matches the audio to be matched from the at least one candidate audio.
In some embodiments, the tag determination module includes: a confidence determining unit, configured to determine the classification results of the audio to be matched according to the feature information of the audio to be matched through a plurality of different classification networks, where different classification networks correspond to different classification modes, and the classification result determined by each classification network includes: confidence degrees of the audio to be matched under a plurality of categories of the classification mode corresponding to that classification network; and a tag generation unit, configured to determine the tag information of the audio to be matched according to the classification results respectively determined by the plurality of different classification networks.
In some embodiments, the plurality of different classification networks includes a first classification network and a second classification network, the first classification network corresponds to a classification based on audio styles, and the second classification network corresponds to a classification based on audio moods; the confidence determining unit is configured to determine, through the first classification network, a first classification result of the audio to be matched according to the feature information of the audio to be matched, where the first classification result includes: confidence degrees corresponding to the audio to be matched respectively under a plurality of categories obtained based on audio style classification; determining a second classification result of the audio to be matched according to the characteristic information of the audio to be matched through the second classification network, wherein the second classification result comprises: and the confidence degrees of the audio to be matched, which are respectively corresponding to the audio to be matched, are obtained under a plurality of categories based on the audio emotion classification.
In some embodiments, the tag generation unit is configured to: for each classification network, select, according to the result screening condition corresponding to the classification network, at least one confidence degree meeting the result screening condition from the classification result determined by the classification network, to obtain the filtered classification result corresponding to the classification network; and integrate the filtered classification results respectively corresponding to the classification networks to obtain the tag information of the audio to be matched.
In some embodiments, the feature acquisition module 710 includes: a sequence obtaining unit, configured to obtain a multiband semantic feature sequence of the audio to be matched, where the multiband semantic feature sequence includes semantic features respectively corresponding to a plurality of audio frames obtained by framing the audio to be matched; and a feature generating unit, configured to generate feature information of the audio to be matched according to the multiband semantic feature sequence.
In some embodiments, the sequence acquisition unit comprises: the characteristic extraction subunit is used for extracting time domain characteristic information and frequency domain characteristic information of the audio to be matched, wherein the time domain characteristic information is used for representing the characteristics of the audio to be matched in the time domain dimension, and the frequency domain characteristic information is used for representing the characteristics of the audio to be matched in the frequency domain dimension; the feature interaction subunit is used for carrying out fusion processing on at least one intermediate time domain feature in the time domain feature information extraction process and at least one intermediate frequency domain feature in the frequency domain feature information extraction process to obtain interaction feature information of the audio to be matched, wherein the interaction feature information is used for representing interaction features of the audio to be matched between the time domain dimension and the frequency domain dimension; and the sequence generation subunit is used for obtaining the multiband semantic feature sequence of the audio to be matched according to the time domain feature information, the frequency domain feature information and the interaction feature information.
In some embodiments, the at least one intermediate time-domain feature and the at least one intermediate frequency-domain feature form at least one feature set, the feature set comprising a set of corresponding intermediate time-domain features and intermediate frequency-domain features; the feature interaction subunit is configured to, for each feature group, splice a group of corresponding intermediate time domain features and intermediate frequency domain features included in the feature group, and obtain a spliced feature corresponding to the feature group; and carrying out feature extraction processing on the spliced features respectively corresponding to the feature groups according to a cascading order to obtain the interactive feature information of the audio to be matched.
In some embodiments, the feature extraction subunit is configured to: processing the audio to be matched through a time domain feature extraction network to obtain time domain feature information of the audio to be matched; performing time-frequency conversion on the audio to be matched to obtain frequency spectrum information of the audio to be matched; and processing the frequency spectrum information through a frequency domain feature extraction network to obtain the frequency domain feature information of the audio to be matched.
In some embodiments, the sequence generating subunit is configured to splice the time domain feature information, the frequency domain feature information, and the interaction feature information to obtain fusion feature information; carrying out pooling treatment on the fusion characteristic information by adopting a plurality of different pooling modes to obtain a plurality of pooling results; and generating the multiband semantic feature sequence of the audio to be matched according to the plurality of pooling results.
In some embodiments, the feature generating unit is configured to perform context semantic association processing on the multiband semantic feature sequence according to a positive timing direction to obtain a forward embedded vector, where the forward embedded vector is used to characterize semantic features of the audio to be matched in a playing direction; performing context semantic association processing on the multiband semantic feature sequence according to a time sequence reverse direction to obtain a reverse embedded vector, wherein the reverse embedded vector is used for representing semantic features of the audio to be matched in a reverse play direction; and splicing the forward embedded vector and the reverse embedded vector to obtain the characteristic information of the audio to be matched.
In some embodiments, the audio to be matched is audio in a target video; the apparatus 700 further includes a module configured to replace the audio to be matched in the target video with the target candidate audio to obtain the processed target video.
It should be noted that, when the apparatus provided in the above embodiments implements its functions, the division into the above functional modules is only used as an example. In practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for their specific implementation processes, refer to the method embodiments, which are not repeated here. For the beneficial effects of the apparatus provided in the above embodiments, also refer to the method embodiments.
Fig. 8 shows a block diagram of a computer device according to an exemplary embodiment of the present application.
In general, the computer device 800 includes: a processor 801 and a memory 802.
The processor 801 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 801 may be implemented in hardware in at least one of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 801 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also referred to as a central processing unit (Central Processing Unit, CPU), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 801 may be integrated with a graphics processor (Graphics Processing Unit, GPU) responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 801 may also include an AI processor for processing computing operations related to machine learning.
The memory 802 may include one or more computer-readable storage media, which may be tangible and non-transitory. The memory 802 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 802 stores a computer program that is loaded and executed by the processor 801 to implement the audio matching method provided by the above method embodiments.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is not limiting and that more or fewer components than shown may be included or that certain components may be combined or that a different arrangement of components may be employed.
The embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, the computer program being loaded and executed by a processor to implement the audio matching method provided by the above-mentioned method embodiments.
The computer readable medium may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes random access Memory (Random Access Memory, RAM), erasable programmable read-Only Memory (EPROM), electrically erasable programmable read-Only Memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), flash Memory or other solid state Memory technology, high density digital video disk (Digital Video Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to the ones described above.
The embodiments of the present application also provide a computer program product, which includes a computer program, where the computer program is stored in a computer readable storage medium, and a processor reads and executes the computer program from the computer readable storage medium to implement the audio matching method provided in the foregoing method embodiments.
It should be understood that references herein to "a plurality" are to two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may indicate that: A exists alone, both A and B exist, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
It should be noted that, information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of relevant data is required to comply with relevant laws and regulations and standards of relevant countries and regions. For example, the target voice data referred to in this application are all acquired with sufficient authorization.
The foregoing description of the preferred embodiments is merely illustrative of the present application and is not intended to limit the application to the particular embodiments shown; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the present application.