CN112182281B - Audio recommendation method, device and storage medium - Google Patents

Audio recommendation method, device and storage medium

Info

Publication number
CN112182281B
Authority
CN
China
Prior art keywords
image
text
information
feature
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910604904.8A
Other languages
Chinese (zh)
Other versions
CN112182281A (en)
Inventor
刘龙坡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910604904.8A
Publication of CN112182281A
Application granted
Publication of CN112182281B
Legal status: Active
Anticipated expiration

Abstract

The application discloses an audio recommendation method, apparatus and storage medium, which relate to the field of computer technologies and are used to improve the accuracy of audio recommendation. In the method, after image feature information capable of describing the semantic content of the image information to be scored is determined, feature similarity is calculated between the image feature information and text feature information that likewise describes the semantic features of each audio, and audio that can serve as background music for the image information to be scored is recommended according to the calculated feature similarity. In this way, by matching the semantics of the image information to be scored with the semantics of the audio, audio with the same or similar semantics is recommended to the user, making the audio recommendation more accurate; at the same time, the user no longer has to pick a soundtrack from many audios unrelated to the image information to be scored, which reduces the complexity of the operation and saves the user's time when adding a soundtrack.

Description

Audio recommendation method, device and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an audio recommendation method, an audio recommendation device, and a storage medium.
Background
With the rise of various instant messaging applications and information sharing platforms, users can share image information such as pictures and videos on these platforms. When sharing image information, users usually add a score (background music) to the shared image information.
In the prior art, however, the user has to select audio from an audio library when scoring the shared image information, and most of the audio automatically recommended to the user from the audio library is currently popular music, which does not fit the user's actual needs well; the accuracy of the recommended audio therefore needs to be improved.
Disclosure of Invention
Embodiments of the application provide an audio recommendation method, device and storage medium, which are used to improve the accuracy of audio recommendation.
In a first aspect, a method of audio recommendation is provided, comprising:
acquiring image information to be scored; wherein the image information to be scored comprises at least one frame of image;
performing feature extraction on the image information to be scored to obtain image feature information capable of describing the semantic content of the image information to be scored;
determining the feature similarity between the image feature information and each of a plurality of pieces of text feature information; wherein each piece of text feature information corresponds to an audio and is used to describe the semantic features of that audio;
and selecting a specified number of audios from the plurality of audios as recommended audios for the image information to be scored according to the feature similarity.
In one embodiment, if the image information to be scored includes a plurality of frames of images, performing feature extraction on the image information to be scored to obtain image feature information capable of describing its semantic content includes:
extracting features of each frame of image in the image information to be scored based on a residual network model to obtain a feature vector of each frame of image;
determining an average feature vector of the image information to be scored according to the feature vectors of the frames of images;
performing convolution calculation between the average feature vector and each of a preset number of first convolution kernels and adding corresponding bias parameters, to obtain a feature vector of a specified dimension as the image feature information; wherein the convolution parameters of different first convolution kernels are not exactly the same.
In one embodiment, the text corresponding to each audio is stored in advance, and the method further includes:
obtaining the text feature information of each audio as follows:
performing word segmentation on the text corresponding to the audio and then removing stop words, to obtain the word segments of the text corresponding to the audio;
splicing the word segments in their order in the text corresponding to the audio to obtain a word segment sequence;
performing word embedding on each word segment in the word segment sequence to obtain a word vector carrying contextual relations for each segment, as the text vector of the audio;
performing convolution calculation between the text vector and each of a preset number of second convolution kernels and adding corresponding bias parameters, to obtain a feature vector of the specified dimension for the audio as its text feature information; wherein the convolution parameters of different second convolution kernels are not exactly the same, and the number of first convolution kernels is the same as the number of second convolution kernels.
In one embodiment, determining the feature similarity between the image feature information and each of the plurality of pieces of text feature information includes:
determining the cosine similarity between the image feature information and each piece of text feature information.
In one embodiment, the convolution parameters of the first and second convolution kernels and the bias parameters are obtained by training as follows:
obtaining training samples; each training sample comprises image feature information of sample image information and text feature information of a sample text, and the label of a training sample is the subjective similarity between the image feature information and the text feature information in that sample;
determining the feature similarity between the image feature information and the text feature information in each training sample;
determining the error between the feature similarity corresponding to each training sample and the corresponding subjective similarity;
and propagating the error to the first convolution kernels and the second convolution kernels through a backpropagation algorithm, so as to update the convolution parameters and corresponding bias parameters in the first and second convolution kernels, until convolution parameters and bias parameters meeting the specified conditions are obtained.
In a second aspect, an apparatus for audio recommendation is provided, comprising:
a first acquisition module, used to acquire the image information to be scored; wherein the image information to be scored comprises at least one frame of image;
a feature extraction module, used to perform feature extraction on the image information to be scored to obtain image feature information capable of describing its semantic content;
a first similarity determining module, used to determine the feature similarity between the image feature information and each of a plurality of pieces of text feature information; wherein each piece of text feature information corresponds to an audio;
and a recommendation module, used to select a specified number of audios from the plurality of audios as recommended audios for the image information to be scored according to the feature similarity.
In one embodiment, if the image information to be scored includes a plurality of frames of images, the feature extraction module includes:
a feature extraction unit, used to extract features of each frame of image in the image information to be scored based on a residual network model to obtain a feature vector of each frame of image;
a feature averaging unit, used to determine an average feature vector of the image information to be scored according to the feature vectors of the frames of images;
and a computing unit, used to perform convolution calculation between the average feature vector and each of a preset number of first convolution kernels and add corresponding bias parameters, to obtain a feature vector of a specified dimension as the image feature information; wherein the convolution parameters of different first convolution kernels are not exactly the same.
In one embodiment, the text corresponding to each audio is stored in advance, and the apparatus further includes the following modules for obtaining the text feature information of each audio:
a processing module, used to perform word segmentation on the text corresponding to the audio and then remove stop words, to obtain the word segments of the text corresponding to the audio;
a splicing module, used to splice the word segments in their order in the text corresponding to the audio to obtain a word segment sequence;
a word embedding module, used to perform word embedding on each word segment in the word segment sequence to obtain a word vector carrying contextual relations for each segment, as the text vector of the audio;
and a calculation module, used to perform convolution calculation between the text vector and each of a preset number of second convolution kernels and add corresponding bias parameters, to obtain a feature vector of the specified dimension for the audio as its text feature information; wherein the convolution parameters of different second convolution kernels are not exactly the same, and the number of first convolution kernels is the same as the number of second convolution kernels.
In one embodiment, the first similarity determining module is specifically configured to determine cosine similarity between the image feature information and each text feature information.
In one embodiment, the convolution parameters of the first and second convolution kernels and the bias parameters are obtained by training through the following modules:
a second acquisition module, used to obtain training samples; each training sample comprises image feature information of sample image information and text feature information of a sample text, and the label of a training sample is the subjective similarity between the image feature information and the text feature information in that sample;
a second similarity determining module, used to determine the feature similarity between the image feature information and the text feature information in each training sample;
an error determining module, used to determine the error between the feature similarity corresponding to each training sample and the corresponding subjective similarity;
and a parameter updating module, used to propagate the error to the first convolution kernels and the second convolution kernels through a backpropagation algorithm, so as to update the convolution parameters and corresponding bias parameters in the first and second convolution kernels, until convolution parameters and bias parameters meeting the specified conditions are obtained.
In a third aspect, a computing device is provided, comprising at least one processing unit and at least one storage unit, wherein the storage unit stores a computer program which, when executed by the processing unit, causes the processing unit to perform the steps of any of the audio recommendation methods described above.
In one embodiment, the computing device may be a server or a terminal device.
In a fourth aspect, a computer-readable medium is provided, storing a computer program executable by a terminal device; when the program runs on the terminal device, it causes the terminal device to perform the steps of any one of the audio recommendation methods described above.
According to the audio recommendation method, device and storage medium provided by the embodiments of the application, after image feature information capable of describing the semantic content of the image information to be scored is determined, feature similarity is calculated between that image feature information and text feature information that likewise describes the semantic features of each audio, and audio that can serve as background music for the image information to be scored is recommended according to the calculated feature similarity. In this way, by matching the semantics of the image information to be scored with the semantics of the audio, audio with the same or similar semantics is recommended to the user, making the audio recommendation more accurate; at the same time, the audio recommended to the user is related to the image information to be scored, so the user does not need to filter out irrelevant audio manually, which reduces the user's operating cost and saves time when adding a soundtrack.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
fig. 1 is a schematic view of an application scenario of an audio recommendation method in an embodiment of the present application;
FIG. 2 is a training flow chart of a predictive model in an embodiment of the application;
FIG. 3 is a flowchart of acquiring image feature information according to an embodiment of the present application;
FIG. 4 is a flowchart of acquiring text feature information according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of predictive model training in an embodiment of the application;
FIG. 6 is a schematic flow chart of audio recommendation according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an audio recommendation apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
To solve the prior-art problems that, when a user scores image information, manually selecting audio imposes a high operating cost and the narrow range of candidate audio leads to a limited variety of selected audio, the embodiments of the application provide an audio recommendation method, device and storage medium. For a better understanding of the technical solution provided by the embodiments of the application, its basic principle is briefly described below:
Take background music for short videos as an example. The music currently pushed to the user is popular music that has nothing to do with the short video the user is about to publish, and the user has to listen to the music in the audio library one by one from a playlist. Because the ordering of the audio in the playlist is unrelated to the short video to be published, the user typically cannot find suitable background music even after listening to many pieces of audio. In view of this, the embodiments of the present application provide an audio recommendation method, device and storage medium. In the method, after image feature information capable of describing the semantic content of the image information to be scored is determined, feature similarity is calculated between the image feature information and text feature information that likewise describes the semantic features of each audio, and audio that can serve as background music for the image information to be scored is recommended according to the calculated feature similarity. In this way, by matching the semantics of the image information to be scored with the semantics of the audio, audio with the same or similar semantics is recommended to the user, making the audio recommendation more accurate; at the same time, the complexity of the user's scoring operation is reduced, since the user no longer has to manually filter relevant audio out of many audios unrelated to the image information to be scored, which saves operating time and reduces the processing resources consumed in handling user operations on irrelevant audio.
For easy understanding, the technical scheme provided by the application is further described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an application scenario of the audio recommendation method according to an embodiment of the present application. The scenario includes a terminal device 101 of a user 100 and a server 102.
When the user 100 wants to add a score to a short video on the terminal device 101, the user can send the short video requiring a score, through an instant messaging application, information sharing application or the like on the terminal device 101, to the server 102 that publishes the short video.
The server 102 performs feature extraction on the received short video to obtain the image feature information of the short video; calculates the feature similarity between the obtained image feature information and the text feature information stored in advance in a database; selects a specified number of audios according to the calculated feature similarity; and sends the audio identifiers to the instant messaging application, information sharing application or the like on the terminal device 101.
The instant messaging application, information sharing application or the like on the terminal device 101 determines the recommended audios according to the audio identifiers and displays them on the operation interface of the terminal device 101.
In one embodiment, the terminal device 101 may be a mobile phone, a tablet computer, a personal computer, a virtual machine, or a terminal device simulated by a simulator, etc.
In one embodiment, the communication between the terminal device 101 and the server 102 may be wireless communication, Ethernet communication, or the like.
In one embodiment, the above operations may also be performed by the instant messaging application, information sharing application or the like on the terminal device 101, that is: the processing means of the application on the terminal device 101 performs feature extraction on the short video to obtain its image feature information, and sends a request to the server 102 to obtain the text feature information; the server 102 returns the text feature information to the application on the terminal device 101 according to the received request.
The processing means of the application on the terminal device 101 then calculates the feature similarity between the image feature information and the received text feature information, selects a specified number of audios as recommended audios according to the calculated feature similarity, and displays the recommended audios on the operation interface of the terminal device 101.
In one embodiment, to improve the efficiency of the audio recommendation, the feature extraction of the short video may also be completed by the application on the terminal device 101, with the extracted image feature information sent to the server 102, which calculates the feature similarity.
Therefore, in the embodiments of the present application, the solution may be deployed on the terminal device 101, on the server 102, or across both in combination, which the present application does not limit.
In the embodiments of the application, the method of automatically recommending audio to the user can be divided into two parts, a training part and a prediction part. The training part is used to train the extraction of image feature information and text feature information, and the prediction part is used to predict which audios are related to the image information to be scored. The two parts are described separately below.
1. Training part:
In the embodiments of the present application, to automatically recommend audio to the user, a prediction model first needs to be established and trained on training samples so that it can predict the audio related to given image information. As shown in fig. 2, this specifically includes the following steps:
Step 201: obtain training samples; each training sample comprises image feature information of sample image information and text feature information of a sample text, and the label of a training sample is the subjective similarity between the image feature information and the text feature information in that sample.
Step 202: determine the feature similarity between the image feature information and the text feature information in each training sample.
The image feature information is used to describe the semantic content of the image information. For example, if the image information shows a snowing scene, the image feature information extracted from it represents snowing.
The sample text is the text corresponding to a sample audio. If the audio has lyrics, the lyrics are taken as its corresponding text; if the audio has no lyrics, description information describing the semantic content the audio conveys is taken as its corresponding text. The description information may be determined from human experience and musical knowledge. For example, if an audio has no lyrics and expresses longing, its description information is a statement related to longing.
The text feature information is used to describe the semantic content of the audio, obtained from the lyrics of the audio or, for audio without lyrics, from the description information describing the audio. For example, if the lyrics of the audio are related to snow (e.g., a lyric such as 'I love you, the snow of the north'), the text feature information is used to indicate snowing.
In the embodiments of the application, the subjective similarity can be determined from human experience and knowledge. For example, if the semantic content described by the image feature information of the sample image information is a snowing scene and the semantic content described by the text feature information of the sample text is related to snow (such as a snowball fight or watching snow), the subjective similarity is set to 1; if the semantic content described by the text feature information is unrelated to snow (such as rain or thunder), the subjective similarity is set to 0. In general, the subjective similarity ranges from 0 to 1, with a larger value indicating higher subjective similarity, i.e., more relevant text and image information. For simplicity, the subjective similarity may also take only the two values 0 and 1, where 0 means unrelated and 1 means related.
Of course, in the embodiments of the present application, besides dividing the subjective similarity into two classes (1 or 0), it may be classified more finely, for example into three classes: unrelated (subjective similarity 0), related (subjective similarity 0.5), and highly related (subjective similarity 1). A finer classification captures the degree of correlation between image information and audio more precisely, so that audio that better fits the image information is recommended to the user, improving the accuracy of the recommendation.
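As an illustration only, training samples under this three-class labeling scheme could be laid out as follows; the pairs and labels are invented for this sketch, and in practice the entries would be the extracted feature vectors rather than textual descriptions:

```python
# Illustrative only: each training sample pairs sample image information with
# a sample text and a human-assigned subjective similarity label.
training_samples = [
    # (image semantics,  text semantics,          subjective similarity)
    ("snowing scene",    "lyrics about snow",     1.0),  # highly related
    ("snowing scene",    "lyrics about parting",  0.5),  # related
    ("snowing scene",    "lyrics about thunder",  0.0),  # unrelated
]
```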
In the embodiments of the application, the cosine of the angle between the image feature information vector and the text feature information vector can be calculated to determine their cosine similarity.
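A minimal sketch of this calculation in Python with PyTorch, assuming the two pieces of feature information are 512-dimensional vectors (the dimension used in the step A3 example below):

```python
import torch
import torch.nn.functional as F

# Stand-ins for the image feature information and text feature information;
# in the method these come from the two fully connected layers.
image_feature_info = torch.randn(512)
text_feature_info = torch.randn(512)

# Cosine of the angle between the two vectors, in [-1, 1].
feature_similarity = F.cosine_similarity(image_feature_info,
                                         text_feature_info, dim=0)
```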
Step 203: determine the error between the feature similarity corresponding to each training sample and the corresponding subjective similarity.
Step 204: propagate the error to the first convolution kernels and the second convolution kernels through a backpropagation algorithm, so as to update the convolution parameters and corresponding bias parameters in the first and second convolution kernels, until convolution parameters and bias parameters meeting the specified conditions are obtained.
The first convolution kernels are used to obtain the image feature information, and the second convolution kernels are used to obtain the text feature information.
The backpropagation algorithm (also called the BP algorithm) serves to reduce the error: by updating the convolution parameters in the first and second convolution kernels and the corresponding bias parameters, the resulting error becomes smaller until the model meets the requirements.
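One training step might look as follows. This is a sketch under stated assumptions: the banks of convolution kernels plus bias parameters are written as equivalent fully connected (Linear) layers, the error is taken to be a mean-squared error, and plain SGD stands in for the update rule; the text pins down none of these choices.

```python
import torch
import torch.nn.functional as F

image_proj = torch.nn.Linear(2048, 512)  # first convolution kernels + bias
text_proj = torch.nn.Linear(128, 512)    # second convolution kernels + bias
optimizer = torch.optim.SGD(
    list(image_proj.parameters()) + list(text_proj.parameters()), lr=1e-2)

def train_step(avg_image_vec, text_vec, subjective_sim):
    # subjective_sim: scalar tensor label, e.g. torch.tensor(1.0)
    image_feat = image_proj(avg_image_vec)               # image feature info
    text_feat = text_proj(text_vec)                      # text feature info
    feat_sim = F.cosine_similarity(image_feat, text_feat, dim=0)
    loss = F.mse_loss(feat_sim, subjective_sim)          # error vs. label
    optimizer.zero_grad()
    loss.backward()   # backpropagation carries the error into both layers
    optimizer.step()  # update convolution (weight) and bias parameters
    return float(loss)
```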
Thus, by training the prediction model, the model can better interpret the image information, find audio that fits it, and recommend that audio to the user, making the audio recommendation more accurate; at the same time, the user's operating cost when scoring is reduced and time is saved. In addition, since the audio selected through the prediction model is chosen from all the audios in the audio library, the selected audio is more varied than automatically recommended popular audio, increasing the variety of selectable audio.
As described above, in the embodiments of the present application a training sample includes image feature information and text feature information; how each is obtained is described in detail below.
(1) Image feature information:
In the embodiments of the application, to obtain image feature information, the image information must first be acquired; the image information may be video information, picture information, or image information composed of a plurality of pictures. After the image information is obtained, feature extraction is performed on it to obtain the image feature information, which may be implemented as steps A1-A3:
Step A1: extract features of each frame of image in the sample image information based on a residual network model to obtain a feature vector of each frame of image.
Here the sample image information comprises a plurality of frames of images.
In an embodiment of the application, a residual network (ResNet) model is used to extract features from the images. The residual network model is a convolutional neural network (CNN) model; residual networks are easy to optimize, and the accuracy of the extracted image feature information can be improved by adding considerable depth. Their core idea is to overcome the side effect (the degradation problem) caused by increasing depth in conventional convolutional neural network models, so that network performance can be improved simply by increasing network depth.
In addition, in the embodiments of the application, features may also be extracted from the image information by other convolutional neural network models, or by network models such as recurrent neural networks (RNN) and Long Short-Term Memory (LSTM) networks. Any model that can extract the semantic features of the image information is applicable to the embodiments of the application.
Step A2: determine the average feature vector of the sample image information according to the feature vectors of the frames of images.
If the sample image information is a single picture, feature extraction on that picture directly yields the average feature vector of the sample image information.
In one embodiment, the feature vectors of the frames may be summed and averaged to obtain the average feature vector of the sample image information.
In one embodiment, the feature vectors of the frames may also be weighted, summed, and averaged to obtain the average feature vector of the sample image information.
Step A3: respectively carrying out convolution calculation on the average feature vector and a preset number of first convolution kernels, and adding corresponding offset parameters to obtain feature vectors with specified dimensions as the image feature information; wherein the convolution parameters of the different first convolution kernels are not exactly the same.
Wherein the convolution parameters and bias parameters of the first convolution kernel may be modified by the error obtained in step 203 to make the obtained image feature information more consistent with the semantic content described by the sample image.
The number of the first convolution kernels may be set according to practical situations, such as 512 kernels, 1000 kernels, and the like. If the number of the first convolution kernels is 512, a 512-dimensional feature vector is obtained.
In this way, feature extraction through the neural network model makes the obtained image feature information describe the semantic content of the image information more accurately, so that more satisfactory audio can be recommended for the image information.
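Putting steps A1-A3 together, a minimal sketch assuming ResNet-50 as the residual network (2048-dimensional pooled features) and 512 first convolution kernels; a bank of 1x1 convolutions with bias applied to a single vector is written as the equivalent fully connected layer:

```python
import torch
import torchvision

resnet = torchvision.models.resnet50()      # residual network model
resnet.fc = torch.nn.Identity()             # expose the 2048-d pooled features
resnet.eval()

first_kernels = torch.nn.Linear(2048, 512)  # 512 first conv kernels + bias

def image_feature_info(frames: torch.Tensor) -> torch.Tensor:
    """frames: (num_frames, 3, H, W) stack of the frames of the image info."""
    with torch.no_grad():
        per_frame = resnet(frames)          # step A1: (num_frames, 2048)
    avg_vec = per_frame.mean(dim=0)         # step A2: average feature vector
    return first_kernels(avg_vec)           # step A3: 512-d image feature info
```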
Fig. 3 shows a schematic flow chart of obtaining the image feature information. After each frame of the image information is acquired ('image frame' denotes each frame in the image information), feature extraction is performed on each frame through the residual network model 301 to obtain a feature vector for each frame; the feature vectors of the frames are averaged to obtain the average feature vector of the image information; the average feature vector is input into the fully connected layer 302 for calculation (the fully connected layer carries out the content described in step A3: in it, the average feature vector of the image information is convolved with each first convolution kernel of the layer and the corresponding bias parameters are added), yielding the image feature information of the image information.
(2) Text feature information:
In the embodiments of the application, to obtain the text feature information, the text corresponding to the audio must first be acquired. After the text corresponding to the audio is obtained, features are extracted from it to obtain the text feature information of the audio, which may be implemented as steps B1-B4:
Step B1: perform word segmentation on the text corresponding to the audio and then remove stop words, to obtain the word segments of the text corresponding to the audio.
For example, if the text is 'I would rather be an ordinary person', word segmentation produces 'I / would-rather / be / an / ordinary / person'; here 'would-rather' and 'an' are stop words, so after stop-word removal the word segments are 'I / be / ordinary / person'.
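A sketch of step B1, assuming the common Chinese word-segmentation library jieba; the stop-word set is an illustrative fragment matching the example above, not a full list:

```python
import jieba  # Chinese word-segmentation library (an assumption here)

STOP_WORDS = {"宁愿", "一个"}  # "would-rather", "an" from the example above

def word_segments(text):
    tokens = jieba.lcut(text)                           # word segmentation
    return [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
```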
Step B2: splice the word segments in their order in the text corresponding to the audio to obtain a word segment sequence.
Splicing in text order preserves the contextual relations within the text. Continuing the example above, the word segment sequence obtained by splicing in text order is 'I be ordinary person'.
Step B3: perform word embedding on each word segment in the word segment sequence to obtain, for each segment, a word vector carrying its contextual relations; these word vectors serve as the text vector of the audio.
Word embedding is a general term for language models and representation learning techniques in natural language processing (NLP). Conceptually, it embeds a high-dimensional space, whose dimension is the total number of words, into a continuous vector space of much lower dimension, mapping each word or phrase to a vector over the real numbers; word embedding may be performed with word2vec (word vectors), for example.
In the embodiments of the application, word embedding produces a word vector for each word segment; feature extraction is then performed on these word vectors to obtain the text vector of the text.
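A sketch of step B3 using gensim's word2vec implementation; the corpus contents are illustrative, and in practice the corpus would hold the word segment sequences of all audio texts:

```python
from gensim.models import Word2Vec

# Word segment sequences of the audio texts (illustrative data).
corpus = [["我", "做", "俗人"], ["我", "爱", "雪"]]

# Train word vectors that capture the contextual relations between segments.
w2v = Word2Vec(sentences=corpus, vector_size=128, window=5, min_count=1)

# The word vector of each segment in one sequence; together they form the
# text vector of that audio.
word_vectors = [w2v.wv[token] for token in corpus[0]]
```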
Step B4: perform convolution calculation between the text vector and each of a preset number of second convolution kernels and add corresponding bias parameters, to obtain a feature vector of the specified dimension for the audio as its text feature information; wherein the convolution parameters of different second convolution kernels are not exactly the same, and the number of first convolution kernels is the same as the number of second convolution kernels.
The convolution parameters and bias parameters of the second convolution kernels may be corrected using the error obtained in step 203, so that the obtained text feature information matches the semantic content of the audio's text more closely. The number of second convolution kernels may be set according to the actual situation; the dimension of the obtained feature vector equals the number of second convolution kernels.
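A sketch of step B4 under stated assumptions: the per-segment word vectors are mean-pooled into one 128-dimensional vector (the text only says 'feature extraction and dimension reduction', so the pooling is an assumption), and the 512 second convolution kernels plus bias are again written as an equivalent fully connected layer so the output dimension matches the image feature information:

```python
import numpy as np
import torch

second_kernels = torch.nn.Linear(128, 512)  # 512 second conv kernels + bias

def text_feature_info(word_vectors) -> torch.Tensor:
    """word_vectors: list of 128-d word vectors for one word segment sequence."""
    stacked = torch.as_tensor(np.asarray(word_vectors))  # (num_segments, 128)
    text_vec = stacked.mean(dim=0)                       # pooled text vector
    return second_kernels(text_vec)                      # 512-d text feature info
```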
The text feature information corresponds to the audio and is used to describe the semantic features of the audio. The correspondence between text feature information and audio may be one-to-one or one-to-many. For example, if two audios without lyrics both express longing, their corresponding texts may be the same text, so their text feature information is the same.
In this way, extracting features from the text corresponding to the audio through the word vector model makes the obtained text feature information describe the semantic content of the text more accurately, so that audio meeting the requirements can be recommended for the image information.
In one embodiment, the text of audio with lyrics covers richer content than the text of audio without lyrics, so text feature information extracted from lyrics describes the information expressed by the audio more comprehensively. To recommend audio to the user with higher accuracy, audio with lyrics may therefore be recommended preferentially; determining the text feature information from the lyrics makes the recommendation more accurate.
Fig. 4 shows a schematic flow chart of obtaining the text feature information. After the audio text is acquired, word segmentation and stop-word removal are applied to obtain its word segments; the word segments are fed into the word vector model 401 for calculation (the word vector model carries out the content described in step B3, i.e., the word embedding of each segment) to obtain a word vector for each segment; feature extraction and dimension reduction over the word vectors yield the feature vector of the audio text; this feature vector is input into the fully connected layer 402 (which carries out the content described in step B4: in it, the text's feature vector is convolved with each second convolution kernel of the layer and the corresponding bias parameters are added), yielding the text feature information of the audio text.
It should be noted that the fully connected layer 302 in fig. 3 and the fully connected layer 402 in fig. 4 are two distinct fully connected layers.
After the image feature information and the text feature information are obtained, they are input into the prediction model for training until a model meeting the requirements is obtained. Fig. 5 shows a schematic flow chart of the prediction model training. The image information passes through the flow shown in fig. 3 to obtain image feature information; the text corresponding to the audio passes through the flow shown in fig. 4 to obtain text feature information; similarity calculation on the two yields the feature similarity, and the difference between the feature similarity and the subjective similarity of the training sample gives the error. Once the error is determined, it is compared with a preset error: if it is not smaller than the preset error, the error is fed back into the two fully connected layers (layer 302 and layer 402) through the backpropagation algorithm to update their parameters, producing new image feature information and new text feature information; if it is smaller than the preset error, training of the prediction model is complete and training stops.
2. Prediction part:
In the embodiments of the present application, after the prediction model is trained, audio can be recommended for the image information to be scored through the prediction model. As shown in fig. 6, this may specifically include the following steps:
Step 601: acquire the image information to be scored; the image information to be scored comprises at least one frame of image.
Step 602: perform feature extraction on the image information to be scored to obtain image feature information capable of describing its semantic content.
Step 603: determine the feature similarity between the image feature information and each of a plurality of pieces of text feature information; each piece of text feature information corresponds to an audio and is used to describe the semantic features of that audio.
The text feature information may be obtained by feature extraction on the texts of the audios in the audio library at the time audio is recommended for the image information to be scored. Alternatively, the texts corresponding to the audios in the audio library can be processed in advance and the results stored in a database, so that the text feature information can be read directly from the database when needed.
Step 604: select a specified number of audios from the plurality of audios as recommended audios for the image information to be scored according to the feature similarity.
The specified number of audios may be selected in descending order of feature similarity as the recommended audios for the image information to be scored; alternatively, the specified number of audios may be selected at random from the audios meeting the requirement, i.e., those whose feature similarity is greater than a preset similarity.
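A minimal sketch of step 604 under the descending-order strategy; text_feature_matrix and audio_ids are assumed to come from the pre-stored database described in step 603:

```python
import torch
import torch.nn.functional as F

def recommend_audios(image_feat, text_feature_matrix, audio_ids, k=10):
    """image_feat: (512,); text_feature_matrix: (num_audios, 512)."""
    sims = F.cosine_similarity(image_feat.unsqueeze(0),
                               text_feature_matrix, dim=1)
    top = torch.topk(sims, k=min(k, len(audio_ids)))  # highest similarity first
    return [audio_ids[i] for i in top.indices.tolist()]
```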
Thus, intelligently matching the image feature information of the image information to be scored against the text feature information describing the semantic features of the audios reduces the user's operating cost when scoring and increases the variety of selectable audio.
In actual operation, the part of the prediction model that obtains the text feature information need not be run online: the texts of the audios are processed in advance and the resulting text feature information is stored. When the prediction model needs text feature information, it is read directly from the stored text feature information.
Based on the same inventive concept, an embodiment of the application further provides an audio recommendation apparatus. As shown in fig. 7, the apparatus includes:
a first acquisition module 701, configured to acquire the image information to be scored; wherein the image information to be scored comprises at least one frame of image;
a feature extraction module 702, configured to perform feature extraction on the image information to be scored to obtain image feature information capable of describing its semantic content;
a first similarity determining module 703, configured to determine the feature similarity between the image feature information and each piece of text feature information; wherein each piece of text feature information corresponds to an audio;
and a recommendation module 704, configured to select a specified number of audios from the plurality of audios as recommended audios for the image information to be scored according to the feature similarity.
Further, if the image information to be scored includes a plurality of frames of images, the feature extraction module 702 includes:
a feature extraction unit, configured to extract features of each frame of image in the image information to be scored based on a residual network model to obtain a feature vector of each frame of image;
a feature averaging unit, configured to determine an average feature vector of the image information to be scored according to the feature vectors of the frames of images;
and a computing unit, configured to perform convolution calculation between the average feature vector and each of a preset number of first convolution kernels and add corresponding bias parameters, to obtain a feature vector of a specified dimension as the image feature information; wherein the convolution parameters of different first convolution kernels are not exactly the same.
Further, the text corresponding to each audio is stored in advance, and the apparatus further includes the following modules for obtaining the text feature information of each audio:
a processing module, configured to perform word segmentation on the text corresponding to the audio and then remove stop words, to obtain the word segments of the text corresponding to the audio;
a splicing module, configured to splice the word segments in their order in the text corresponding to the audio to obtain a word segment sequence;
a word embedding module, configured to perform word embedding on each word segment in the word segment sequence to obtain a word vector carrying contextual relations for each segment, as the text vector of the audio;
and a calculation module, configured to perform convolution calculation between the text vector and each of a preset number of second convolution kernels and add corresponding bias parameters, to obtain a feature vector of the specified dimension for the audio as its text feature information; wherein the convolution parameters of different second convolution kernels are not exactly the same, and the number of first convolution kernels is the same as the number of second convolution kernels.
Further, the first similarity determining module 703 is specifically configured to determine the cosine similarity between the image feature information and each piece of text feature information.
Further, the convolution parameters of the first and second convolution kernels and the bias parameters are obtained by training through the following modules:
a second acquisition module, configured to obtain training samples; each training sample comprises image feature information of sample image information and text feature information of a sample text, and the label of a training sample is the subjective similarity between the image feature information and the text feature information in that sample;
a second similarity determining module, configured to determine the feature similarity between the image feature information and the text feature information in each training sample;
an error determining module, configured to determine the error between the feature similarity corresponding to each training sample and the corresponding subjective similarity;
and a parameter updating module, configured to propagate the error to the first convolution kernels and the second convolution kernels through a backpropagation algorithm, so as to update the convolution parameters and corresponding bias parameters in the first and second convolution kernels, until convolution parameters and bias parameters meeting the specified conditions are obtained.
Based on the same technical concept, an embodiment of the present application further provides a terminal device 800. Referring to fig. 8, the terminal device 800 is configured to implement the methods described in the above method embodiments, for example the embodiment shown in fig. 6, and may include a memory 801, a processor 802, an input unit 803, and a display panel 804.
The memory 801 is used to store the computer program executed by the processor 802. The memory 801 may mainly include a program storage area and a data storage area: the program storage area may store the operating system, the application programs required for at least one function, and the like, while the data storage area may store data created according to the use of the terminal device 800, and the like. The processor 802 may be a central processing unit (CPU), a digital processing unit, or the like. The input unit 803 may be used to obtain user instructions input by the user. The display panel 804 is configured to display information input by the user or provided to the user; in the embodiments of the application it is mainly used to display the interfaces of the applications on the terminal device and the controls shown in those interfaces. Optionally, the display panel 804 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like.
The embodiments of the present application do not limit the specific connection medium among the memory 801, the processor 802, the input unit 803, and the display panel 804. In fig. 8 they are connected through the bus 805, drawn as a thick line; the connections between the other components are merely illustrative and not limiting. The bus 805 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is drawn in fig. 8, but this does not mean there is only one bus or only one type of bus.
The memory 801 may be a volatile memory such as random-access memory (RAM); or a non-volatile memory such as read-only memory, flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 801 may also be a combination of the above memories.
The processor 802 is configured to invoke the computer program stored in the memory 801 to perform the embodiment shown in fig. 6.
An embodiment of the present application also provides a computer-readable storage medium storing the computer-executable instructions required by the processor, containing a program to be executed by the processor.
In some possible embodiments, aspects of the audio recommendation method provided by the present application may also be implemented in the form of a program product comprising program code; when the program product runs on a terminal device, the program code causes the terminal device to carry out the steps of the audio recommendation method according to the various exemplary embodiments of the application described in this specification. For example, the terminal device may perform the embodiment shown in fig. 6.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A program product for audio recommendation according to embodiments of the present application may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a computing device. However, the program product of the present application is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out the operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this is not required to either imply that the operations must be performed in that particular order or that all of the illustrated operations be performed to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications to those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the appended claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (8)

text feature information of each audio is obtained as follows: performing word segmentation on the text corresponding to the audio and then performing stop-word removal, to obtain the word segments of the text corresponding to the audio; splicing the word segments in their order of appearance in the text corresponding to the audio, to obtain a word segment sequence; applying word embedding to each word segment in the word segment sequence, to obtain for each word segment a word vector capturing its contextual relations, the word vectors together serving as the text vector of the audio; performing convolution calculation on the text vector with a preset number of second convolution kernels and adding the corresponding bias parameters, to obtain a feature vector of a specified dimension for the audio as the text feature information; wherein the convolution parameters of different second convolution kernels are not identical, and the number of the first convolution kernels is identical to the number of the second convolution kernels;
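The claim element above describes a TextCNN-style pipeline: segment the text, drop stop words, embed the segments in order, then convolve the embedded sequence with several kernels (each with its own bias) to produce a fixed-dimension feature vector. The sketch below illustrates that flow in Python. It is a minimal illustration, not the patented implementation: the whitespace segmenter, the stop-word list, the random embeddings and kernel weights, and the max-over-time pooling are all assumptions made for the example.

```python
# Minimal sketch of the claimed text-feature pipeline (illustrative only;
# segmenter, stop words, weights, and pooling are assumptions, not the patent's).
import numpy as np

rng = np.random.default_rng(0)

STOP_WORDS = {"the", "a", "of"}   # assumed stop-word list
EMBED_DIM = 8                     # assumed embedding size
NUM_KERNELS = 4                   # the claim's "preset number" of second convolution kernels
KERNEL_WIDTH = 3                  # assumed kernel width, in word segments

# Toy embedding table standing in for a trained word-embedding layer.
vocab: dict[str, np.ndarray] = {}
def embed(word: str) -> np.ndarray:
    if word not in vocab:
        vocab[word] = rng.normal(size=EMBED_DIM)
    return vocab[word]

def text_feature(text: str) -> np.ndarray:
    # 1) Word segmentation + stop-word removal (whitespace split stands in
    #    for a real segmenter).
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    # 2) Splice the segments in their original order into an embedded sequence.
    seq = np.stack([embed(w) for w in words])            # (seq_len, EMBED_DIM)
    # 3) Convolve with NUM_KERNELS kernels, each with its own bias, and
    #    max-pool over positions: one value per kernel, so the output
    #    dimension equals the number of kernels.
    kernels = rng.normal(size=(NUM_KERNELS, KERNEL_WIDTH, EMBED_DIM))
    biases = rng.normal(size=NUM_KERNELS)
    feats = np.empty(NUM_KERNELS)
    for k in range(NUM_KERNELS):
        windows = [
            np.sum(seq[i:i + KERNEL_WIDTH] * kernels[k]) + biases[k]
            for i in range(len(words) - KERNEL_WIDTH + 1)
        ]
        feats[k] = max(windows)
    return feats

print(text_feature("the sound of rain falling on a quiet street"))
```

Under these assumptions the feature vector's dimension equals the number of kernels, which is one way to read the claim's "feature vector of a specified dimension"; the requirement that the kernels' convolution parameters differ is what lets each output component respond to a different textual pattern.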
CN201910604904.8A · Priority date: 2019-07-05 · Filing date: 2019-07-05 · Audio recommendation method, device and storage medium · Active · CN112182281B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910604904.8A (CN112182281B) | 2019-07-05 | 2019-07-05 | Audio recommendation method, device and storage medium


Publications (2)

Publication Number | Publication Date
CN112182281A (en) | 2021-01-05
CN112182281B (en) | 2023-09-19

Family

ID=73915959

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201910604904.8A (CN112182281B, active) | Audio recommendation method, device and storage medium | 2019-07-05 | 2019-07-05

Country Status (1)

Country | Link
CN (1) | CN112182281B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
CN113377971B (en)* | 2021-05-31 | 2024-02-27 | Beijing Dajia Internet Information Technology Co., Ltd. | Multimedia resource generation method and device, electronic equipment and storage medium
CN114302167B (en)* | 2021-09-01 | 2025-03-04 | Tencent Technology (Shenzhen) Co., Ltd. | Video music matching method, device, equipment and computer readable storage medium
CN113836343A (en)* | 2021-09-14 | 2021-12-24 | Shenzhen TCL New Technology Co., Ltd. | Audio recommendation method and device, electronic equipment and storage medium
CN114022822B (en)* | 2021-11-15 | 2025-04-11 | Beijing Dajia Internet Information Technology Co., Ltd. | Method, device, electronic device and computer-readable medium for obtaining video soundtrack
CN115662467B (en)* | 2022-09-30 | 2023-07-14 | Enping Xuanyin Electronic Technology Co., Ltd. | Music intelligent playing control system and method based on big data

Citations (7)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
US8737817B1 (en)* | 2011-02-08 | 2014-05-27 | Google Inc. | Music soundtrack recommendation engine for videos
CN103927366A (en)* | 2014-04-21 | 2014-07-16 | Soochow University | Method and system for automatically playing songs according to pictures
KR20140137030A (en)* | 2013-05-21 | 2014-12-02 | Samsung Electronics Co., Ltd. | Method and apparatus for providing information using messenger
CN108319686A (en)* | 2018-02-01 | 2018-07-24 | Peking University Shenzhen Graduate School | Adversarial cross-media retrieval method based on limited text space
CN108647350A (en)* | 2018-05-16 | 2018-10-12 | Army Engineering University of PLA | Image-text associated retrieval method based on two-channel network
CN109063163A (en)* | 2018-08-14 | 2018-12-21 | Tencent Technology (Shenzhen) Co., Ltd. | Music recommendation method, apparatus, terminal device and medium
CN109783655A (en)* | 2018-12-07 | 2019-05-21 | Xidian University | Cross-modal retrieval method, device, computer equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
US20100050064A1 (en)* | 2008-08-22 | 2010-02-25 | AT&T Labs, Inc. | System and method for selecting a multimedia presentation to accompany text
CN106202413B (en)* | 2016-07-11 | 2018-11-20 | Peking University Shenzhen Graduate School | Cross-media retrieval method


Also Published As

Publication Number | Publication Date
CN112182281A (en) | 2021-01-05

Similar Documents

Publication | Title
CN112182281B (en) | Audio recommendation method, device and storage medium
KR102155261B1 (en) | Wide and deep machine learning models
CN110234018B (en) | Multimedia content description generation method, training method, device, equipment and medium
US11849160B2 | Image analysis system
CN111090756A (en) | Artificial intelligence-based multi-target recommendation model training method and device
US11734348B2 | Intelligent audio composition guidance
CN107943877B (en) | Method and device for generating multimedia content to be played
CN107437111B (en) | Data processing method, medium, device and computing device based on neural network
CN112269943B (en) | Information recommendation system and method
US11900263B2 | Augmenting neural networks
CN112269942B (en) | Method, device and system for recommending object and electronic equipment
CN116703454A (en) | Target recommendation method and device
CN111026849A (en) | Data processing method and device
US20220245460A1 | Adaptive self-adversarial negative sampling for graph neural network training
WO2024152686A1 | Method and apparatus for determining recommendation index of resource information, device, storage medium and computer program product
CN114580790B (en) | Lifecycle stage prediction and model training method, device, medium and equipment
CN113886695B (en) | Resource recommendation method and device
CN115801980A (en) | Video generation method and device
CN113395538A (en) | Sound effect rendering method and device, computer readable medium and electronic equipment
CN111507471B (en) | Model training method, device, equipment and storage medium
CN114363664A (en) | Method and device for generating video collection title
CN113742593B (en) | Method and device for pushing information
US20200169588A1 | Methods and systems for managing distribution of online content based on content maturity
CN113849735B (en) | Game recommendation method and system based on hybrid model
US11934921B2 | Dynamic content rating assistant

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
