Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort shall fall within the scope of the invention.
The embodiments of the invention provide a content retrieval method, a content retrieval device, and a computer-readable storage medium. The content retrieval device may be integrated in an electronic device, which may be a server or a terminal.
The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN) acceleration services, big data, and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, or the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application.
For example, referring to fig. 1, taking the case that the content retrieval device is integrated in an electronic device as an example: after the electronic device obtains the content to be retrieved for retrieving the target content, when the content to be retrieved is video content, multi-modal feature extraction is performed on the video content to obtain the modal features of each modality, feature extraction is performed on the modal features of each modality to obtain the modal content features of each modality, the modal content features are fused to obtain the video feature of the video content, and the target text content corresponding to the video content is retrieved from the preset content set according to the video feature, so that the accuracy of content retrieval is improved.
It should be noted that, the content retrieval method provided by the embodiment of the present application relates to a computer vision technology in the field of artificial intelligence, that is, in the embodiment of the present application, feature extraction may be performed on text content and video content by using the computer vision technology of artificial intelligence, and based on the extracted features, target content may be screened out from a preset content set.
Artificial intelligence (AI) is a theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and other directions.
Computer vision (CV) technology is a science that studies how to make a machine "see"; more specifically, it replaces human eyes with a camera and a computer to perform machine vision tasks such as recognition, tracking, and measurement on a target, and further performs graphic processing so that the computer produces an image more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, intelligent transportation, and the like, as well as common biometric technologies such as face recognition and fingerprint recognition.
The following is a detailed description. The order in which the following embodiments are described is not intended to limit the preferred order of the embodiments.
This embodiment will be described from the perspective of a content retrieval apparatus, which may be integrated in an electronic device; the electronic device may be a server or a terminal, and the terminal may include a tablet computer, a notebook computer, a personal computer (PC), a wearable device, a virtual reality device, or another device capable of performing content retrieval.
A content retrieval method, comprising:
When the content to be retrieved is video content, multi-modal feature extraction is performed on the video content to obtain the modal features of each modality, feature extraction is performed on the modal features of each modality to obtain the modal content features of each modality, the modal content features are fused to obtain the video feature of the video content, and the target text content corresponding to the video content is retrieved from a preset content set according to the video feature.
As shown in fig. 2, the specific flow of the content retrieval method is as follows:
101. Acquiring the content to be retrieved for retrieving the target content.
The content to be retrieved may be understood as the content serving as the retrieval condition for retrieving the target content. The content type of the content to be retrieved may be various; for example, it may be text content or video content.
The manner of obtaining the content to be retrieved may be various, and specifically may be as follows:
For example, the content to be retrieved sent by the user through the terminal may be directly received, or the content to be retrieved may be obtained from a network or a third-party database. Alternatively, when the content to be retrieved occupies a large amount of storage or is large in quantity, a content retrieval request may be received, where the content retrieval request carries the storage address of the content to be retrieved, and the content to be retrieved is obtained from a memory, a cache, or a third-party database according to the storage address.
102. When the content to be retrieved is video content, multi-modal feature extraction is performed on the video content to obtain the modal features corresponding to a plurality of modalities.
The modal features may be understood as the feature information corresponding to each modality in the video content. The video content may include a plurality of modalities, for example, action, audio, scene, face, OCR, speech, entity, and/or the like.
The multi-modal feature extraction method for the video content may be various, and specifically may be as follows:
For example, a trained content retrieval model is used to perform multi-modal feature extraction on the video content to obtain the initial modal features of each modality in the video content; video frames are extracted from the video content; the trained content retrieval model is used to perform multi-modal feature extraction on the video frames to obtain the basic modal features of each video frame; the target modal features corresponding to each modality are screened out of the basic modal features; and the target modal features and the corresponding initial modal features are fused to obtain the modal features of each modality.
The video content and the video frames contain multiple modalities, and different feature extraction methods can be used for different modalities when performing multi-modal feature extraction on the video content and the video frames. For example, for the action modality, an S3D model (an action recognition model) pre-trained on an action recognition dataset can be used for feature extraction; for the audio modality, a pre-trained VGGish model (an audio feature extraction model) can be used; for the scene modality, a pre-trained DenseNet-161 model (a deep model) can be used; for the face modality, a pre-trained SSD model and a ResNet model can be used; for the speech modality, a Google API (a feature extraction interface) can be used; and for the entity modality, a pre-trained SENet-154 (a feature extraction network) can be used. The extracted initial modal features and basic modal features may include image features, expert features, temporal features, and the like.
The target modal features and the corresponding initial modal features may be fused in various manners. For example, the image features (F), expert features (E), and temporal features (T) in the target modal features and the initial modal features may be added to obtain the modal features (Ω) of each modality, as specifically shown in fig. 3. Alternatively, weighting coefficients of the target modal features and the initial modal features can be obtained, the target modal features and the initial modal features are weighted according to the weighting coefficients, and the weighted target modal features and initial modal features are fused to obtain the modal features of each modality.
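A minimal Python sketch of these two fusion options follows; the tensor shapes, the dimension of 512, and the helper name are illustrative assumptions rather than values taken from the embodiment:

```python
import torch

def fuse_modal_features(image_feat, expert_feat, temporal_feat, weights=None):
    """Fuse the image (F), expert (E) and temporal (T) features of one modality.

    All three tensors are assumed to share the shape (num_segments, dim).
    With weights=None the features are simply added; otherwise a weighted sum
    is used, mirroring the two fusion options described in the text.
    """
    if weights is None:
        return image_feat + expert_feat + temporal_feat
    w_f, w_e, w_t = weights
    return w_f * image_feat + w_e * expert_feat + w_t * temporal_feat

# Illustrative usage for a single modality (dimensions are assumptions).
F = torch.randn(8, 512)   # image features per video segment
E = torch.randn(8, 512)   # expert features from the pre-trained extractor
T = torch.randn(8, 512)   # temporal (position-in-time) features
omega = fuse_modal_features(F, E, T)                            # additive fusion
omega_weighted = fuse_modal_features(F, E, T, (0.5, 0.3, 0.2))  # weighted fusion
```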
The trained content retrieval model can be set according to the requirements of practical applications. In addition, it should be noted that the trained content retrieval model may be preset by maintenance personnel, or may be trained by the content retrieval device. That is, before the step of "performing multi-modal feature extraction on the video content by using the trained content retrieval model to obtain the initial modal features of each modality in the video content", the content retrieval method may further include:
Acquiring a content sample set, wherein the content sample set includes a video sample and a text sample, and the text sample includes at least one text word; performing multi-modal feature extraction on the video sample by using a preset content retrieval model to obtain the sample modal features of each modality; performing feature extraction on the sample modal features of each modality to obtain the sample modal content features of the video sample; fusing the sample modal content features to obtain the sample video features of the video sample; performing feature extraction on the text sample to obtain the sample text features and the text word features corresponding to each text word; and converging the preset content retrieval model according to the sample modal content features, the sample video features, the sample text features, and the text word features to obtain the trained content retrieval model. Specifically, the training may include the following steps:
S1, acquiring a content sample set.
Wherein the set of content samples includes a video sample and a text sample, the text sample including at least one text word.
The manner of acquiring the content sample set may be various, and specifically may be as follows:
For example, a video sample and a text sample may be directly obtained to form the content sample set. Alternatively, original video content and original text content may be obtained and sent to a labeling server; a matching tag between the original video content and the original text content returned by the labeling server is received and added to the original video content and the original text content to obtain the video sample and the text sample, and the video sample and the text sample are combined into the content sample set. Alternatively, when the number of content samples in the content sample set is large or the samples occupy a large amount of storage, a model training request may be received, where the model training request carries the storage address of the content sample set, and the content sample set is obtained from a memory, a cache, or a third-party database according to the storage address.
S2, performing multi-modal feature extraction on the video sample by using a preset content retrieval model to obtain the sample modal features of each modality.
For example, the preset content retrieval model is used to perform multi-modal feature extraction on the video sample to obtain the initial sample modal features of each modality in the video sample; video frames are extracted from the video sample; the preset content retrieval model is used to perform multi-modal feature extraction on the video frames to obtain the basic sample modal features of each video frame; the target sample modal features corresponding to each modality are screened out of the basic sample modal features; and the target sample modal features and the corresponding initial sample modal features are fused to obtain the sample modal features of each modality, as described above, and details are not repeated here.
S3, performing feature extraction on the sample modal features of each modality to obtain the sample modal content features of the video sample, and fusing the sample modal content features to obtain the sample video features of the video sample.
For example, according to the modality of the sample modal features, a target video feature extraction network corresponding to each modality is identified in the video feature extraction networks of the preset content retrieval model, and the target video feature extraction network is used to perform feature extraction on the sample modal features to obtain the sample modal content features corresponding to each modality. The sample modal content features are then fused to obtain the sample video features of the video sample.
Since the modality handled by each video feature extraction network of the preset content retrieval model is fixed, the video feature extraction network corresponding to a modality can be identified simply according to the modality of the sample modal features, and the identified video feature extraction network is used as the target video feature extraction network.
After the target video feature extraction network is identified, it may be used to perform feature extraction on the sample modal features. There may be various feature extraction processes; for example, the target video feature extraction network may be the encoder of a modality-specific Transformer (a conversion network), which encodes the sample modal features so as to extract the sample modal content features of each modality.
After the sample modal content features are extracted, they can be fused. For example, the sample modal content features of each modality can be combined to obtain a sample modal content feature set of the video sample, the sample modal content feature set is input into a Transformer for encoding so as to calculate the association weights of the sample modal content features, the sample modal content features are weighted according to the association weights, and the weighted sample modal content features are fused to obtain the sample video features of the video sample.
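A compact PyTorch sketch of this two-stage encoding, in which modality-specific Transformer encoders produce the modal content features and a cross-modality Transformer computes the association weights and fuses them into a video feature; the layer counts, pooling choices, and class names are illustrative assumptions only:

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Modality-specific Transformer encoders followed by a cross-modality Transformer."""

    def __init__(self, modalities, dim=512, heads=8, layers=2):
        super().__init__()
        # One encoder per modality produces the (sample) modal content features.
        self.modality_encoders = nn.ModuleDict({
            m: nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
                num_layers=layers)
            for m in modalities})
        # A shared Transformer attends across modalities (the association weights).
        self.cross_modal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=layers)

    def forward(self, modal_feats):
        # modal_feats: {modality name: tensor of shape (batch, segments, dim)}
        content = []
        for name, feats in modal_feats.items():
            encoded = self.modality_encoders[name](feats)   # modal content features
            content.append(encoded.mean(dim=1))             # pool segments -> (batch, dim)
        tokens = torch.stack(content, dim=1)                # (batch, n_modalities, dim)
        fused = self.cross_modal(tokens)                    # weight and mix the modalities
        return fused.mean(dim=1), tokens                    # video feature, per-modality features

# Illustrative usage with three modalities and random features.
encoder = VideoEncoder(["action", "audio", "scene"])
feats = {m: torch.randn(2, 8, 512) for m in ["action", "audio", "scene"]}
video_feature, modal_content_features = encoder(feats)      # shapes (2, 512) and (2, 3, 512)
```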
S4, performing feature extraction on the text sample to obtain the sample text features and the text word features corresponding to each text word, and converging the preset content retrieval model according to the sample modal content features, the sample video features, the sample text features, and the text word features to obtain the trained content retrieval model.
For example, a text feature extraction network of the preset content retrieval model is used to perform feature extraction on the text sample to obtain the sample text features of the text sample and the text word features of the text words, and then the preset content retrieval model is converged according to the sample modal content features, the sample video features, the sample text features, and the text word features to obtain the trained content retrieval model.
The text sample may be feature-extracted in various ways. For example, a text encoder may be used to perform feature extraction on the text sample to obtain the sample text features and the text word features; the text encoder may be of various types, for example, BERT (a text encoder) or word2vec (a word vector generation model), and so on.
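As an illustration of such a text encoder, the following hedged sketch uses a BERT model from the Hugging Face transformers library to obtain both the per-word features and a sentence-level text feature; the model name and the mean-pooling choice are assumptions, not requirements of the embodiment:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_text(sentence):
    """Return a sentence-level text feature and one feature per (sub)word token."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    word_features = outputs.last_hidden_state[0]   # text word features, one per token
    text_feature = word_features.mean(dim=0)       # sentence feature via mean pooling (assumed)
    return text_feature, word_features

text_feature, word_features = encode_text("a man plays the guitar on stage")
# In practice a learned projection would typically map these 768-d features into the
# shared embedding space used by the video branch.
```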
After the sample text features and the text word features are extracted, the preset content retrieval model can be converged according to the sample modal content features, the sample video features, the sample text features, and the text word features. Various convergence manners can be adopted, which may specifically be as follows:
For example, feature loss information of a content sample set may be determined according to sample modal content features and text word features, content loss information of the content sample set may be determined based on sample video features and sample text features, the feature loss information and the content loss information may be fused, and a preset content retrieval model may be converged based on the fused loss information to obtain a trained content retrieval model, which may be specifically as follows:
(1) Determining the feature loss information of the content sample set according to the sample modal content features and the text word features.
For example, feature similarity between the sample modal content feature and the text word feature can be calculated to obtain first feature similarity, sample similarity between the video sample and the text sample is determined according to the first feature similarity, and feature distance between the video sample and the text sample is calculated based on the sample similarity to obtain feature loss information of the content sample set.
There may be various ways of calculating the feature similarity between the sample modal content features and the text word features. For example, the cosine similarity between a sample modal content feature and a text word feature may be calculated and used as the first feature similarity, as shown in formula (1):
S_ij = (w_i · v_j) / (||w_i|| ||v_j||)    (1)
where S_ij is the first feature similarity, w_i is a text word feature, and v_j is a sample modal content feature.
After the first feature similarity is calculated, the sample similarity between the video sample and the text sample can be determined according to the first feature similarity in various ways. For example, feature interaction can be performed between the sample modal content features and the text word features according to the first feature similarity to obtain post-interaction video features and post-interaction text word features, the feature similarity between the post-interaction video features and the post-interaction text word features is calculated to obtain a second feature similarity, and the second feature similarity is fused to obtain the sample similarity between the video sample and the text sample.
There may be various ways of performing feature interaction between the sample modal content features and the text word features. For example, the first feature similarity can be normalized to obtain a target feature similarity; the association weights of the sample modal content features are determined according to the target feature similarity, where an association weight indicates the association relationship between a sample modal content feature and a text word feature; the sample modal content features are weighted based on the association weights; and the text word features are updated based on the weighted sample modal content features to obtain the post-interaction video features and the post-interaction text word features.
There may be various ways of normalizing the first feature similarity. For example, the first feature similarity may be normalized by using an activation function; the activation function may be of various kinds, for example, ReLU(x) = max(0, x), and the normalization process may be as shown in formula (2), where S̄_ij is the target feature similarity, S_ij is the first feature similarity, and ReLU is the activation function.
There may be various ways of determining the association weights of the sample modal content features according to the target feature similarity. For example, a preset association parameter may be obtained and fused with the target feature similarity to obtain the association weights; an association weight can also be understood as an attention weight, as specifically shown in formula (3), where a_ij is an association weight, λ is the preset association parameter, which may be the inverse temperature parameter of a softmax, and S̄_ij is the target feature similarity.
After the association weights of the sample modal content features are determined, the sample modal content features are weighted based on the association weights, and the weighted sample modal content features are fused to obtain the weighted modal content features, which serve as the initial post-interaction video features of the video sample, as specifically shown in formula (4), where a_i is an initial post-interaction video feature, a_ij is an association weight, and v_j is a sample modal content feature.
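Since the bodies of formulas (1) to (4) are not reproduced here, the following sketch implements one such cross-attention step in a commonly used stacked-cross-attention style (cosine similarity, ReLU-based normalization, softmax with inverse temperature λ, and a weighted sum); the normalization detail and the λ value are assumptions:

```python
import torch
import torch.nn.functional as F

def attend_video_to_words(word_feats, modal_feats, lam=4.0):
    """One cross-attention step from text words to modal content features.

    word_feats:  (n_words, dim) text word features w_i
    modal_feats: (n_modal, dim) sample modal content features v_j
    Returns one attended (initial post-interaction) video feature a_i per word.
    """
    # Cosine similarity between every word and every modal content feature, cf. formula (1).
    sim = F.cosine_similarity(word_feats.unsqueeze(1), modal_feats.unsqueeze(0), dim=-1)
    # Keep positive similarities and re-normalise them (assumed form of formula (2)).
    sim = F.relu(sim)
    sim = sim / (sim.norm(dim=0, keepdim=True) + 1e-8)
    # Association (attention) weights with inverse temperature lam, cf. formula (3).
    attn = F.softmax(lam * sim, dim=1)            # (n_words, n_modal)
    # Weighted sum of modal content features, cf. formula (4).
    return attn @ modal_feats                      # (n_words, dim)

a = attend_video_to_words(torch.randn(6, 512), torch.randn(7, 512))
```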
After the initial post-interaction video features are calculated, the text word features can be updated based on the initial post-interaction video features to obtain the post-interaction video features and post-interaction text word features. For example, the text word features can be updated based on the initial post-interaction video features to obtain initial post-interaction text word features; the feature similarity between the initial post-interaction video features and the initial post-interaction text word features is calculated to obtain a third feature similarity; and the initial post-interaction video features and the initial post-interaction text word features are updated according to the third feature similarity to obtain the post-interaction video features and the post-interaction text word features.
There may be various ways of updating the text word features based on the initial post-interaction video features. For example, preset update parameters can be obtained, and the preset update parameters, the initial post-interaction video features, and the text word features are fused to obtain the initial post-interaction text word features, as specifically shown in formula (5), where F(w_i, a_i) is an initial post-interaction text word feature, w_i is a text word feature, a_i is an initial post-interaction video feature, g_i is a gate operation for selecting the most useful information, o_i is a fusion feature for enhancing the interaction between the text word feature and the initial post-interaction video feature, and F_g, b_g, F_o, and b_o are the preset update parameters. A multi-step operation (multiple feature interactions) involves multiple updates of the text word features, so formula (5) can be integrated and denoted F_a; the formula for K feature interactions (mutual cross-attention operations) is then shown in formula (6), where K denotes the K-th interaction, W^K and W^(K-1) denote the text word features after the K-th and (K-1)-th interactions respectively, A^K is the post-interaction video feature of the K-th interaction, and V_rep is the set of modal content features (the modal video features).
There may be various ways of updating the initial post-interaction video features and the initial post-interaction text word features according to the third feature similarity. For example, feature interaction can be performed between the initial post-interaction video features and the initial post-interaction text word features according to the third feature similarity to obtain target post-interaction video features and target post-interaction text word features; the target post-interaction video features are used as the initial post-interaction video features, and the target post-interaction text word features are used as the initial post-interaction text word features; and the step of calculating the feature similarity between the initial post-interaction video features and the initial post-interaction text word features is performed again until the number of feature interactions between them reaches a preset number, so as to obtain the post-interaction video features and the post-interaction text word features.
This feature interaction process can be regarded as a multi-step cross-attention calculation that yields the post-interaction video features and the post-interaction text word features. The number of feature interactions may be set according to the requirements of the practical application.
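Because the bodies of formulas (5) and (6) are not reproduced here, the following sketch is only a plausible reconstruction of the gated update and the K-step interaction; it reuses the attend_video_to_words helper from the previous sketch, and the gating scheme, dimensions, and K = 2 are assumptions:

```python
import torch
import torch.nn as nn

class GatedTextUpdate(nn.Module):
    """Plausible gated word-feature update standing in for formula (5).

    The gate g_i selects between the old word feature and a fused feature o_i;
    the linear layers play the role of the preset update parameters F_g, b_g,
    F_o, b_o.  In practice these parameters would be learned during training.
    """
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)   # F_g, b_g
        self.fuse = nn.Linear(2 * dim, dim)   # F_o, b_o

    def forward(self, w, a):
        x = torch.cat([w, a], dim=-1)
        g = torch.sigmoid(self.gate(x))       # g_i: how much new information to keep
        o = torch.tanh(self.fuse(x))          # o_i: fused word/video information
        return g * o + (1.0 - g) * w          # updated (post-interaction) word features

def interact(word_feats, modal_feats, steps=2, dim=512):
    """K steps of mutual cross attention, cf. formula (6), with K = steps."""
    update = GatedTextUpdate(dim)
    w = word_feats
    for _ in range(steps):
        # attend_video_to_words is the helper from the previous sketch.
        a = attend_video_to_words(w, modal_feats)
        w = update(w, a)
    return w, a   # post-interaction text word features and video features
```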
After the post-interaction video features and the post-interaction text word features are obtained, the sample similarity between the video sample and the text sample can be calculated in various ways. For example, the feature similarity between the post-interaction video features and the post-interaction text word features can be calculated to obtain the second feature similarity, and the second feature similarity is fused to obtain the sample similarity between the video sample and the text sample, as shown in formula (7), where S(T, V_rep) is the sample similarity, w_i^K is a post-interaction text word feature, and a_i^K is a post-interaction video feature.
After the sample similarity is calculated, the feature distance between the video sample and the text sample can be calculated to obtain the feature loss information of the content sample set. For example, a preset feature boundary value corresponding to the content sample set can be obtained; first content sample pairs in which the video sample and the text sample match, and second content sample pairs in which they do not match, are screened out of the content sample set according to the sample similarity; and the feature distance between the first content sample pairs and the second content sample pairs is calculated based on the preset feature boundary value to obtain the feature loss information of the content sample set.
There may be various ways of screening the first content sample pairs and the second content sample pairs out of the content sample set according to the sample similarity. For example, the sample similarity can be compared with a preset similarity threshold; a video sample and a corresponding text sample whose sample similarity exceeds the preset similarity threshold are screened out of the content sample set to obtain a first content sample pair, and a video sample and a corresponding text sample whose sample similarity does not exceed the preset similarity threshold are screened out of the content sample set to obtain a second content sample pair.
After the first content sample pairs and the second content sample pairs are screened out, the feature distance between them can be calculated in various ways. For example, the content sample pair with the largest sample similarity can be screened out of the second content sample pairs to obtain a target content sample pair; the similarity difference between the sample similarity of the first content sample pair and that of the target content sample pair is calculated to obtain a first similarity difference; and the preset feature boundary value and the first similarity difference are fused to obtain the feature loss information of the content sample set, as shown in formula (8), where L_Tri is the feature loss information, Δ is the preset feature boundary value, B is the number of content samples, b denotes a matched pair of a video sample and a text sample, and b̂ denotes a hard negative sample, i.e., the video sample or text sample in the content sample pair with the largest sample similarity among the second content sample pairs. It can be seen that there may be two target content sample pairs (a hardest negative text sample for the video sample and a hardest negative video sample for the text sample); after the preset feature boundary value and the first similarity difference are fused, the fused similarity difference may be further normalized, and the normalized similarity differences are fused again to obtain the feature loss information of the content sample set.
The feature loss information can be regarded as the loss information used for back propagation and parameter updating with a triplet loss (a loss function); it mainly serves to shorten the distance between matched video and text in the feature space.
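As a hedged illustration of this triplet-style feature loss with hardest negatives (the body of formula (8) is not reproduced here, so the standard formulation below is an assumption):

```python
import torch

def triplet_loss_hard_negative(S, margin=0.2):
    """Hard-negative triplet loss over a batch similarity matrix.

    S[i, j] is the sample similarity between video i and text j; matched pairs
    sit on the diagonal.  For each positive pair the hardest unmatched text and
    the hardest unmatched video are used as negatives.
    """
    B = S.size(0)
    pos = S.diag()                                          # S(b, b) for matched pairs
    mask = torch.eye(B, dtype=torch.bool, device=S.device)
    neg = S.masked_fill(mask, float("-inf"))
    hardest_text = neg.max(dim=1).values                    # hardest negative text per video
    hardest_video = neg.max(dim=0).values                   # hardest negative video per text
    loss = (torch.clamp(margin + hardest_text - pos, min=0)
            + torch.clamp(margin + hardest_video - pos, min=0))
    return loss.mean()

feature_loss = triplet_loss_hard_negative(torch.randn(8, 8), margin=0.2)
```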
(2) Content loss information for a set of content samples is determined based on the sample video features and the sample text features.
For example, the feature similarity between the sample video features and the sample text features can be calculated to obtain the content similarity between the video sample and the text sample; third content sample pairs in which the video sample and the text sample match, and fourth content sample pairs in which they do not match, are screened out of the content sample set according to the content similarity; a preset content boundary value corresponding to the content sample set is obtained; and the content difference between the third content sample pairs and the fourth content sample pairs is calculated according to the preset content boundary value to obtain the content loss information of the content sample set.
There may be various ways of calculating the content difference between the third content sample pairs and the fourth content sample pairs to obtain the content loss information of the content sample set. For example, the similarity difference between the content similarity of a third content sample pair and the content similarity of a fourth content sample pair may be calculated to obtain a second similarity difference; the second similarity difference is fused with the preset content boundary value to obtain the content difference between the third content sample pair and the fourth content sample pair; and the content difference is normalized to obtain the content loss information of the content sample set, as shown in formula (9), where L_mar is the content loss information, B is the number of content samples, b denotes a matched pair of a video sample and a text sample, d indexes the content samples other than b in the content sample set, Θ is the preset content boundary value, S_h is the content similarity, t_b is a sample text feature, and v_b is the video feature of the matched video sample. The content loss information can be regarded as the loss information used for back propagation and parameter updating with a bidirectional max-margin ranking loss (a loss function).
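Similarly, a standard bidirectional max-margin ranking loss can serve as a hedged stand-in for formula (9), whose body is not reproduced here; the exact normalization used in the embodiment may differ:

```python
import torch

def bidirectional_max_margin_loss(S, margin=0.2):
    """Bidirectional max-margin ranking loss over content similarities.

    S[i, j] is the content similarity between video feature i and text feature j,
    with matched pairs on the diagonal; every non-matching sample contributes a
    hinge term in both retrieval directions.
    """
    B = S.size(0)
    pos = S.diag().view(B, 1)
    mask = torch.eye(B, dtype=torch.bool, device=S.device)
    cost_v2t = torch.clamp(margin + S - pos, min=0).masked_fill(mask, 0)      # video -> text
    cost_t2v = torch.clamp(margin + S - pos.t(), min=0).masked_fill(mask, 0)  # text -> video
    return (cost_v2t.sum() + cost_t2v.sum()) / B

content_loss = bidirectional_max_margin_loss(torch.randn(8, 8))
```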
(3) Fusing the feature loss information and the content loss information, and converging the preset content retrieval model based on the fused loss information to obtain the trained content retrieval model.
For example, a preset balance parameter may be obtained, the preset balance parameter is fused with the feature loss information to obtain balanced feature loss information, and the balanced feature loss information and the content loss information are added to obtain the fused loss information, as shown in formula (10):
L = L_mar + β * L_Tri    (10)
where L is the fused loss information, L_mar is the content loss information, L_Tri is the feature loss information, and β is the preset balance parameter used to balance the scales of the two loss functions.
Optionally, weighting parameters of the feature loss information and the content loss information can be obtained, the feature loss information and the content loss information are weighted based on the weighting parameters, and the weighted feature loss information and the weighted content loss information are fused to obtain the fused loss information.
After the fused loss information is obtained, the preset content retrieval model can be converged based on the fused loss information in various ways. For example, the network parameters in the preset content retrieval model can be updated by a gradient descent algorithm according to the fused loss information so as to converge the preset content retrieval model and obtain the trained content retrieval model; alternatively, another optimization algorithm can be used to update the network parameters in the preset content retrieval model with the fused loss information so as to converge the preset content retrieval model and obtain the trained content retrieval model.
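A minimal sketch of one such convergence step, combining the two loss sketches above into the fused loss of formula (10) and performing a gradient-based parameter update; the function and parameter names are illustrative, and the similarity matrices are assumed to be produced by the model whose parameters the optimizer holds:

```python
import torch

def train_step(optimizer, word_sim_matrix, content_sim_matrix, beta=1.0):
    """One convergence step on the fused loss of formula (10).

    The two loss helpers are the sketches given earlier; `optimizer` holds the
    parameters of the content retrieval model, and the similarity matrices must
    be computed by that model so gradients flow back into it.
    """
    loss_tri = triplet_loss_hard_negative(word_sim_matrix)        # feature loss L_Tri
    loss_mar = bidirectional_max_margin_loss(content_sim_matrix)  # content loss L_mar
    loss = loss_mar + beta * loss_tri                             # fused loss L
    optimizer.zero_grad()
    loss.backward()    # back-propagate the fused loss
    optimizer.step()   # gradient-based update of the network parameters
    return loss.item()
```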
In the training process of the content retrieval model, multi-step cross-attention calculation and content similarity calculation are performed between the text sample and the video sample, and back propagation and parameter updating are performed with the triplet loss and the bidirectional max-margin ranking loss respectively, so as to obtain the trained content retrieval model, as shown in fig. 4.
103. Performing feature extraction on the modal features corresponding to each modality to obtain the modal content features corresponding to each modality.
The modal content features may be understood as overall features of the content of each modality, used to indicate the content characteristics under that modality.
The modal features may be feature-extracted in various ways, which may specifically be as follows:
For example, according to the modality of the modal features, a target video feature extraction network corresponding to each modality can be identified in the video feature extraction networks of the trained content retrieval model, and the target video feature extraction network is used to perform feature extraction on the modal features to obtain the modal content features corresponding to each modality.
Since the modality handled by each video feature extraction network of the trained content retrieval model is fixed, the video feature extraction network corresponding to a modality can be identified simply according to the modality of the modal features, and the identified video feature extraction network is used as the target video feature extraction network.
After the target video feature extraction network is identified, it may be used to perform feature extraction on the modal features. There may be various feature extraction processes; for example, the target video feature extraction network may be the encoder of a modality-specific Transformer, which encodes the modal features so as to extract the modal content features corresponding to each modality of the video content.
104. Fusing the modal content features to obtain the video feature of the video content, and retrieving the target text content corresponding to the video content from the preset content set according to the video feature.
The modal content features may be fused in various manners, which may specifically be as follows:
For example, the modal content features of each modality may be combined to obtain a modal content feature set of the video content; the modal content feature set is input into a Transformer model for encoding so as to calculate the association weights of the modal content features; the modal content features are weighted according to the association weights; and the weighted modal content features are fused to obtain the video feature of the video content. Alternatively, a weighting parameter corresponding to each modality is obtained, the modal content features are weighted based on the weighting parameters, and the weighted modal content features are fused to obtain the video feature of the video content. Alternatively, the modal content features are directly concatenated to obtain the video feature of the video content.
After the video feature of the video content is obtained, the target text content corresponding to the video content can be retrieved from the preset content set according to the video feature in various ways. For example, the feature similarity between the video feature and the text features of the candidate text contents in the preset content set can be calculated respectively, and the target text content corresponding to the video content is screened out of the candidate text contents according to the feature similarity.
The text features of the candidate text contents may be extracted in various ways. For example, a text encoder may be used to perform feature extraction on a candidate text content to obtain its text features; the text encoder may be of various types, for example, BERT or word2vec. Alternatively, features may be extracted for each text word in the candidate text content, the association weights between the text words are calculated, and the text word features are weighted based on the association weights to obtain the text features of the candidate text content. The text features of the candidate text contents in the preset content set can also be extracted at various times. For example, they may be extracted in real time, i.e., when the acquired content to be retrieved is video content; alternatively, the text features of the candidate text contents in the preset content set may be extracted before the content to be retrieved is acquired, so that the feature similarity between the text features and the video feature can be calculated offline and the target text content corresponding to the video content can be screened out of the candidate text contents more quickly.
The method for calculating the feature similarity between the video feature and the text feature of the candidate text content may also be various, for example, cosine similarity between the video feature and the text feature of the candidate text content may be calculated, so that the feature similarity may be obtained, or feature distance between the video feature and the text feature of the candidate text content may also be calculated, and the feature similarity between the video feature and the text feature may be determined according to the feature distance.
After the feature similarity is calculated, the target text content corresponding to the video content can be screened out of the candidate text contents according to the feature similarity. For example, the candidate text contents whose feature similarity exceeds a preset similarity threshold can be screened out of the candidate text contents, the screened candidate text contents are sorted, and the sorted candidate text contents are used as the target text content corresponding to the video content. Alternatively, the candidate text contents can be sorted according to the feature similarity, and the target text content corresponding to the video content is screened out of the sorted candidate text contents. There may be one or more screened target text contents: when there is one target text content, the candidate text content with the largest feature similarity to the video feature is used as the target text content; when there are multiple target text contents, the top-N candidate text contents with the highest feature similarity to the video feature can be screened out of the sorted candidate text contents as the target text contents.
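A short sketch of this final retrieval step, assuming the candidate text features have already been extracted (possibly offline) in the same embedding space as the video feature; the dimension and the value of N are illustrative:

```python
import torch
import torch.nn.functional as F

def retrieve_top_n(video_feature, candidate_text_features, n=5):
    """Rank candidate text contents for one video feature by cosine similarity.

    video_feature:           (dim,)            video feature of the content to be retrieved
    candidate_text_features: (num_texts, dim)  text features, possibly pre-computed offline
    Returns the indices and similarity scores of the top-n candidates.
    """
    sims = F.cosine_similarity(video_feature.unsqueeze(0), candidate_text_features, dim=-1)
    top = sims.topk(min(n, sims.numel()))
    return top.indices.tolist(), top.values.tolist()

# Illustrative usage: 100 candidate texts, keep the 5 most similar.
indices, scores = retrieve_top_n(torch.randn(512), torch.randn(100, 512), n=5)
```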
Optionally, when the content to be retrieved is text content, feature extraction may be further performed on the text content, and according to the extracted text feature, the target video content corresponding to the text content may be retrieved from the preset content set, which may specifically be as follows:
For example, when the content to be retrieved is text content, the text content is extracted by adopting a text feature extraction network of the trained content retrieval model, so as to obtain text features of the text content. And respectively calculating the feature similarity between the text feature and the video feature of the candidate video content in the preset content set, and screening out target video content corresponding to the text content from the candidate video content according to the feature similarity.
The text content may be feature-extracted in various ways. For example, a text encoder may be used to extract the overall features of the text content to obtain the text features; the text encoder may be of various types, for example, BERT or word2vec. Alternatively, features may be extracted for each text word in the text content, the association weights between the text words are calculated, and the text word features are weighted based on the association weights to obtain the text features.
After extracting the text features of the text content, the feature similarity between the text features and the video features can be calculated, and the feature similarity can be calculated in various manners, for example, feature extraction can be performed on candidate video contents in a preset content set to obtain video features of each candidate video content, and then cosine similarity between the text features and the video features is calculated, so that the feature similarity can be obtained.
The video features of the candidate video contents may be extracted in various ways. For example, the trained content retrieval model may be used to perform multi-modal feature extraction on a candidate video content to obtain the modal features corresponding to a plurality of modalities, feature extraction is performed on the modal features corresponding to each modality to obtain the modal content features corresponding to each modality, and the modal content features are fused to obtain the video feature of each candidate video content. The video features of the candidate video contents can also be extracted at various times. For example, they may be extracted in real time, i.e., each time the content to be retrieved is obtained; alternatively, feature extraction may be performed on each candidate video content in the preset content set before the content to be retrieved is obtained, so that the feature similarity between the text features and the video features can be calculated offline and the target video content corresponding to the text content can be screened out of the candidate video contents more quickly.
There may be multiple manners of screening the target video content corresponding to the text content out of the candidate video contents according to the feature similarity. For example, the candidate video contents whose feature similarity exceeds the preset similarity threshold are screened out of the candidate video contents, the screened candidate video contents are sorted, and the sorted candidate video contents are used as the target video content corresponding to the text content. Alternatively, the candidate video contents may be sorted according to the feature similarity, and the target video content corresponding to the text content is screened out of the sorted candidate video contents. There may be one or more screened target video contents: when there is one target video content, the candidate video content with the largest feature similarity to the text features may be used as the target video content; when there are multiple target video contents, the top-N candidate video contents with the highest feature similarity to the text features may be selected from the sorted candidate video contents as the target video contents. In this scheme, the multi-modal information in the video is better extracted and the more important words in the retrieval text receive more attention, so that a better retrieval result is achieved. On the MSR-VTT, LSMDC, and ActivityNet datasets, the content retrieval performance is greatly improved compared with the current mainstream methods, and the results are shown in Tables 1, 2, and 3. In the tables, R1, R5, R10, and R50 denote the recall at rank 1, 5, 10, and 50, respectively, and MdR and MnR denote the median rank and mean rank, respectively.
Table 1. Results on the MSR-VTT dataset
Table 2. Results on the LSMDC dataset
Table 3. Results on the ActivityNet dataset
As can be seen from the above, in the embodiments of the present application, after the content to be retrieved for retrieving the target content is obtained, when the content to be retrieved is video content, multi-modal feature extraction is performed on the video content to obtain the modal features of each modality, feature extraction is performed on the modal features of each modality to obtain the modal content features of each modality, the modal content features are fused to obtain the video feature of the video content, and the target text content corresponding to the video content is retrieved from the preset content set according to the video feature, so that the accuracy of content retrieval is improved.
According to the method described in the above embodiments, examples are described in further detail below.
In this embodiment, description will be made by taking an example in which the content retrieval device is specifically integrated in an electronic device and the electronic device is a server.
The server trains the content retrieval model
C1, the server acquires a content sample set.
For example, the server may directly obtain a video sample and a text sample to form the content sample set; or it may obtain original video content and original text content, send them to a labeling server, receive a matching tag between the original video content and the original text content returned by the labeling server, and add the matching tag to the original video content and the original text content to obtain the video sample and the text sample, which are then combined into the content sample set; or, when the number of content samples in the content sample set is large or the samples occupy a large amount of storage, a model training request may be received, where the model training request carries the storage address of the content sample set, and the content sample set is obtained from a memory, a cache, or a third-party database according to the storage address.
C2, the server performs multi-modal feature extraction on the video sample by using a preset content retrieval model to obtain the sample modal features of each modality.
For example, the server performs multi-modal feature extraction on the video sample by using the preset content retrieval model to obtain the initial sample modal features of each modality in the video sample, extracts video frames from the video sample, performs multi-modal feature extraction on the video frames by using the preset content retrieval model to obtain the basic sample modal features of each video frame, screens the target sample modal features corresponding to each modality out of the basic sample modal features, and fuses the target sample modal features with the corresponding initial sample modal features to obtain the sample modal features of each modality.
C3, the server performs feature extraction on the sample modal features of each modality to obtain the sample modal content features of the video sample, and fuses the sample modal content features to obtain the sample video features of the video sample.
For example, the server identifies, according to the modality of the sample modal features, the Transformer network corresponding to each modality in the video feature extraction networks of the preset content retrieval model, and uses the encoder of that Transformer network to encode the sample modal features, thereby extracting the sample modal content features of each modality. The sample modal content features of each modality are combined to obtain a sample modal content feature set of the video sample; the sample modal content feature set is input into the overall Transformer network for encoding so as to calculate the association weights of the sample modal content features; the sample modal content features are weighted according to the association weights; and the weighted sample modal content features are fused to obtain the sample video features of the video sample.
C4, the server performs feature extraction on the text sample to obtain the sample text features and the text word features corresponding to each text word, and converges the preset content retrieval model according to the sample modal content features, the sample video features, the sample text features, and the text word features to obtain the trained content retrieval model.
For example, the server may use a text encoder such as BERT or word2vec to perform feature extraction on the text sample to obtain the sample text features and the text word features. The feature loss information of the content sample set is then determined according to the sample modal content features and the text word features, the content loss information of the content sample set is determined based on the sample video features and the sample text features, the feature loss information and the content loss information are fused, and the preset content retrieval model is converged based on the fused loss information to obtain the trained content retrieval model, which specifically includes the following steps:
(1) The server determines the feature loss information of the content sample set according to the sample modal content features and the text word features.
For example, the server may calculate the cosine similarity between the sample modal content features and the text word features and use it as the first feature similarity, as shown in formula (1). The first feature similarity is normalized with an activation function, which may be of various kinds, for example, ReLU(x) = max(0, x); the normalization process is shown in formula (2) and yields the target feature similarity. A preset association parameter is obtained and fused with the target feature similarity to obtain the association weights, which can also be understood as attention weights, as shown in formula (3). The sample modal content features are weighted based on the association weights, and the weighted sample modal content features are fused to obtain the weighted modal content features, which serve as the initial post-interaction video features of the video sample, as shown in formula (4).
After calculating the video characteristics after the initial interaction, the server can acquire preset updating parameters, and fuse the preset updating parameters, the video characteristics after the initial interaction and the text word characteristics to obtain the text word characteristics after the initial interaction, wherein the text word characteristics after the initial interaction can be specifically shown as a formula (5). Calculating the feature similarity of the video feature after initial interaction and the text word feature after initial interaction to obtain a third feature similarity, carrying out feature interaction on the video feature after initial interaction and the text word feature after initial interaction according to the third feature similarity to obtain a target video feature after target interaction and the text word feature after target interaction, taking the video feature after target interaction as the video feature after initial interaction, taking the text word feature after target interaction as the text word feature after initial interaction, and returning to the step of executing the feature similarity of the video feature after initial interaction and the text word feature after initial interaction until the feature interaction times of the video feature after initial interaction and the text word feature after initial interaction reach the preset times to obtain the video feature after interaction and the text word feature after interaction.
After the server obtains the video feature after interaction and the text word feature after interaction, the server can calculate the feature similarity between the video feature after interaction and the text word feature after interaction to obtain a second feature similarity, and the second feature similarity is fused to obtain the sample similarity between the video sample and the text sample, as shown in a formula (7). Comparing the sample similarity with a preset similarity threshold, screening video samples and corresponding text samples, wherein the sample similarity of the video samples and the corresponding text samples exceeds the preset similarity threshold, so that a first content sample pair can be obtained, and screening video samples and corresponding text samples, wherein the sample similarity of the video samples and the corresponding text samples does not exceed the preset similarity threshold, in the content sample set, so that a second content sample pair can be obtained. Obtaining a preset characteristic boundary value corresponding to the content sample set, screening out a content sample pair with the maximum sample similarity from the second content sample pair to obtain a target content sample pair, calculating a similarity difference value between the sample similarity of the first content sample pair and the sample similarity of the target content sample pair to obtain a first similarity difference value, and fusing the preset characteristic boundary value and the first similarity difference value to obtain characteristic loss information of the content sample set, as shown in a formula (8).
(2) The server determines content loss information for the set of content samples based on the sample video features and the sample text features.
For example, the server may calculate the feature similarity between the sample video features and the sample text features to obtain the content similarity between the video sample and the text sample, screen out of the content sample set, according to the content similarity, the third content sample pairs in which the video sample and the text sample match and the fourth content sample pairs in which they do not match, and obtain a preset content boundary value corresponding to the content sample set. The similarity difference between the content similarity of a third content sample pair and the content similarity of a fourth content sample pair is calculated to obtain a second similarity difference, the second similarity difference is fused with the preset content boundary value to obtain the content difference between the third content sample pair and the fourth content sample pair, and the content difference is normalized to obtain the content loss information of the content sample set, as shown in formula (9).
(3) The server fuses the feature loss information and the content loss information, and converges the preset content retrieval model based on the fused loss information to obtain the trained content retrieval model.
For example, the server may obtain a preset balance parameter, fuse the preset balance parameter with the feature loss information to obtain balanced feature loss information, and add the balanced feature loss information to the content loss information to obtain fused loss information, as shown in formula (10). And then, according to the fused loss information, updating network parameters in the preset content retrieval model by adopting a gradient descent algorithm so as to converge the preset content retrieval model to obtain a trained content retrieval model, or adopting other algorithms, and updating the network parameters in the preset content retrieval model by adopting the fused loss information so as to converge the preset content retrieval model to obtain the trained content retrieval model.
As shown in fig. 5, a content retrieval method specifically includes the following steps:
201. The server acquires the content to be retrieved for retrieving the target content.
For example, the server may directly receive the content to be retrieved sent by the user through a terminal, or may acquire the content to be retrieved from a network or a third-party database; alternatively, when the content to be retrieved occupies a large amount of storage or there is a large number of contents to be retrieved, the server may receive a content retrieval request that carries a storage address of the content to be retrieved, and acquire the content to be retrieved from a memory, a cache or a third-party database according to the storage address.
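As a hedged illustration only (the dictionary-based request format, field names and stores are hypothetical, not part of the method), the two acquisition paths could be handled along these lines:

```python
def resolve_content(request: dict, cache: dict, storage: dict) -> bytes:
    """Return the content to be retrieved, either sent directly or fetched by address."""
    if "content" in request:                 # small payload: content sent with the request
        return request["content"]
    address = request["storage_address"]     # large payload: request carries only a storage address
    if address in cache:
        return cache[address]                # try the cache first
    return storage[address]                  # fall back to local storage or a third-party database
```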
202. When the content to be retrieved is video content, the server performs multi-mode feature extraction on the video content to obtain mode features corresponding to a plurality of modes.
For example, when the content to be retrieved is video content, the server performs multi-modal feature extraction on the video content by using a trained content retrieval model to obtain initial modal features of each mode in the video content, extracts video frames from the video content, performs multi-modal feature extraction on the video frames by using the trained content retrieval model to obtain basic modal features of each video frame, screens out target modal features corresponding to each mode from the basic modal features, and fuses the target modal features with the corresponding initial modal features to obtain modal features of each mode.
The video content and the video frames in the video content may include multiple modes. For a motion (action) mode, feature extraction may be performed by using an S3D model pre-trained on an action recognition dataset; for an audio mode, by using a pre-trained VGGish model; for a scene mode, by using a pre-trained DenseNet-161 model; for a face mode, by using a pre-trained SSD model together with a ResNet model; for a speech mode, by using a Google API; and for an entity mode, by using a pre-trained SENet-154 model. The extracted initial modality features and basic modality features may include image features, expert features, temporal features, and the like.
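A minimal sketch of this step, assuming PyTorch; the loader, the fixed feature dimensions, and the averaging used to fuse the clip-level and frame-level features are placeholders for the pre-trained experts named above, not a specific library API:

```python
import torch
from typing import Callable, Dict

def load_expert(name: str) -> Callable[[torch.Tensor], torch.Tensor]:
    proj = torch.nn.Linear(64, 512)     # stand-in for a pre-trained expert (S3D, VGGish, DenseNet-161, ...)
    return lambda x: proj(x)

EXPERTS: Dict[str, Callable] = {name: load_expert(name)
                                for name in ("motion", "audio", "scene", "face", "entity")}

def modality_features(video_clip: torch.Tensor, frames: torch.Tensor) -> Dict[str, torch.Tensor]:
    out = {}
    for name, expert in EXPERTS.items():
        initial = expert(video_clip)           # initial modality feature of the whole video
        basic = expert(frames).mean(dim=0)     # basic modality features of the video frames, pooled
        out[name] = (initial + basic) / 2      # fuse the target and initial modality features
    return out
```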
203. And the server respectively performs feature extraction on the modal features corresponding to each mode to obtain the modal content features corresponding to each mode.
For example, according to the mode of the modal feature, a Transformer network corresponding to each mode may be identified in the video feature extraction network of the trained content retrieval model as the target video feature extraction network, and the modal feature is encoded by the Transformer encoder specific to that mode, so that the modal content feature corresponding to each mode is extracted.
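A minimal sketch of one modality-specific Transformer encoder per mode, assuming PyTorch and a common feature dimension across modes; class and parameter names are illustrative:

```python
import torch
import torch.nn as nn

class PerModalityEncoders(nn.Module):
    """One Transformer encoder per modality, standing in for the per-mode
    target video feature extraction network."""
    def __init__(self, modalities, dim=512, heads=4, layers=2):
        super().__init__()
        self.encoders = nn.ModuleDict({
            m: nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
                num_layers=layers)
            for m in modalities})

    def forward(self, modal_feats):
        # modal_feats: {modality: (batch, seq_len, dim)} -> pooled modal content feature per mode
        return {m: self.encoders[m](x).mean(dim=1) for m, x in modal_feats.items()}
```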
204. And the server fuses the modal content characteristics to obtain the video characteristics of the video content.
For example, the server may combine the modal content features of each mode to obtain a modal content feature set of the video content, input the modal content feature set into a Transformer model for encoding so as to calculate the association weights of the modal content features, weight the modal content features according to the association weights, and fuse the weighted modal content features to obtain the video features of the video content; alternatively, the server may obtain a weighting parameter corresponding to each mode, weight the modal content features based on the weighting parameters, and fuse the weighted modal content features to obtain the video features of the video content; or the server may directly splice the modal content features to obtain the video features of the video content.
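A minimal sketch of the first fusion option, assuming PyTorch: the modal content features are stacked as a short sequence of modality tokens, a Transformer layer provides the association (attention) weights among them, and the weighted features are pooled into one video feature; dimensions and pooling are illustrative:

```python
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    def __init__(self, dim=512, heads=4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, modal_content_feats):
        # modal_content_feats: {modality: (batch, dim)}
        tokens = torch.stack(list(modal_content_feats.values()), dim=1)  # (batch, n_modes, dim)
        weighted = self.encoder(tokens)       # association-weighted modality features
        return weighted.mean(dim=1)           # fused video feature, shape (batch, dim)
```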
205. And the server searches the target text content corresponding to the video content in the preset content set according to the video characteristics.
For example, the server may extract features of the candidate text content by using a text encoder such as BERT or word2vec to obtain the text features of the candidate text content; alternatively, it may extract the features of each text word in the candidate text content, calculate the association weights between the text words, and weight the text word features based on the association weights, so as to obtain the text features of the candidate text content.
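As an illustrative sketch only (assuming the Hugging Face transformers package and a generic BERT checkpoint; the checkpoint name and the mask-weighted pooling are assumptions, not the prescribed encoder), candidate text encoding with word-level features might look like:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode_text(candidates):
    inputs = tokenizer(candidates, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        word_feats = encoder(**inputs).last_hidden_state      # per-word (token) features
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (word_feats * mask).sum(dim=1) / mask.sum(dim=1)   # pooled text feature per candidate
```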
The server calculates the cosine similarity between the video feature and the text feature of the candidate text content to obtain the feature similarity; alternatively, it may calculate the feature distance between the video feature and the text feature of the candidate text content, and determine the feature similarity between them according to the feature distance.
The server screens out, from the candidate text contents, those whose feature similarity exceeds a preset similarity threshold, sorts the screened candidate text contents, and takes the sorted candidate text contents as the target text contents corresponding to the video content; alternatively, it sorts the candidate text contents according to the feature similarity and screens out the target text contents corresponding to the video content from the sorted candidate text contents. One or more target text contents may be screened out: when one target text content is required, the candidate text content with the largest feature similarity to the video feature may be taken as the target text content, and when a plurality of target text contents are required, the top-N candidate text contents ranked highest by feature similarity to the video feature may be screened out from the sorted candidate text contents as the target text contents.
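A minimal sketch of the ranking step, assuming PyTorch; the threshold, top-N value and function name are illustrative:

```python
import torch
import torch.nn.functional as F

def retrieve_text(video_feat, text_feats, candidates, threshold=0.3, top_n=5):
    """Rank candidate texts by cosine similarity to the video feature and
    return the top-N candidates whose similarity exceeds the threshold."""
    sims = F.cosine_similarity(video_feat.unsqueeze(0), text_feats)   # feature similarity per candidate
    keep = sims > threshold                                            # screen by the similarity threshold
    kept = [c for c, k in zip(candidates, keep.tolist()) if k]
    order = sims[keep].argsort(descending=True)                        # sort the screened candidates
    return [kept[i] for i in order.tolist()[:top_n]]                   # top-N target text contents
```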
The text features of the candidate text contents in the preset content set may be extracted at various times. For example, they may be extracted in real time, that is, when the acquired content to be retrieved is video content, the text features of the candidate text contents are extracted; alternatively, the text features of the candidate text contents in the preset content set may be extracted before the content to be retrieved is acquired, so that the feature similarity between the text features and the video features can be calculated offline, and the target text content corresponding to the video content can be screened out from the candidate text contents more quickly.
206. When the content to be searched is text content, the server performs feature extraction on the text content, and searches target video content corresponding to the text content in a preset content set according to the extracted text features.
For example, when the content to be retrieved is text content, the server may extract the overall features of the text content by using a text encoder such as BERT or word2vec to obtain the text features of the text content. The server performs multi-modal feature extraction on the candidate video content by using the trained content retrieval model to obtain modal features corresponding to a plurality of modes, performs feature extraction on the modal features corresponding to each mode to obtain the modal video features corresponding to each mode, and fuses the modal video features to obtain the video features of each candidate video content. Then, the cosine similarity between the text features and the video features is calculated to obtain the feature similarity. Candidate video contents whose feature similarity exceeds a preset similarity threshold are screened out of the candidate video contents, the screened candidate video contents are sorted, and the sorted candidate video contents are taken as the target video contents corresponding to the text content; alternatively, the candidate video contents may be sorted according to the feature similarity, and the target video contents corresponding to the text content screened out of the sorted candidate video contents. One or more target video contents may be screened out: when one target video content is required, the candidate video content with the largest feature similarity to the text features may be taken as the target video content, and when a plurality of target video contents are required, the top-N candidate video contents ranked highest by feature similarity to the text features may be screened out from the sorted candidate video contents as the target video contents.
The video features of the candidate video contents may likewise be extracted at various times. For example, they may be extracted in real time, that is, each time content to be retrieved is obtained, the video features are extracted from the candidate video contents; alternatively, feature extraction may be performed on each candidate video content in the preset content set before the content to be retrieved is acquired, so that the feature similarity between the text features and the video features can be calculated offline, and the target video content corresponding to the text content can be screened out from the candidate video contents more quickly.
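A small sketch of the offline precomputation path, assuming PyTorch and a generic `video_encoder` callable (the names are assumptions): the video features of all candidate video contents are extracted once and cached, so that at query time only the text feature needs to be computed before the similarity ranking.

```python
import torch

def build_video_index(candidate_videos, video_encoder):
    """Extract and cache video features for all candidate video contents ahead of retrieval."""
    with torch.no_grad():
        feats = [video_encoder(v) for v in candidate_videos]   # offline feature extraction
    return torch.stack(feats)                                  # cached matrix, shape (num_candidates, dim)

# At query time, only the text feature is computed, e.g. query_feat = encode_text([query])[0],
# and it is ranked against the cached matrix as in the retrieval sketch above.
```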
As can be seen from the above, after obtaining the content to be searched for the target content, when the content to be searched for is the video content, the server in the embodiment of the application performs multi-mode feature extraction on the video content to obtain the mode feature of each mode, performs feature extraction on the mode feature of each mode to obtain the mode content feature of each mode, fuses the mode content features to obtain the video feature of the video content, searches the target text content corresponding to the video content in the preset content set according to the video feature, performs feature extraction on the text content when the content to be searched for is the text content, and searches the target video content corresponding to the text content in the preset content set according to the extracted text feature.
In order to better implement the above method, the embodiment of the present invention further provides a content retrieval device, where the content retrieval device may be integrated into an electronic device, such as a server or a terminal, where the terminal may include a tablet computer, a notebook computer, and/or a personal computer.
For example, as shown in fig. 6, the content retrieval device may include an acquisition unit 301, a first extraction unit 302, a second extraction unit 303, and a text retrieval unit 304, as follows:
(1) An acquisition unit 301;
an acquisition unit 301 for acquiring content to be retrieved for retrieving the target content.
For example, the obtaining unit 301 may be specifically configured to receive the content to be retrieved sent by the user through a terminal, or to acquire the content to be retrieved from a network or a third-party database; alternatively, when the content to be retrieved occupies a large amount of storage or there is a large number of contents to be retrieved, it may receive a content retrieval request that carries a storage address of the content to be retrieved, and acquire the content to be retrieved from a memory, a cache or a third-party database according to the storage address.
(2) A first extraction unit 302;
The first extracting unit 302 is configured to perform multi-mode feature extraction on the video content when the content to be retrieved is the video content, so as to obtain a mode feature of each mode.
For example, the first extraction unit 302 may specifically be configured to, when the content to be retrieved is video content, perform multi-modal feature extraction on the video content by using a trained content retrieval model to obtain initial modal features of each mode in the video content, extract video frames from the video content, perform multi-modal feature extraction on the video frames by using a trained content retrieval model to obtain basic modal features of each video frame, screen out target modal features corresponding to each mode from the basic modal features, and fuse the target modal features with the corresponding initial modal features to obtain modal features of the video content of each mode.
(3) A second extraction unit 303;
The second extracting unit 303 is configured to perform feature extraction on the modal feature of each mode, so as to obtain a modal content feature of each mode.
For example, the second extraction unit 303 may be specifically configured to identify, according to the modes of the mode features, a target video feature extraction network corresponding to each mode in the video feature extraction networks of the trained content retrieval model, and perform feature extraction on the mode features by using the target video feature extraction network to obtain the mode content feature of each mode.
(4) A text retrieval unit 304;
The text retrieval unit 304 is configured to fuse the modal content features to obtain video features of the video content, and retrieve target text content corresponding to the video content from the preset content set according to the video features.
For example, the text retrieval unit 304 may specifically be configured to combine the modal content features of each mode to obtain a modal content feature set of the video content, input the modal content feature set into a Transformer model for encoding so as to calculate the association weights of the modal content features, weight the modal content features according to the association weights, fuse the weighted modal content features to obtain the video features of the video content, respectively calculate the feature similarity between the video features and the text features of the candidate text contents in the preset content set, and screen out the target text content corresponding to the video content from the candidate text contents according to the feature similarity.
Optionally, the content retrieval device may further include a training unit 305, as shown in fig. 7, specifically may be as follows:
the training unit 305 is configured to train the preset content retrieval model to obtain a trained content retrieval model.
For example, the training unit 305 may specifically be configured to obtain a content sample set, where the content sample set includes a video sample and a text sample and the text sample includes at least one text word; perform multi-modal feature extraction on the video sample by using a preset content retrieval model to obtain the sample modal features of each mode; perform feature extraction on the sample modal features of each mode to obtain the sample modal content features of the video sample; fuse the sample modal content features to obtain the sample video features of the video sample; perform feature extraction on the text sample to obtain the sample text features and the text word features corresponding to each text word; and converge the preset content retrieval model according to the sample modal content features, the sample video features, the sample text features and the text word features to obtain the trained content retrieval model.
Optionally, the content retrieval device may further include a video retrieval unit 306, as shown in fig. 8, and specifically may be as follows:
the video retrieving unit 306 is configured to perform feature extraction on the text content when the content to be retrieved is the text content, and retrieve, according to the extracted text feature, the target video content corresponding to the text content from the preset content set.
For example, the video retrieving unit 306 may be specifically configured to, when the content to be retrieved is text content, perform feature extraction on the text content by using a text feature extraction network of the trained content retrieval model, so as to obtain text features of the text content. And respectively calculating the feature similarity between the text feature and the video feature of the candidate video content in the preset content set, and screening out target video content corresponding to the text content from the candidate video content according to the feature similarity.
In the implementation, each unit may be implemented as an independent entity, or may be implemented as the same entity or several entities in any combination, and the implementation of each unit may be referred to the foregoing method embodiment, which is not described herein again.
As can be seen from the foregoing, in this embodiment, after the obtaining unit 301 obtains the content to be retrieved for retrieving the target content, when the content to be retrieved is video content, the first extracting unit 302 performs multi-modal feature extraction on the video content to obtain the modal feature of each mode, the second extracting unit 303 performs feature extraction on the modal feature of each mode to obtain the modal content feature of each mode, and the text retrieval unit 304 fuses the modal content features to obtain the video features of the video content and retrieves the target text content corresponding to the video content from the preset content set according to the video features. Because the scheme first performs multi-modal feature extraction on the video content and then extracts the modal content feature from the modal feature corresponding to each mode, the accuracy of the per-mode features of the video is improved; and because the modal content features are fused to obtain the video features of the video content, the extracted video features can better express the information in the video, thereby improving the accuracy of content retrieval.
The embodiment of the invention also provides an electronic device, as shown in fig. 9, which shows a schematic structural diagram of the electronic device according to the embodiment of the invention, specifically:
The electronic device may include a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. It will be appreciated by those skilled in the art that the electronic device structure shown in fig. 9 does not limit the electronic device, and the electronic device may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components. Wherein:
The processor 401 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402, thereby controlling the electronic device as a whole. Optionally, the processor 401 may include one or more processing cores, and preferably the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, etc., and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the like, and the storage data area may store data created according to the use of the electronic device, and the like. In addition, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components, preferably the power supply 403 may be logically connected to the processor 401 by a power management system, so that functions of managing charging, discharging, and power consumption are performed by the power management system. The power supply 403 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may further comprise an input unit 404, which input unit 404 may be used for receiving input digital or character information and generating keyboard, mouse, joystick, optical or trackball signal inputs in connection with user settings and function control.
Although not shown, the electronic device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the electronic device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions as follows:
When the content to be searched is video content, carrying out multi-mode feature extraction on the video content to obtain the mode feature of each mode, carrying out feature extraction on the mode feature of each mode to obtain the mode content feature of each mode, fusing the mode content features to obtain the video feature of the video content, and searching the target text content corresponding to the video content in a preset content set according to the video feature.
For example, the electronic device receives the content to be retrieved sent by the user through a terminal, or acquires the content to be retrieved from a network or a third-party database; alternatively, when the content to be retrieved occupies a large amount of storage or there is a large number of contents to be retrieved, the electronic device receives a content retrieval request that carries a storage address of the content to be retrieved, and acquires the content to be retrieved from a memory, a cache or a third-party database according to the storage address. When the content to be retrieved is video content, the electronic device performs multi-modal feature extraction on the video content by using the trained content retrieval model to obtain the initial modal features of each mode in the video content, extracts video frames from the video content, performs multi-modal feature extraction on the video frames by using the trained content retrieval model to obtain the basic modal features of each video frame, screens out the target modal features corresponding to each mode from the basic modal features, and fuses the target modal features with the corresponding initial modal features to obtain the modal features of each mode. According to the modes of the modal features, the target video feature extraction network corresponding to each mode is identified in the video feature extraction networks of the trained content retrieval model, and feature extraction is performed on the modal features by using the target video feature extraction network to obtain the modal content features corresponding to each mode. The modal content features of each mode are combined to obtain the modal content feature set of the video content, the modal content feature set is input into a Transformer model for encoding so as to calculate the association weights of the modal content features, the modal content features are weighted according to the association weights, the weighted modal content features are fused to obtain the video features of the video content, the feature similarity between the video features and the text features of the candidate text contents in the preset content set is respectively calculated, and the target text content corresponding to the video content is screened out from the candidate text contents according to the feature similarity.
The specific implementation of each operation may be referred to the previous embodiments, and will not be described herein.
As can be seen from the above, after obtaining the content to be searched for searching the target content, when the content to be searched is the video content, the embodiment of the invention performs multi-mode feature extraction on the video content to obtain the mode feature of each mode, performs feature extraction on the mode feature of each mode to obtain the mode content feature of each mode, fuses the mode content features to obtain the video feature of the video content, and searches the target text content corresponding to the video content in the preset content set according to the video feature.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present invention provide a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any of the content retrieval methods provided by embodiments of the present invention. For example, the instructions may perform the steps of:
When the content to be searched is video content, carrying out multi-mode feature extraction on the video content to obtain the mode feature of each mode, carrying out feature extraction on the mode feature of each mode to obtain the mode content feature of each mode, fusing the mode content features to obtain the video feature of the video content, and searching the target text content corresponding to the video content in a preset content set according to the video feature.
For example, the content to be retrieved sent by the user through a terminal is received, or the content to be retrieved is acquired from a network or a third-party database; alternatively, when the content to be retrieved occupies a large amount of storage or there is a large number of contents to be retrieved, a content retrieval request is received, where the content retrieval request carries a storage address of the content to be retrieved, and the content to be retrieved is acquired from a memory, a cache or a third-party database according to the storage address. When the content to be retrieved is video content, multi-modal feature extraction is performed on the video content by using the trained content retrieval model to obtain the initial modal features of each mode in the video content, video frames are extracted from the video content, multi-modal feature extraction is performed on the video frames by using the trained content retrieval model to obtain the basic modal features of each video frame, the target modal features corresponding to each mode are screened out from the basic modal features, and the target modal features are fused with the corresponding initial modal features to obtain the modal features of each mode. According to the modes of the modal features, the target video feature extraction network corresponding to each mode is identified in the video feature extraction networks of the trained content retrieval model, and feature extraction is performed on the modal features by using the target video feature extraction network to obtain the modal content features corresponding to each mode. The modal content features of each mode are combined to obtain the modal content feature set of the video content, the modal content feature set is input into a Transformer model for encoding so as to calculate the association weights of the modal content features, the modal content features are weighted according to the association weights, the weighted modal content features are fused to obtain the video features of the video content, the feature similarity between the video features and the text features of the candidate text contents in the preset content set is respectively calculated, and the target text content corresponding to the video content is screened out from the candidate text contents according to the feature similarity.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
The computer-readable storage medium may include, among others, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disc, and the like.
Because the instructions stored in the computer readable storage medium can execute the steps in any content retrieval method provided by the embodiments of the present invention, the beneficial effects that any content retrieval method provided by the embodiments of the present invention can be achieved, and detailed descriptions of the foregoing embodiments are omitted herein.
According to an aspect of the application, a computer program product or a computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the methods provided in the various alternative implementations of the content retrieval aspect or the video/text bi-directional retrieval aspect described above.
The foregoing describes in detail a content retrieval method, apparatus and computer-readable storage medium according to embodiments of the present invention. Specific examples are set forth herein to illustrate the principles and implementations of the present invention, and the above description of the embodiments is provided only to assist in understanding the method and core ideas of the present invention. Meanwhile, those skilled in the art may, according to the ideas of the present invention, make variations in the specific implementations and application scope. In view of the above, the content of this specification should not be construed as limiting the scope of the present invention.