CN113821687B - A content retrieval method, device and computer-readable storage medium - Google Patents


Info

Publication number
CN113821687B
CN113821687B (application CN202110733613.6A)
Authority
CN
China
Prior art keywords
content
feature
video
sample
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110733613.6A
Other languages
Chinese (zh)
Other versions
CN113821687A
Inventor
王文哲
张梦丹
彭湃
孙星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110733613.6A
Publication of CN113821687A
Application granted
Publication of CN113821687B
Legal status: Active
Anticipated expiration


Abstract

Translated from Chinese

The embodiments of the present invention disclose a content retrieval method, device and computer-readable storage medium. After obtaining the content to be retrieved for retrieving target content, when the content to be retrieved is video content, multimodal feature extraction is performed on the video content to obtain the modal features of each modality, feature extraction is performed on the modal features of each modality to obtain the modal content features of each modality, the modal content features are fused to obtain the video features of the video content, and according to the video features, the target text content corresponding to the video content is retrieved from a preset content set. This scheme can improve the accuracy of content retrieval.

Description

Content retrieval method, device and computer readable storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a content retrieval method, apparatus, and computer readable storage medium.
Background
In recent years, a huge amount of content has been produced on the internet, and this content may be of various types, for example, text, video, and the like. In order to better retrieve desired content from this mass of content, it is common to retrieve content of one type that matches content of another type; for example, text content may be retrieved that matches video content provided by the user. Existing content retrieval usually adopts a feature extraction network to directly extract video features and text features for feature matching so as to complete the content retrieval.
In the research and practice of the prior art, the inventor discovered that video contains multiple modes and complex semantics, and that video features extracted with a single feature extraction network are insufficiently accurate, so that the video features cannot be put in one-to-one correspondence with text semantics; as a result, the accuracy of content retrieval is insufficient.
Disclosure of Invention
The embodiment of the invention provides a content retrieval method, a content retrieval device and a computer readable storage medium, which can improve the accuracy of content retrieval.
A content retrieval method, comprising:
acquiring content to be searched for searching for target content;
When the content to be retrieved is video content, carrying out multi-mode feature extraction on the video content to obtain the mode feature of each mode;
respectively extracting the characteristics of the mode characteristics of each mode to obtain the mode content characteristics of each mode;
And fusing the modal content characteristics to obtain video characteristics of the video content, and searching target text content corresponding to the video content in a preset content set according to the video characteristics.
Accordingly, an embodiment of the present invention provides a content retrieval device, including:
An acquisition unit configured to acquire content to be retrieved for retrieving a target content;
The first extraction unit is used for extracting multi-mode characteristics of the video content when the content to be retrieved is the video content, so as to obtain the mode characteristics of each mode;
the second extraction unit is used for extracting the characteristics of the mode characteristics of each mode respectively to obtain the mode video characteristics of each mode;
and the text retrieval unit is used for fusing the modal video characteristics to obtain video characteristics of the video content, and retrieving target text content corresponding to the video content from the preset content set according to the video characteristics.
Optionally, in some embodiments, the first extracting unit may be specifically configured to perform multi-modal feature extraction on the video content by using a trained content retrieval model to obtain an initial modal feature of each mode in the video content, extract a video frame from the video content, perform multi-modal feature extraction on the video frame by using the trained content retrieval model to obtain a basic modal feature of each video frame, screen out a target modal feature corresponding to each mode from the basic modal features, and fuse the target modal feature with the corresponding initial modal feature to obtain a modal feature of the video content of each mode.
Optionally, in some embodiments, the second extracting unit may be specifically configured to identify a target video feature extraction network corresponding to each mode in the video feature extraction network of the trained content retrieval model, and perform feature extraction on the mode feature by using the target video feature extraction network to obtain a mode video feature of each mode.
Optionally, in some embodiments, the content retrieval device may further include a training unit, where the training unit may be specifically configured to obtain a content sample set, where the content sample set includes a video sample and a text sample and the text sample includes at least one text word, perform multi-modal feature extraction on the video sample by using a preset content retrieval model to obtain sample modal features of each mode, perform feature extraction on the sample modal features of each mode to obtain sample modal content features of the video sample, fuse the sample modal content features to obtain sample video features of the video sample, perform feature extraction on the text sample to obtain sample text features and text word features corresponding to each text word, and converge the preset content retrieval model according to the sample modal content features, the sample video features, the sample text features, and the text word features, so as to obtain the trained content retrieval model.
Optionally, in some embodiments, the training unit may be specifically configured to determine feature loss information of the content sample set according to the sample modal content feature and the text word feature, determine content loss information of the content sample set based on the sample video feature and the sample text feature, fuse the feature loss information and the content loss information, and converge a preset content retrieval model based on the fused loss information, to obtain a trained content retrieval model.
Optionally, in some embodiments, the training unit may be specifically configured to calculate a feature similarity between the sample modal content feature and the text word feature to obtain a first feature similarity, determine a sample similarity between the video sample and the text sample according to the first feature similarity, and calculate a feature distance between the video sample and the text sample based on the sample similarity to obtain feature loss information of the content sample set.
Optionally, in some embodiments, the training unit may be specifically configured to perform feature interaction on the sample modal content feature and the text word feature according to the first feature similarity to obtain a post-interaction video feature and a post-interaction text word feature, calculate feature similarity between the post-interaction video feature and the post-interaction text word feature to obtain a second feature similarity, and fuse the second feature similarity to obtain a sample similarity between the video sample and the text sample.
Optionally, in some embodiments, the training unit may be specifically configured to perform normalization processing on the first feature similarity to obtain a target feature similarity, determine, according to the target feature similarity, an association weight of the sample modal content feature, where the association weight is used to indicate an association relationship between the sample modal content feature and a text word feature, weight the sample modal content feature based on the association weight, and update the text word feature based on the weighted sample modal content feature to obtain the post-interaction video feature and the post-interaction text word feature.
Optionally, in some embodiments, the training unit may be specifically configured to use the weighted sample modal content feature as an initial post-interaction video feature, update the text word feature based on the initial post-interaction video feature to obtain an initial post-interaction text word feature, calculate feature similarity between the initial post-interaction video feature and the initial post-interaction text word feature to obtain a third feature similarity, and update the initial post-interaction video feature and the initial post-interaction text word feature according to the third feature similarity to obtain the post-interaction video feature and the post-interaction text word feature.
Optionally, in some embodiments, the training unit may be specifically configured to perform feature interaction on the initial post-interaction video feature and the initial post-interaction text word feature according to the third feature similarity, so as to obtain a target post-interaction video feature and a target post-interaction text word feature, take the target post-interaction video feature as the initial post-interaction video feature and take the target post-interaction text word feature as the initial post-interaction text word feature, and return to the step of executing the calculation of the feature similarity between the initial post-interaction video feature and the initial post-interaction text word feature until the feature interaction times of the initial post-interaction video feature and the initial post-interaction text word feature reach a preset time, so as to obtain the post-interaction video feature and the post-interaction text word feature.
Optionally, in some embodiments, the training unit may be specifically configured to obtain a preset feature boundary value corresponding to the content sample set, screen, according to the sample similarity, a first content sample pair in which a video sample is matched with a text sample and a second content sample pair in which a video sample is not matched with a text sample from the content sample set, and calculate, based on the preset feature boundary value, a feature distance between the first content sample pair and the second content sample pair, so as to obtain feature loss information of the content sample set.
Optionally, in some embodiments, the training unit may be specifically configured to screen a content sample pair with a maximum sample similarity from the second content sample pair to obtain a target content sample pair, calculate a similarity difference between the sample similarity of the first content sample pair and the sample similarity of the target content sample pair to obtain a first similarity difference, and fuse the preset feature boundary value with the first similarity difference to obtain feature loss information of the content sample set.
Optionally, in some embodiments, the training unit may be specifically configured to calculate a feature similarity between the sample video feature and the text feature to obtain a content similarity between the video sample and the text sample, screen, according to the content similarity, a third content sample pair in which the video sample is matched with the text sample and a fourth content sample pair in which the video sample is not matched with the text sample from the content sample set, obtain a preset content boundary value corresponding to the content sample set, and calculate a content difference value between the third content sample pair and the fourth content sample pair according to the preset content boundary value, so as to obtain content loss information of the content sample set.
Optionally, in some embodiments, the training unit may be specifically configured to calculate a similarity difference between the content similarity of the third content sample pair and the content similarity of the fourth content sample pair to obtain a second similarity difference, fuse the second similarity difference with a preset content boundary value to obtain a content difference between the third content sample pair and the fourth content sample pair, and perform normalization processing on the content difference to obtain content loss information of the content sample set.
Optionally, in some embodiments, the content retrieval device may further include a video retrieval unit, where the video retrieval unit may be specifically configured to perform feature extraction on the text content to obtain text features of the text content when the content to be retrieved is the text content, and retrieve, according to the text features, a target video content corresponding to the text content in the preset content set.
In addition, the embodiment of the invention also provides electronic equipment, which comprises a processor and a memory, wherein the memory stores an application program, and the processor is used for running the application program in the memory to realize the content retrieval method provided by the embodiment of the invention.
In addition, the embodiment of the invention also provides a computer readable storage medium, which stores a plurality of instructions, the instructions are suitable for being loaded by a processor to execute the steps in any of the content retrieval methods provided by the embodiment of the invention.
According to the embodiment of the application, after the content to be searched for searching the target content is obtained, when the content to be searched is video content, multi-modal feature extraction is carried out on the video content to obtain modal features of each mode, the modal features of each mode are respectively subjected to feature extraction to obtain the modal content features of each mode, the modal content features are fused to obtain the video features of the video content, and the target text content corresponding to the video content is searched in the preset content set according to the video features.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic view of a scenario of a content retrieval method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a content retrieval method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of performing modal feature extraction on video content according to an embodiment of the present invention;
FIG. 4 is a training schematic diagram of a preset content retrieval model according to an embodiment of the present invention;
FIG. 5 is another flow chart of a content retrieval method according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a content retrieval device according to an embodiment of the present invention;
fig. 7 is another schematic structural diagram of a content retrieval device according to an embodiment of the present invention;
fig. 8 is another schematic structural diagram of a content retrieval device according to an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The embodiment of the invention provides a content retrieval method, a content retrieval device and a computer readable storage medium. The content search device may be integrated in an electronic device, which may be a server or a terminal.
The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN) services, big data, and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
For example, referring to fig. 1, taking the case that the content retrieval device is integrated in an electronic device, after the electronic device obtains the content to be retrieved for retrieving the target content, when the content to be retrieved is the video content, multi-mode feature extraction is performed on the video content to obtain mode features of each mode, feature extraction is performed on the mode features of each mode to obtain mode content features of each mode, the mode content features are fused to obtain video features of the video content, and the target text content corresponding to the video content is retrieved in the preset content set according to the video features, so that the accuracy of content retrieval is improved.
It should be noted that, the content retrieval method provided by the embodiment of the present application relates to a computer vision technology in the field of artificial intelligence, that is, in the embodiment of the present application, feature extraction may be performed on text content and video content by using the computer vision technology of artificial intelligence, and based on the extracted features, target content may be screened out from a preset content set.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and other directions.
Computer Vision (CV) technology is a science that studies how to make a machine "see"; more specifically, it replaces human eyes with cameras and computers to perform machine vision tasks such as recognition, tracking, and measurement on a target, and further performs graphics processing so that the processed image is more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, intelligent transportation, and the like, as well as common biometric technologies such as face recognition and fingerprint recognition.
The following will describe in detail. The following description of the embodiments is not intended to limit the preferred embodiments.
The embodiment will be described from the perspective of a content retrieval apparatus, which may be integrated in an electronic device; the electronic device may be a server or a terminal, and the terminal may include a tablet computer, a notebook computer, a personal computer (PC), a wearable device, a virtual reality device, or other devices capable of performing content retrieval.
A content retrieval method, comprising:
When the content to be searched is video content, carrying out multi-mode feature extraction on the video content to obtain the mode feature of each mode, carrying out feature extraction on the mode feature of each mode to obtain the mode content feature of each mode, fusing the mode content features to obtain the video feature of the video content, and searching the target text content corresponding to the video content in a preset content set according to the video feature.
As shown in fig. 2, the specific flow of the content retrieval method is as follows:
101. and acquiring the content to be searched for searching the target content.
The content to be searched may be understood as a content in a search condition for searching the target content, and the content type of the content to be searched may be various, for example, may be text content or video content.
The manner of obtaining the content to be retrieved may be various, and specifically may be as follows:
for example, the content to be retrieved sent by a user through a terminal may be directly received, or the content to be retrieved may be obtained from a network or a third-party database; or, when the content to be retrieved occupies a large amount of storage or is large in number, a content retrieval request may be received, where the content retrieval request carries a storage address of the content to be retrieved, and the content to be retrieved is obtained from a memory, a cache, or a third-party database according to the storage address.
102. When the content to be retrieved is video content, multi-mode feature extraction is carried out on the video content to obtain mode features corresponding to a plurality of modes.
The mode features may be understood as the feature information corresponding to each mode in the video content. The video content may include a plurality of modes, for example, actions, audio, scenes, faces, OCR text, speech, entities, and/or the like.
The multi-modal feature extraction method for the video content may be various, and specifically may be as follows:
For example, a trained content retrieval model is adopted to conduct multi-mode feature extraction on video content to obtain initial mode features of each mode in the video content, video frames are extracted from the video content, a trained content retrieval model is adopted to conduct multi-mode feature extraction on the video frames to obtain basic mode features of each video frame, target mode features corresponding to each mode are screened out of the basic mode features, and the target mode features and the corresponding initial mode features are fused to obtain mode features of each mode.
The video content and the video frames contain multiple modes, and for different modes, different feature extraction methods can be used to extract multi-mode features from the video content and the video frames. For example, for the action mode, an S3D model (an action recognition model) pre-trained on an action recognition data set can be used to extract features; for the audio mode, a pre-trained VGGish model (an audio feature extraction model) can be used; for the scene mode, a pre-trained DenseNet-161 model (a deep model) can be used; for the face mode, a pre-trained SSD model and a ResNet model can be used; for the speech mode, a Google API (a feature extraction interface) can be used; and for the entity mode, a pre-trained SENet-154 model (a feature extraction network) can be used. The extracted initial modality features and basic modality features may include image features, expert features, temporal features, and the like.
The target modal features and the corresponding initial modal features may be fused in various manners. For example, the image feature (F), the expert feature (E), and the temporal feature (T) in the target modal features and the initial modal features may be added to obtain the modal feature (Ω) of each mode, as shown in Fig. 3. Alternatively, weighting coefficients of the target modal features and the initial modal features can be obtained, the target modal features and the initial modal features are weighted according to the weighting coefficients, and the weighted target modal features and initial modal features are fused to obtain the modal features of each mode.
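As an illustration of the fusion just described, the following is a minimal sketch assuming a PyTorch implementation; the module name, feature dimension, and the use of learned expert and temporal embeddings added to the frame features are assumptions for illustration, not details taken from the patent.

```python
# Minimal sketch (not the patented implementation): fusing per-mode frame
# features (F) with a learned expert/mode embedding (E) and a temporal
# embedding (T) by element-wise addition, i.e. Omega = F + E + T.
# Dimensions, names, and the use of PyTorch are illustrative assumptions.
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    def __init__(self, num_modalities: int, max_frames: int, dim: int = 512):
        super().__init__()
        self.expert_emb = nn.Embedding(num_modalities, dim)   # E: one vector per mode
        self.temporal_emb = nn.Embedding(max_frames, dim)     # T: one vector per frame index

    def forward(self, frame_feats: torch.Tensor, modality_id: int) -> torch.Tensor:
        # frame_feats: (num_frames, dim) features of one mode, e.g. S3D or VGGish outputs
        n = frame_feats.size(0)
        e = self.expert_emb(torch.tensor(modality_id, device=frame_feats.device))
        t = self.temporal_emb(torch.arange(n, device=frame_feats.device))
        return frame_feats + e + t                             # Omega = F + E + T

# usage (illustrative): fuse = ModalityFusion(num_modalities=7, max_frames=64)
# omega = fuse(torch.randn(32, 512), modality_id=0)
```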
The trained content retrieval model can be set according to the requirements of practical applications, and in addition, it should be noted that the trained content retrieval model can be preset by maintenance personnel, and also can be trained by a content retrieval device, that is, before the step of performing multi-mode feature extraction on video content by using the trained content retrieval model to obtain initial mode features of each mode in the video content, the content retrieval method can further include:
acquiring a content sample set, wherein the content sample set comprises a video sample and a text sample and the text sample comprises at least one text word; adopting a preset content retrieval model to extract multi-modal features of the video sample to obtain sample modal features of each mode; respectively carrying out feature extraction on the sample modal features of each mode to obtain sample modal content features of the video sample, and fusing the sample modal content features to obtain sample video features of the video sample; carrying out feature extraction on the text sample to obtain sample text features and text word features corresponding to each text word; and converging the preset content retrieval model according to the sample modal content features, the sample video features, the sample text features and the text word features to obtain the trained content retrieval model. The training may comprise the following steps:
S1, acquiring a content sample set.
Wherein the set of content samples includes a video sample and a text sample, the text sample including at least one text word.
The manner of acquiring the content sample set may be various, and specifically may be as follows:
For example, a video sample and a text sample may be directly obtained to obtain a content sample set, or an original video content and an original text content may be obtained, then the original video content and the original text content are sent to a labeling server, a matching tag between the original video content and the original text content returned by the labeling server is received, the matching tag is added to the original video content and the original text content, so as to obtain a video sample and a text sample, the video sample and the text sample are combined to obtain a content sample set, or when the number of content samples in the content sample set is large or the memory is large, a model training request may be received, where the model training request carries a storage address of the content sample set, and the content sample set is obtained in a memory, a cache or a third party database according to the storage address.
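For illustration only, a content sample set of matched video/text pairs could be organised as in the following Python sketch; the field names and record format are hypothetical and are not part of this disclosure.

```python
# Hypothetical organisation of a content sample set of matched video/text pairs;
# the field names and record format are assumptions, not part of this disclosure.
from dataclasses import dataclass, field
from typing import List, Dict

@dataclass
class ContentSample:
    video_path: str                                        # the video sample (or its pre-extracted features)
    text: str                                              # the text sample
    text_words: List[str] = field(default_factory=list)    # at least one text word
    matched: bool = True                                   # matching tag, e.g. returned by a labeling server

def build_content_sample_set(records: List[Dict]) -> List[ContentSample]:
    # records could come from a third-party database or from a model training request
    return [
        ContentSample(
            video_path=r["video"],
            text=r["caption"],
            text_words=r["caption"].split(),
            matched=bool(r.get("matched", True)),
        )
        for r in records
    ]
```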
S2, carrying out multi-mode feature extraction on the video sample by adopting a preset content retrieval model to obtain sample mode features of each mode.
For example, a preset content retrieval model is adopted to perform multi-mode feature extraction on the video sample to obtain initial sample modal features of each mode in the video sample; a video frame is extracted from the video sample, and the preset content retrieval model is adopted to perform multi-mode feature extraction on the video frame to obtain basic sample modal features of each video frame; target sample modal features corresponding to each mode are screened out from the basic sample modal features; and the target sample modal features are fused with the corresponding initial sample modal features to obtain the sample modal features of each mode. This process is similar to the one described above, and details are not repeated here.
And S3, respectively carrying out feature extraction on the sample mode features of each mode to obtain sample mode content features of the video sample, and fusing the sample mode content features to obtain sample video features of the video sample.
For example, according to the mode of the sample mode characteristics, a target video characteristic extraction network corresponding to each mode is identified in a video characteristic extraction network of a preset content retrieval model, and the sample mode characteristics are subjected to characteristic extraction by adopting the target video characteristic extraction network to obtain sample mode content characteristics corresponding to each mode. And fusing the sample modal content characteristics to obtain sample video characteristics of the video sample.
Each mode of the video feature extraction networks of the preset content retrieval model is fixed, so the video feature extraction network corresponding to a mode can be identified simply according to the mode of the sample modal feature, and the identified video feature extraction network is used as the target video feature extraction network.
After the target video feature extraction network is identified, the target video feature extraction network may be used to perform feature extraction on the sample modal features. The feature extraction process may take various forms; for example, the target video feature extraction network may be the encoder of a mode-specific Transformer (a conversion network), which encodes the sample modal features so as to extract the sample modal content features of each mode.
After the sample modal content features are extracted, the sample modal content features can be fused. For example, the sample modal content features of each mode can be combined to obtain a sample modal content feature set of the video sample, the sample modal content feature set is input into a Transformer for encoding so as to calculate the association weights of the sample modal content features, the sample modal content features are weighted according to the association weights, and the weighted sample modal content features are fused to obtain the sample video features of the video sample.
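To make the structure concrete, the following is a hedged PyTorch sketch of mode-specific Transformer encoders followed by a fusion Transformer, roughly corresponding to steps S2 and S3; the layer counts, dimensions, and the mean-pooling used for aggregation are assumptions.

```python
# Sketch (assumptions: PyTorch, 7 modes, dim 512): one Transformer encoder per
# mode extracts the sample modal content feature, and a second Transformer over
# the set of modal content features produces the fused sample video feature.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    def __init__(self, num_modalities: int = 7, dim: int = 512):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        # mode-specific encoders ("target video feature extraction networks")
        self.per_mode = nn.ModuleList(
            [nn.TransformerEncoder(make_layer(), num_layers=2) for _ in range(num_modalities)]
        )
        # fusion encoder over the set of modal content features
        self.fusion = nn.TransformerEncoder(make_layer(), num_layers=2)

    def forward(self, modal_feats: list):
        # modal_feats[m]: (batch, seq_len_m, dim) modal features Omega of mode m;
        # the list length must match num_modalities
        content = [enc(x).mean(dim=1) for enc, x in zip(self.per_mode, modal_feats)]
        content = torch.stack(content, dim=1)        # (batch, num_modalities, dim) modal content features
        video = self.fusion(content).mean(dim=1)     # (batch, dim) fused video feature
        return content, video
```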
And S4, extracting features of the text sample to obtain sample text features and text word features corresponding to each text word, and converging a preset content retrieval model according to the sample modal content features, the sample video features, the sample text features and the text word features to obtain a trained content retrieval model.
For example, a text feature extraction network of a preset content retrieval model is adopted to perform feature extraction on a text sample to obtain text features of the text sample and text word features of text words, and then the preset content retrieval model is converged according to sample modal content features, sample video features, sample text features and text word features to obtain a trained content retrieval model.
Features may be extracted from the text sample in various ways. For example, a text encoder may be used to perform feature extraction on the text sample to obtain the sample text features and the text word features. The text encoder may be of various types and may include, for example, Bert (a text encoder) or word2vector (a word vector generation model), and so on.
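As a sketch of one possible text encoder, the snippet below uses the HuggingFace transformers implementation of Bert; treating the [CLS] output as the sample text feature and the remaining token outputs as the text word features is an illustrative assumption.

```python
# Sketch of a Bert-based text encoder using the HuggingFace `transformers`
# library; using the [CLS] output as the sample text feature and the remaining
# token outputs as the text word features is an illustrative assumption.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")

def encode_text(sentence: str):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = text_encoder(**inputs)
    text_feature = out.last_hidden_state[:, 0]       # [CLS] token -> sample text feature
    word_features = out.last_hidden_state[:, 1:-1]   # token (word-piece) features -> text word features
    return text_feature, word_features
```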
After extracting the text features and the text word features, the preset content retrieval model can be converged according to the sample modal content features, the sample video features, the sample text features and the text word features, and various convergence modes can be adopted, and specifically can be as follows:
For example, feature loss information of a content sample set may be determined according to sample modal content features and text word features, content loss information of the content sample set may be determined based on sample video features and sample text features, the feature loss information and the content loss information may be fused, and a preset content retrieval model may be converged based on the fused loss information to obtain a trained content retrieval model, which may be specifically as follows:
(1) And determining the feature loss information of the content sample set according to the sample modal content features and the text word features.
For example, feature similarity between the sample modal content feature and the text word feature can be calculated to obtain first feature similarity, sample similarity between the video sample and the text sample is determined according to the first feature similarity, and feature distance between the video sample and the text sample is calculated based on the sample similarity to obtain feature loss information of the content sample set.
The method for calculating the feature similarity between the sample modal content feature and the text word feature may be various, for example, cosine similarity between the sample modal content feature and the text word feature may be calculated, and the cosine similarity is used as the first feature similarity, and specifically, the method may be shown in the formula (1):
Sij = (wi · vj) / (‖wi‖ ‖vj‖)    (1)

where Sij is the first feature similarity, wi is a text word feature, and vj is a sample modal content feature.
After the first feature similarity is calculated, the sample similarity between the video sample and the text sample can be determined according to the first feature similarity, and various determining modes can be adopted, for example, feature interaction can be carried out on sample mode content features and text word features according to the first feature similarity to obtain post-interaction video features and post-interaction text word features, feature similarity between the post-interaction video features and the post-interaction text word features is calculated to obtain second feature similarity, and the second feature similarity is fused to obtain the sample similarity between the video sample and the text sample.
The method for performing feature interaction on the sample modal content features and the text word features can be various, for example, the first feature similarity can be subjected to standardized processing to obtain target feature similarity, the association weight of the sample modal content features is determined according to the target feature similarity, the association weight is used for indicating the association relationship between the sample modal content features and the text word features, the sample modal content features are weighted based on the association weight, and the text word features are updated based on the weighted sample modal content features to obtain the post-interaction video features and the post-interaction text word features.
The first feature similarity may be normalized in various ways, for example by using an activation function; the activation function may be of various kinds, for example ReLU(x) = max(0, x), and the normalization process may be as shown in formula (2):
where S̄ij is the target feature similarity, Sij is the first feature similarity, and ReLU is the activation function.
The method for determining the association weight of the sample modal content features according to the target feature similarity may be various, for example, a preset association parameter may be obtained, the association parameter and the target feature similarity are fused to obtain the association weight, and the association weight may be understood as an attention weight, and may be specifically shown as formula (3):
aij = exp(λ S̄ij) / Σj exp(λ S̄ij)    (3)

where aij is the association weight, λ is a preset association parameter, which may be the inverse temperature parameter of the softmax, and S̄ij is the target feature similarity.
After the association weights of the sample modal content features are determined, the sample modal content features are weighted based on the association weights, and the weighted sample modal content features are fused to obtain the weighted modal content features, which are used as the initial post-interaction video features of the video sample, as shown in formula (4):
ai = Σj aij vj    (4)

where ai is the initial post-interaction video feature, aij is the association weight, and vj is a sample modal content feature.
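The single interaction step described by formulas (1), (3) and (4) (cosine similarity, softmax association weights with the inverse temperature λ, and the weighted sum of modal content features) can be sketched as follows; the ReLU-based normalisation stands in for formula (2), whose exact form is not reproduced here, and all tensor shapes are assumptions.

```python
# Sketch of one cross-attention step covering formulas (1), (3) and (4):
# cosine similarity between text word features and modal content features,
# softmax association weights with inverse temperature lambda, and the
# weighted sum. The normalisation step is a stand-in for formula (2).
import torch
import torch.nn.functional as F

def cross_attend(words: torch.Tensor, modal_content: torch.Tensor, lam: float = 4.0):
    # words: (n_words, dim) text word features w_i
    # modal_content: (n_modal, dim) sample modal content features v_j
    s = F.cosine_similarity(words.unsqueeze(1), modal_content.unsqueeze(0), dim=-1)  # (1): Sij
    s_bar = F.relu(s)
    s_bar = s_bar / (s_bar.pow(2).sum(dim=0, keepdim=True).sqrt() + 1e-8)            # normalisation
    a = F.softmax(lam * s_bar, dim=1)                                                # (3): aij
    attended = a @ modal_content                                                     # (4): ai
    return attended  # (n_words, dim) initial post-interaction video features
```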
After the initial post-interaction video feature is calculated, the text word feature can be updated based on the initial post-interaction video feature to obtain a post-interaction video feature and a post-interaction text word feature. For example, the text word feature can be updated based on the initial post-interaction video feature to obtain an initial post-interaction text word feature, the feature similarity between the initial post-interaction video feature and the initial post-interaction text word feature is calculated to obtain a third feature similarity, and the initial post-interaction video feature and the initial post-interaction text word feature are updated according to the third feature similarity to obtain the post-interaction video feature and the post-interaction text word feature.
The method for updating the text word features based on the video features after the initial interaction can be various, for example, preset updating parameters can be obtained, the preset updating parameters, the video features after the initial interaction and the text word features are fused to obtain the text word features after the initial interaction, and the method specifically can be as shown in a formula (5):
where F(wi, ai) is the text word feature after the initial interaction, wi is a text word feature, ai is the initial post-interaction video feature, gi is a gate operation for selecting the most useful information, oi is a fusion feature for enhancing the interaction between the text word feature and the initial post-interaction video feature, and Fg, bg, Fo and bo are preset update parameters. A multi-step operation (multiple feature interactions) involves multiple updates of the text word features, so formula (5) can be integrated as Fa, and the formula for the process of K feature interactions (mutual cross-attention operations) can then be obtained, as shown in formula (6):
where K denotes the K-th interaction, W^K and W^(K−1) denote the text word features after the K-th and (K−1)-th interactions respectively, A^K is the post-interaction video feature of the K-th interaction, and Vrep denotes the modal video features.
The method comprises the steps of updating the initial post-interaction video feature and the initial post-interaction text word feature according to the third feature similarity, for example, feature interaction can be performed on the initial post-interaction video feature and the initial post-interaction text word feature according to the third feature similarity to obtain a target post-interaction video feature and a target post-interaction text word feature, the target post-interaction video feature is used as the initial post-interaction video feature, the target post-interaction text word feature is used as the initial post-interaction text word feature, and the step of calculating the feature similarity of the initial post-interaction video feature and the initial post-interaction text word feature is performed until the feature interaction times of the initial post-interaction video feature and the initial post-interaction text word feature reach a preset time, so that the post-interaction video feature and the post-interaction text word feature are obtained.
The process of feature interaction can be regarded as a multi-step cross-attention calculation through which the post-interaction video features and the post-interaction text word features are obtained. The number of feature interactions may be set according to the requirements of the actual application.
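A hedged sketch of the K-step interaction loop is given below; it reuses the cross_attend sketch above and uses one plausible gated update in place of formula (5), whose exact form is not reproduced here.

```python
# Hedged sketch of the K-step interaction loop, reusing cross_attend from the
# sketch above. The gated update below is one plausible form of F_a; it is not
# taken from formula (5).
import torch
import torch.nn as nn

class GatedUpdate(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)   # roughly plays the role of Fg, bg
        self.fuse = nn.Linear(2 * dim, dim)   # roughly plays the role of Fo, bo

    def forward(self, w: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        x = torch.cat([w, a], dim=-1)
        g = torch.sigmoid(self.gate(x))       # g_i: selects the most useful information
        o = torch.tanh(self.fuse(x))          # o_i: fused word/video information
        return g * o + (1.0 - g) * w          # updated text word feature

def interact(words, modal_content, update: GatedUpdate, steps: int = 2):
    w, a = words, None
    for _ in range(steps):                    # K feature interactions
        a = cross_attend(w, modal_content)    # post-interaction video features A^K
        w = update(w, a)                      # post-interaction text word features W^K
    return w, a
```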
After the post-interaction video feature and the post-interaction text word feature are obtained, the sample similarity between the video sample and the text sample can be calculated, and various calculation modes can be adopted, for example, the feature similarity between the post-interaction video feature and the post-interaction text word feature can be calculated to obtain a second feature similarity, the second feature similarity is fused to obtain the sample similarity between the video sample and the text sample, and the sample similarity is shown in a formula (7):
where S(T, Vrep) is the sample similarity, wi^k is a post-interaction text word feature, and ai^k is a post-interaction video feature.
After the sample similarity is calculated, the feature distance between the video sample and the text sample can be calculated, so as to obtain the feature loss information of the content sample set. For example, a preset feature boundary value corresponding to the content sample set can be obtained; a first content sample pair in which the video sample is matched with the text sample and a second content sample pair in which the video sample is not matched with the text sample are screened out from the content sample set according to the sample similarity; and the feature distance between the first content sample pair and the second content sample pair is calculated based on the preset feature boundary value, so as to obtain the feature loss information of the content sample set.
According to the sample similarity, a plurality of modes can be adopted for screening the first content sample pair and the second content sample pair from the content sample set, for example, the sample similarity can be compared with a preset similarity threshold, a video sample and a corresponding text sample, the sample similarity of which exceeds the preset similarity threshold, are screened from the content sample set, so that the first content sample pair can be obtained, and a video sample and a corresponding text sample, the sample similarity of which does not exceed the preset similarity threshold, are screened from the content sample set, so that the second content sample pair can be obtained.
After the first content sample pair and the second content sample pair are screened out, the feature distance between the first content sample pair and the second content sample pair can be calculated, and the calculation modes can be various, for example, the content sample pair with the largest sample similarity can be screened out from the second content sample pair to obtain the target content sample pair, the similarity difference between the sample similarity of the first content sample pair and the sample similarity of the target content sample pair is calculated to obtain a first similarity difference, and the preset feature boundary value and the first similarity difference are fused to obtain feature loss information of the content sample set, as shown in formula (8):
LTri = Σb ( [Δ − S(Vb, Tb) + S(Vb, Tb′)]+ + [Δ − S(Vb, Tb) + S(Vb′, Tb)]+ )    (8)

where LTri is the feature loss information, [x]+ = max(0, x), Δ is the preset feature boundary value, the summation runs over the B content sample pairs in the content sample set, b denotes a matched video sample and text sample pair, and b′ denotes a hard negative sample, i.e. the video sample or text sample in the content sample pair with the largest sample similarity among the second content sample pairs. It can be seen that there may be two target content sample pairs; after the preset feature boundary value is fused with the first similarity difference, the fused similarity difference is subjected to standardization processing, and the standardized similarity differences are fused again, so that the feature loss information of the content sample set is obtained.
The feature loss information can be regarded as loss information obtained by back propagation and parameter updating using a triplet loss (a loss function), and it is mainly used for shortening the distance between matched video and text in the feature space.
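Assuming a batch similarity matrix sim[b, d] = S(Vb, Td) computed with the sample similarity above, the hard-negative triplet loss of formula (8) could be implemented roughly as follows (a sketch, not the patented implementation):

```python
# Sketch of the hard-negative triplet loss of formula (8), assuming a batch
# similarity matrix sim[b, d] = S(V_b, T_d); delta is the preset feature
# boundary value (margin). Not the patented implementation.
import torch

def triplet_loss_hard_negative(sim: torch.Tensor, delta: float = 0.2) -> torch.Tensor:
    B = sim.size(0)
    pos = sim.diag()                                                 # matched pairs S(V_b, T_b)
    mask = torch.eye(B, dtype=torch.bool, device=sim.device)
    neg_t = sim.masked_fill(mask, float("-inf")).max(dim=1).values   # hardest text for each video
    neg_v = sim.masked_fill(mask, float("-inf")).max(dim=0).values   # hardest video for each text
    loss = torch.relu(delta + neg_t - pos) + torch.relu(delta + neg_v - pos)
    return loss.mean()
```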
(2) Content loss information for a set of content samples is determined based on the sample video features and the sample text features.
For example, the feature similarity between the sample video features and the sample text features can be calculated to obtain the content similarity between the video sample and the text sample; a third content sample pair in which the video sample is matched with the text sample and a fourth content sample pair in which the video sample is not matched with the text sample are screened out from the content sample set according to the content similarity; a preset content boundary value corresponding to the content sample set is obtained; and a content difference value between the third content sample pair and the fourth content sample pair is calculated according to the preset content boundary value, so as to obtain the content loss information of the content sample set.
The content difference value between the third content sample pair and the fourth content sample pair may be calculated in various ways to obtain the content loss information of the content sample set. For example, the similarity difference between the content similarity of the third content sample pair and the content similarity of the fourth content sample pair may be calculated to obtain a second similarity difference, the second similarity difference is fused with the preset content boundary value to obtain the content difference between the third content sample pair and the fourth content sample pair, and the content difference is subjected to normalization processing to obtain the content loss information of the content sample set, as shown in formula (9):
Lmar = (1/B) Σb Σd≠b ( [Θ + Sh(Vb, Td) − Sh(Vb, Tb)]+ + [Θ + Sh(Vd, Tb) − Sh(Vb, Tb)]+ )    (9)

where Lmar is the content loss information, B is the number of content sample pairs in the content sample set, b denotes a matched video sample and text sample pair, d denotes a content sample in the content sample set other than b, Θ is the preset content boundary value, Sh is the content similarity, Tb is the text feature of the text sample, and Vb is the video feature of the matched video sample. The content loss information can be regarded as loss information obtained by back propagation and parameter updating with a bi-directional max-margin ranking loss (a loss function).
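Under the same batch-similarity-matrix assumption, a bi-directional max-margin ranking loss of the kind indicated by formula (9) can be sketched as:

```python
# Sketch of a bi-directional max-margin ranking loss of the kind indicated by
# formula (9); theta is the preset content boundary value and sim[b, d] is the
# content similarity S_h(V_b, T_d). Averaging over B is an assumption.
import torch

def max_margin_ranking_loss(sim: torch.Tensor, theta: float = 0.2) -> torch.Tensor:
    B = sim.size(0)
    pos = sim.diag().view(B, 1)
    cost_t = torch.relu(theta + sim - pos)        # video -> non-matching texts
    cost_v = torch.relu(theta + sim.t() - pos)    # text -> non-matching videos
    mask = torch.eye(B, dtype=torch.bool, device=sim.device)
    cost_t = cost_t.masked_fill(mask, 0.0)
    cost_v = cost_v.masked_fill(mask, 0.0)
    return (cost_t.sum() + cost_v.sum()) / B
```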
(3) And fusing the characteristic loss information and the content loss information, and converging the preset content retrieval model based on the fused loss information to obtain the trained content retrieval model.
For example, a preset balance parameter may be obtained, the preset balance parameter and the feature loss information are fused to obtain balanced feature loss information, and the balanced feature loss information and the content loss information are added to obtain fused loss information, as shown in formula (10):
L=Lmar+β*LTri (10)
Wherein L is post-fusion loss information, Lmar is content loss information, LTri is feature loss information, and β is a preset balance parameter for dimensionally balancing the two loss functions.
Optionally, weighting parameters of the feature loss information and the content loss information can be obtained, the feature loss information and the content loss information are weighted based on the weighting parameters, and the weighted feature loss information and the weighted content loss information are fused, so as to obtain the fused loss information.
After the fused loss information is obtained, the preset content retrieval model can be converged based on the fused loss information, and various convergence modes can be adopted, for example, the network parameters in the preset content retrieval model can be updated by adopting a gradient descent algorithm according to the fused loss information so as to converge the preset content retrieval model, and a trained content retrieval model can be obtained, or other algorithms can be adopted, and the network parameters in the preset content retrieval model can be updated by adopting the fused loss information so as to converge the preset content retrieval model, and a trained content retrieval model can be obtained.
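Putting the two losses together, a single training step might look like the following sketch; the model interface returning the two similarity matrices and the choice of optimiser are assumptions.

```python
# Sketch of one training step that fuses the two losses with the balance
# parameter beta, as in formula (10), and converges the model by gradient
# descent. The model interface and the optimiser choice are assumptions.
import torch

def training_step(model, optimizer, batch, beta: float = 1.0):
    sample_sim, content_sim = model(batch)            # hypothetical: batch similarity matrices
    l_tri = triplet_loss_hard_negative(sample_sim)    # feature loss information
    l_mar = max_margin_ranking_loss(content_sim)      # content loss information
    loss = l_mar + beta * l_tri                       # fused loss information, formula (10)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```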
In the training process of the content retrieval model, the text sample and the video sample are subjected to multi-step cross-attention calculation and content similarity calculation, and back propagation and parameter updating are performed using a triplet loss and a bi-directional max-margin ranking loss respectively, so that the trained content retrieval model is obtained, as shown in Fig. 4.
103. And respectively performing feature extraction on the modal features corresponding to each mode to obtain the modal content features corresponding to each mode.
Wherein the modality content characteristics may be an overall characteristic of each modality content for indicating the content characteristics under the modality.
The mode of extracting the mode features may be various, and specifically may be as follows:
For example, according to the modes of the mode features, a target video feature extraction network corresponding to each mode can be identified in the video feature extraction network of the trained content retrieval model, and the mode features are extracted by adopting the target video feature extraction network to obtain the mode content features corresponding to each mode.
Each mode of the video feature extraction networks of the trained content retrieval model is fixed, so the video feature extraction network corresponding to a mode can be identified simply according to the mode of the modal feature, and the identified video feature extraction network is used as the target video feature extraction network.
After the target video feature extraction network is identified, the target video feature extraction network may be used to perform feature extraction on the modal features. The feature extraction process may take various forms; for example, the target video feature extraction network may be the encoder of a mode-specific Transformer, which encodes the modal features so as to extract the modal content features corresponding to the video content of each mode.
104. And fusing the modal content characteristics to obtain video characteristics of the video content, and searching target text content corresponding to the video content in the preset content set according to the video characteristics.
The modal content features may be fused in various manners, which may specifically be as follows:
For example, the modal content features of each mode may be combined to obtain a modal content feature set of the video content, the modal content feature set is input into a Transformer model for encoding so as to calculate the association weights of the modal content features, the modal content features are weighted according to the association weights, and the weighted modal content features are fused to obtain the video features of the video content; or a weighting parameter corresponding to each mode is obtained, the modal content features are weighted based on the weighting parameters, and the weighted modal content features are fused to obtain the video features of the video content; or the modal content features are directly spliced (concatenated) to obtain the video features of the video content.
After the video features of the video content are obtained, the target text content corresponding to the video content can be searched in the preset content set according to the video features, and the searching mode can be various, for example, feature similarity between the video features and text features of candidate text content in the preset content set can be calculated respectively, and the target text content corresponding to the video content is screened out from the candidate text content according to the feature similarity.
The text feature extraction method for the candidate text content may be various, for example, a text encoder may be used to extract features of the candidate text content to obtain text features of the candidate text content, the text encoder may be various in type, for example, may include a Bert and a word2vector, or may also extract features of each text word in the candidate text content, then calculate an association weight between each text word, and weight the text word features based on the association weight, so as to obtain the text features of the candidate text content. The time for extracting the text features of the candidate text contents in the preset content set can be various, for example, the time can be real-time extraction, for example, when the acquired content to be searched is video content, the text features of the candidate text contents can be extracted to obtain the text features of the candidate text contents, or the text features of the candidate text contents in the preset content set can be extracted to obtain the text features of the candidate text contents before the content to be searched is acquired, so that the feature similarity between the text features and the video features can be calculated offline, and the target text contents corresponding to the video content can be screened out from the candidate text contents more quickly.
The method for calculating the feature similarity between the video feature and the text feature of the candidate text content may also be various, for example, cosine similarity between the video feature and the text feature of the candidate text content may be calculated, so that the feature similarity may be obtained, or feature distance between the video feature and the text feature of the candidate text content may also be calculated, and the feature similarity between the video feature and the text feature may be determined according to the feature distance.
After the feature similarity is calculated, the target text content corresponding to the video content can be screened out from the candidate text contents according to the feature similarity. For example, candidate text contents whose feature similarity exceeds a preset similarity threshold may be screened out and sorted, and the sorted candidate text contents taken as the target text content corresponding to the video content; or the candidate text contents may be sorted by feature similarity and the target text content screened out from the sorted candidates. One or more target text contents may be screened out: when a single target text content is required, the candidate text content with the largest feature similarity to the video features is taken as the target text content; when multiple target text contents are required, the TOP N candidate text contents ranked highest by feature similarity to the video features are selected from the sorted candidate text contents.
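A minimal sketch of this similarity ranking, assuming the candidate text features have already been computed; the function name, the optional threshold handling and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def retrieve_top_text(video_feat, candidate_text_feats, top_n=5, threshold=None):
    """Rank candidate texts for one video feature by cosine similarity (illustrative)."""
    sims = F.cosine_similarity(video_feat.unsqueeze(0), candidate_text_feats, dim=-1)
    if threshold is not None:                        # optional pre-filtering step
        keep = sims >= threshold
        sims = sims.masked_fill(~keep, float("-inf"))
    scores, indices = torch.topk(sims, k=min(top_n, sims.numel()))
    return indices.tolist(), scores.tolist()         # TOP N candidate indices and scores

# usage: 1000 pre-computed candidate text features, 512-d each
video_feat = torch.randn(512)
text_feats = torch.randn(1000, 512)
top_ids, top_scores = retrieve_top_text(video_feat, text_feats, top_n=10)
```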
Optionally, when the content to be retrieved is text content, feature extraction may be further performed on the text content, and according to the extracted text feature, the target video content corresponding to the text content may be retrieved from the preset content set, which may specifically be as follows:
For example, when the content to be retrieved is text content, the text content is extracted by adopting a text feature extraction network of the trained content retrieval model, so as to obtain text features of the text content. And respectively calculating the feature similarity between the text feature and the video feature of the candidate video content in the preset content set, and screening out target video content corresponding to the text content from the candidate video content according to the feature similarity.
The text content may be processed in various ways. For example, a text encoder such as BERT or word2vec may be used to extract the overall features of the text content to obtain its text features; alternatively, features may be extracted for each text word in the text content, the association weights between the text words calculated, and the text word features weighted based on those association weights to obtain the text features.
After extracting the text features of the text content, the feature similarity between the text features and the video features can be calculated, and the feature similarity can be calculated in various manners, for example, feature extraction can be performed on candidate video contents in a preset content set to obtain video features of each candidate video content, and then cosine similarity between the text features and the video features is calculated, so that the feature similarity can be obtained.
The video features of the candidate video contents may be extracted in various ways. For example, the trained content retrieval model may be used to perform multi-modal feature extraction on each candidate video content to obtain the modal features of each modality, feature extraction may then be performed on the modal features of each modality to obtain the corresponding modal video features, and the modal video features may be fused to obtain the video features of each candidate video content. The video features of the candidate video contents may also be extracted at various times: in real time, for example each time content to be retrieved is obtained; or in advance, before the content to be retrieved is obtained, so that the feature similarity between the text features and the video features can be computed offline and the target video content corresponding to the text content screened out from the candidate video contents more quickly.
The target video content corresponding to the text content may be screened out from the candidate video contents according to the feature similarity in various ways. For example, candidate video contents whose feature similarity exceeds a preset similarity threshold may be screened out and sorted, and the sorted candidate video contents taken as the target video content corresponding to the text content; or the candidate video contents may be sorted by feature similarity and the target video content screened out from the sorted candidates. One or more target video contents may be screened out: when a single target video content is required, the candidate video content with the largest feature similarity to the text features is taken as the target video content; when multiple target video contents are required, the TOP N candidate video contents ranked highest by feature similarity to the text features are selected from the sorted candidate video contents. In this scheme, the multi-modal information in the video is extracted more completely and the more important words in the query text receive more attention, leading to better retrieval results. On the MSR-VTT, LSMDC and ActivityNet datasets, the retrieval performance is substantially improved over current mainstream methods, as shown in Tables 1, 2 and 3, where R1, R5, R10 and R50 denote the recall at rank 1, 5, 10 and 50, respectively, and MdR and MnR denote the median and mean rank.
Table 1. Results on the MSR-VTT dataset
Table 2. Results on the LSMDC dataset
Table 3. Results on the ActivityNet dataset
As can be seen from the above, after obtaining the content to be searched for searching the target content, when the content to be searched is the video content, the embodiment of the application performs multi-mode feature extraction on the video content to obtain the mode feature of each mode, performs feature extraction on the mode feature of each mode to obtain the mode content feature of each mode, fuses the mode content features to obtain the video feature of the video content, and searches the target text content corresponding to the video content in the preset content set according to the video feature.
According to the method described in the above embodiments, examples are described in further detail below.
In this embodiment, the content search device is specifically integrated in an electronic device, and the electronic device is exemplified as a server.
The server trains the content retrieval model
And C1, the server acquires a content sample set.
For example, the server may directly obtain the video samples and text samples to form a content sample set. Alternatively, it may obtain the original video content and original text content, send them to a labeling server, receive the matching tags between the original video content and original text content returned by the labeling server, and add the matching tags to the original video content and original text content, thereby obtaining the video samples and text samples, which are combined into a content sample set. When the number of content samples in the content sample set is large or the set occupies a large amount of storage, the server may instead receive a model training request carrying the storage address of the content sample set and obtain the content sample set from memory, a cache, or a third-party database according to that storage address.
And C2, the server adopts a preset content retrieval model to extract multi-mode characteristics of the video sample, and sample mode characteristics of each mode are obtained.
For example, the server performs multi-modal feature extraction on the video sample by using the preset content retrieval model to obtain the initial sample modal features of each modality in the video sample. It then extracts video frames from the video sample, performs multi-modal feature extraction on the video frames by using the preset content retrieval model to obtain the basic sample modal features of each video frame, screens out the target sample modal features corresponding to each modality from the basic sample modal features, and fuses the target sample modal features with the corresponding initial sample modal features to obtain the sample modal features of each modality.
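A rough sketch of how the video-level and frame-level modal features might be combined; the placeholder extractor callables, the mean-pooling over frames and the additive fusion are assumptions chosen only to make the example concrete.

```python
import torch

def extract_sample_modal_features(video_feat_fn, frame_feat_fn, video, frames, modalities):
    """Sketch: combine video-level and frame-level features per modality.
    video_feat_fn / frame_feat_fn are placeholder extractors returning
    {modality_name: tensor} dictionaries; they stand in for the expert models."""
    initial = video_feat_fn(video)                  # initial modal features of the whole video
    per_frame = [frame_feat_fn(f) for f in frames]  # basic modal features of each frame
    fused = {}
    for m in modalities:
        # screen out the target modal features for this modality and pool them over frames
        target = torch.stack([pf[m] for pf in per_frame], dim=0).mean(dim=0)
        fused[m] = initial[m] + target              # fuse with the corresponding initial feature
    return fused

# usage with dummy extractors (each modality mapped to a 512-d vector)
mods = ["action", "audio", "scene", "face", "entity"]
dummy = lambda x: {m: torch.randn(512) for m in mods}
features = extract_sample_modal_features(dummy, dummy, video=None, frames=[0, 1, 2], modalities=mods)
```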
And C3, the server respectively performs feature extraction on the sample modal features of each mode to obtain sample modal content features of the video sample, and fuses the sample modal content features to obtain sample video features of the video sample.
For example, the server identifies the Transformer network corresponding to each modality in the video feature extraction network of the preset content retrieval model according to the modality of the sample modal features, and encodes the sample modal features with the encoder of that Transformer network, thereby extracting the sample modal content features of each modality. The sample modal content features of the modalities are then combined to obtain a sample modal content feature set of the video sample, which is input into an overall Transformer network for encoding so as to calculate the association weights of the sample modal content features; the sample modal content features are weighted according to the association weights, and the weighted sample modal content features are fused to obtain the sample video features of the video sample.
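The per-modality Transformer encoders could look roughly like the following sketch; the feature dimension, number of layers and heads, and the mean-pooling over time are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PerModalityEncoders(nn.Module):
    """Sketch: one Transformer encoder per modality produces that modality's content feature."""
    def __init__(self, modalities, dim=512):
        super().__init__()
        self.encoders = nn.ModuleDict({
            m: nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
                num_layers=2)
            for m in modalities
        })

    def forward(self, modal_sequences):   # modal_sequences: {modality: (batch, time, dim)}
        # encode each modality's sequence and pool over time into one content feature
        return {m: self.encoders[m](seq).mean(dim=1) for m, seq in modal_sequences.items()}

# usage: two modalities with different sequence lengths
enc = PerModalityEncoders(["action", "audio"])
out = enc({"action": torch.randn(2, 16, 512), "audio": torch.randn(2, 30, 512)})
```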
And C4, extracting features of the text sample by the server to obtain sample text features and text word features corresponding to each text word, and converging a preset content retrieval model according to the sample modal content features, the sample video features, the sample text features and the text word features to obtain a trained content retrieval model.
For example, the server may use a text encoder such as BERT or word2vec to perform feature extraction on the text sample, so as to obtain the sample text features and the text word features. The feature loss information of the content sample set is determined according to the sample modal content features and the text word features, the content loss information of the content sample set is determined based on the sample video features and the sample text features, the feature loss information and the content loss information are fused, and the preset content retrieval model is converged based on the fused loss information to obtain the trained content retrieval model, which may specifically be as follows:
(1) And the server determines the feature loss information of the content sample set according to the sample modal content features and the text word features.
For example, the server may calculate the cosine similarity between the sample modal content features and the text word features as the first feature similarity, as shown in formula (1). The first feature similarity is normalized with an activation function, for example ReLU(x) = max(0, x), as shown in formula (2), to obtain the normalized target feature similarity. A preset association parameter is then obtained and fused with the target feature similarity to obtain the association weights, which can also be understood as attention weights, as shown in formula (3). The sample modal content features are weighted based on the association weights, and the weighted sample modal content features are fused; the result is taken as the video features after initial interaction of the video sample, as shown in formula (4).
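Since formulas (1) to (4) are not reproduced in this text, the following sketch gives only one plausible reading of this word-conditioned weighting; the softmax, the use of a temperature as the association parameter, and all tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def interact_video_with_words(modal_feats, word_feats, temperature=1.0):
    """Illustrative reading of formulas (1)-(4): word-conditioned weighting of modal features.
    modal_feats: (num_modalities, dim), word_feats: (num_words, dim);
    `temperature` stands in for the preset association parameter."""
    # (1) first feature similarity: cosine similarity between every modality and every word
    sim = F.cosine_similarity(modal_feats.unsqueeze(1), word_feats.unsqueeze(0), dim=-1)
    # (2) normalise with ReLU so only positively related pairs contribute
    sim = F.relu(sim)
    # (3) association (attention) weights, scaled by the association parameter
    weights = torch.softmax(sim / temperature, dim=0)           # (num_modalities, num_words)
    # (4) weight the modal content features and fuse them per word
    interacted_video = weights.transpose(0, 1) @ modal_feats    # (num_words, dim)
    return interacted_video

video_per_word = interact_video_with_words(torch.randn(5, 512), torch.randn(12, 512))
```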
After calculating the video features after initial interaction, the server can obtain a preset update parameter and fuse the preset update parameter, the video features after initial interaction, and the text word features to obtain the text word features after initial interaction, as shown in formula (5). The feature similarity between the video features after initial interaction and the text word features after initial interaction is then calculated to obtain a third feature similarity; according to the third feature similarity, feature interaction is performed between the two to obtain the video features after target interaction and the text word features after target interaction; the video features after target interaction are taken as the video features after initial interaction, the text word features after target interaction are taken as the text word features after initial interaction, and the step of calculating the feature similarity between them is performed again, until the number of feature interactions reaches a preset number, yielding the video features after interaction and the text word features after interaction.
After obtaining the video features after interaction and the text word features after interaction, the server can calculate the feature similarity between them to obtain a second feature similarity, and fuse the second feature similarities to obtain the sample similarity between the video sample and the text sample, as shown in formula (7). The sample similarity is compared with a preset similarity threshold: video samples and corresponding text samples whose sample similarity exceeds the preset similarity threshold are screened out to obtain first content sample pairs, and video samples and corresponding text samples in the content sample set whose sample similarity does not exceed the threshold are screened out to obtain second content sample pairs. A preset feature boundary value corresponding to the content sample set is then obtained, the content sample pair with the largest sample similarity is screened out of the second content sample pairs as the target content sample pair, the similarity difference between the sample similarity of the first content sample pair and that of the target content sample pair is calculated to obtain a first similarity difference, and the preset feature boundary value is fused with the first similarity difference to obtain the feature loss information of the content sample set, as shown in formula (8).
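A hedged sketch of the feature loss in formula (8), read here as a hinge over the hardest unmatched pair; the margin stands in for the preset feature boundary value, and the diagonal pairing convention is an assumption.

```python
import torch

def feature_loss(sample_sims, match_mask, margin=0.2):
    """Illustrative reading of formula (8): hinge over the hardest negative sample pair.
    sample_sims: (num_videos, num_texts) sample similarities; match_mask marks matched pairs;
    `margin` stands in for the preset feature boundary value."""
    pos = sample_sims[match_mask]                               # similarities of matched pairs
    neg = sample_sims.masked_fill(match_mask, float("-inf"))    # unmatched (negative) pairs
    hardest_neg = neg.max()                                     # target pair: most similar negative
    # first similarity difference fused with the boundary value, clamped at zero
    return torch.clamp(margin - (pos - hardest_neg), min=0).mean()

sims = torch.randn(4, 4)
mask = torch.eye(4, dtype=torch.bool)   # diagonal pairs are the matched video-text samples
loss_feat = feature_loss(sims, mask)
```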
(2) The server determines content loss information for the set of content samples based on the sample video features and the sample text features.
For example, the server may calculate the feature similarity between the sample video features and the sample text features to obtain the content similarity between the video sample and the text sample, and, according to the content similarity, screen out from the content sample set third content sample pairs in which the video sample matches the text sample and fourth content sample pairs in which they do not match, and obtain the preset content boundary value corresponding to the content sample set. The similarity difference between the content similarity of the third content sample pair and that of the fourth content sample pair is calculated to obtain a second similarity difference, the second similarity difference is fused with the preset content boundary value to obtain the content difference between the third content sample pair and the fourth content sample pair, and the content difference is normalized to obtain the content loss information of the content sample set, as shown in formula (9).
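Similarly, formula (9) might be realised as the following margin-based content loss; the assumption that the i-th video matches the i-th text, the margin value and the mean normalisation are illustrative choices, not details disclosed above.

```python
import torch
import torch.nn.functional as F

def content_loss(video_feats, text_feats, margin=0.2):
    """Illustrative reading of formula (9): margin loss over matched vs. unmatched pairs,
    averaged (normalised) over the batch; `margin` stands in for the preset content
    boundary value. The i-th video is assumed to match the i-th text."""
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    sims = v @ t.t()                                    # content similarity matrix
    pos = sims.diag().unsqueeze(1)                      # third (matched) content sample pairs
    mask = torch.eye(sims.size(0), dtype=torch.bool)
    # second similarity difference fused with the boundary value, hinge at zero
    cost = torch.clamp(margin - (pos - sims), min=0).masked_fill(mask, 0.0)
    return cost.mean()                                  # normalisation over all pairs

loss_content = content_loss(torch.randn(8, 512), torch.randn(8, 512))
```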
(3) The server fuses the characteristic loss information and the content loss information, and converges the preset content retrieval model based on the fused loss information to obtain a trained content retrieval model.
For example, the server may obtain a preset balance parameter, fuse the preset balance parameter with the feature loss information to obtain the balanced feature loss information, and add the balanced feature loss information to the content loss information to obtain the fused loss information, as shown in formula (10). The network parameters of the preset content retrieval model are then updated with a gradient descent algorithm according to the fused loss information so that the preset content retrieval model converges, yielding the trained content retrieval model; other optimization algorithms may also be used to update the network parameters based on the fused loss information.
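The fused loss of formula (10) and the parameter update might be wired together as follows; the balance value and the convention that the model returns both loss terms are assumptions made for the sketch.

```python
def train_step(model, optimizer, batch, balance=0.5):
    """Sketch of formula (10): the balanced feature loss is added to the content loss,
    and the fused loss drives a gradient-descent update. `balance` stands in for the
    preset balance parameter; `model(batch)` is assumed to return both loss terms,
    computed as in the sketches above."""
    loss_feat, loss_content = model(batch)
    loss = balance * loss_feat + loss_content   # fused loss information
    optimizer.zero_grad()
    loss.backward()                             # gradient descent on the network parameters
    optimizer.step()
    return loss.item()

# usage (assumed): optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
```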
As shown in fig. 5, a content retrieval method specifically includes the following steps:
201. The server acquires the content to be retrieved for retrieving the target content.
For example, the server may directly receive the content to be searched sent by the user through the terminal, or may acquire the content to be searched from the network or the third party database, or when the memory of the content to be searched is large or the number of the content to be searched is large, receive a content search request, where the content search request carries a memory address of the content to be searched, and acquire the content to be searched from the memory, the cache or the third party database according to the memory address.
202. When the content to be retrieved is video content, the server performs multi-mode feature extraction on the video content to obtain mode features corresponding to a plurality of modes.
For example, when the content to be retrieved is video content, the server performs multi-modal feature extraction on the video content by using a trained content retrieval model to obtain initial modal features of each mode in the video content, extracts video frames from the video content, performs multi-modal feature extraction on the video frames by using the trained content retrieval model to obtain basic modal features of each video frame, screens out target modal features corresponding to each mode from the basic modal features, and fuses the target modal features with the corresponding initial modal features to obtain modal features of each mode.
The video content and the video frames in the video content may involve multiple modalities. For the action modality, feature extraction may be performed with an S3D model pre-trained on an action recognition dataset; for the audio modality, with a pre-trained VGGish model; for the scene modality, with a pre-trained DenseNet-161 model; for the face modality, with a pre-trained SSD detector together with a ResNet model, or with a Google API; and for the entity modality, with a pre-trained SENet-154 model. The extracted initial modality features and basic modality features may include image features, expert features, temporal features, and the like.
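For illustration, the per-modality experts could be assembled as a simple mapping; the stub extractors and the feature dimensions shown are assumptions standing in for the pre-trained S3D, VGGish, DenseNet-161, SSD+ResNet and SENet-154 models mentioned above.

```python
import torch

# Placeholder expert extractors; in the described system these would be the pre-trained
# S3D (action), VGGish (audio), DenseNet-161 (scene), SSD+ResNet (face) and SENet-154
# (entity) models. Here each is a stub returning a random feature of an assumed size.
EXPERTS = {
    "action": lambda clip: torch.randn(1024),
    "audio":  lambda clip: torch.randn(128),
    "scene":  lambda clip: torch.randn(2208),
    "face":   lambda clip: torch.randn(512),
    "entity": lambda clip: torch.randn(2048),
}

def extract_expert_features(clip):
    """Run every modality expert on a clip and collect the resulting modal features."""
    return {modality: expert(clip) for modality, expert in EXPERTS.items()}

modal_features = extract_expert_features(clip=None)
```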
203. And the server respectively performs feature extraction on the modal features corresponding to each mode to obtain the modal content features corresponding to each mode.
For example, according to the modality of the modal features, the Transformer network corresponding to each modality can be identified in the video feature extraction network of the trained content retrieval model as the target video feature extraction network, and the modal features are encoded with the encoder of the modality-specific Transformer, thereby extracting the modal content features corresponding to each modality.
204. And the server fuses the modal content characteristics to obtain the video characteristics of the video content.
For example, the server may combine the modal content features of each modality to obtain a modal content feature set of the video content, input the modal content feature set into a Transformer model for encoding so as to calculate the association weights of the modal content features, weight the modal content features according to the association weights, and fuse the weighted modal content features to obtain the video features of the video content; or it may obtain a weighting parameter corresponding to each modality, weight the modal content features based on the weighting parameters, and fuse the weighted modal content features to obtain the video features of the video content; or it may directly concatenate the modal content features to obtain the video features of the video content.
205. And the server searches the target text content corresponding to the video content in the preset content set according to the video characteristics.
For example, the server may extract features of the candidate text content by using a text encoder such as Bert or word2vector to obtain text features of the candidate text content, or may extract features of each text word in the candidate text content, then calculate an association weight between each text word, and weight the text word features based on the association weight, so as to obtain the text features of the candidate text content.
The server calculates cosine similarity between the video feature and the text feature of the candidate text content, so that feature similarity can be obtained, or can also calculate feature distance between the video feature and the text feature of the candidate text content, and determine feature similarity between the video feature and the text feature according to the feature distance.
The server screens out, from the candidate text contents, those whose feature similarity exceeds a preset similarity threshold, sorts the screened candidate text contents, and takes the sorted candidate text contents as the target text content corresponding to the video content; or it sorts the candidate text contents by feature similarity and screens out the target text content from the sorted candidates. One or more target text contents may be screened out: when a single target text content is required, the candidate text content with the largest feature similarity to the video features is taken as the target text content; when multiple target text contents are required, the TOP N candidate text contents ranked highest by feature similarity to the video features are selected from the sorted candidate text contents.
The time for extracting the text features of the candidate text contents in the preset content set may be various, for example, the time may be real-time extraction, for example, when the acquired content to be searched is video content, the text features of the candidate text contents may be extracted, so as to obtain the text features of the candidate text contents, or the text features of the candidate text contents in the preset content set may be extracted before the content to be searched is acquired, so as to obtain the text features of the candidate text contents, thereby realizing offline calculation of feature similarity between the text features and the video features, and further screening the target text contents corresponding to the video contents from the candidate text contents more quickly.
206. When the content to be searched is text content, the server performs feature extraction on the text content, and searches target video content corresponding to the text content in a preset content set according to the extracted text features.
For example, when the content to be retrieved is text content, the server may extract the overall feature in the text content by using a text encoder such as Bert or word2vector, to obtain the text feature of the text content. And carrying out multi-mode feature extraction on the candidate video content by adopting the trained content retrieval model to obtain mode features corresponding to a plurality of modes, respectively carrying out feature extraction on the mode features corresponding to each mode to obtain mode video features corresponding to each mode, and fusing the mode video features to obtain the video features of each candidate video content. Then, the cosine similarity between the text feature and the video feature is calculated, so that the feature similarity can be obtained. Candidate video contents with feature similarity exceeding a preset similarity threshold are screened out of the candidate video contents, the screened candidate video contents are ranked, the ranked candidate video contents are used as target video contents corresponding to text contents, or the candidate video contents can be ranked according to the feature similarity, the target video contents corresponding to the text contents are screened out of the ranked candidate video contents, one or more candidate video contents can be screened out, when the number of the candidate video contents is one, the candidate video contents with the largest feature similarity with the text features can be used as target video contents, and when the number of the target video contents is more, TOP N candidate video contents with the front feature similarity ranking with the text features can be screened out of the ranked candidate video contents as target video contents.
The time for extracting the video features of the candidate video content may be various, for example, the video features of the candidate video content may be extracted in real time, for example, each time the content to be searched is obtained, the video features may be extracted from the candidate video content, or before the content to be searched is obtained, feature extraction may be performed on each candidate video content in the preset content set, so as to extract the video features, thereby implementing off-line calculation of feature similarity between the text features and the video features, and further, more quickly screening out the target video content corresponding to the text content from the candidate video content.
As can be seen from the above, after obtaining the content to be searched for the target content, when the content to be searched for is the video content, the server in the embodiment of the application performs multi-mode feature extraction on the video content to obtain the mode feature of each mode, performs feature extraction on the mode feature of each mode to obtain the mode content feature of each mode, fuses the mode content features to obtain the video feature of the video content, searches the target text content corresponding to the video content in the preset content set according to the video feature, performs feature extraction on the text content when the content to be searched for is the text content, and searches the target video content corresponding to the text content in the preset content set according to the extracted text feature.
In order to better implement the above method, the embodiment of the present invention further provides a content retrieval device, where the content retrieval device may be integrated into an electronic device, such as a server or a terminal, where the terminal may include a tablet computer, a notebook computer, and/or a personal computer.
For example, as shown in fig. 6, the content retrieval device may include an acquisition unit 301, a first extraction unit 302, a second extraction unit 303, and a text retrieval unit 304, as follows:
(1) An acquisition unit 301;
an acquisition unit 301 for acquiring content to be retrieved for retrieving the target content.
For example, the obtaining unit 301 may be specifically configured to receive content to be retrieved sent by a user through a terminal, or may obtain the content to be retrieved from a network or a third party database, or when the memory of the content to be retrieved is large or the number of the content to be retrieved is large, receive a content retrieval request, where the content retrieval request carries a storage address of the content to be retrieved, and obtain the content to be retrieved from the memory, the cache, or the third party database according to the storage address.
(2) A first extraction unit 302;
The first extracting unit 302 is configured to perform multi-mode feature extraction on the video content when the content to be retrieved is the video content, so as to obtain a mode feature of each mode.
For example, the first extraction unit 302 may specifically be configured to, when the content to be retrieved is video content, perform multi-modal feature extraction on the video content by using a trained content retrieval model to obtain initial modal features of each mode in the video content, extract video frames from the video content, perform multi-modal feature extraction on the video frames by using a trained content retrieval model to obtain basic modal features of each video frame, screen out target modal features corresponding to each mode from the basic modal features, and fuse the target modal features with the corresponding initial modal features to obtain modal features of the video content of each mode.
(3) A second extraction unit 303;
The second extracting unit 303 is configured to perform feature extraction on the modal feature of each mode, so as to obtain a modal content feature of each mode.
For example, the second extraction unit 303 may be specifically configured to identify, according to the modes of the mode features, a target video feature extraction network corresponding to each mode in the video feature extraction networks of the trained content retrieval model, and perform feature extraction on the mode features by using the target video feature extraction network to obtain the mode content feature of each mode.
(4) A text retrieval unit 304;
The text retrieval unit 304 is configured to fuse the modal content features to obtain video features of the video content, and retrieve target text content corresponding to the video content from the preset content set according to the video features.
For example, the text retrieval unit 304 may specifically be configured to combine the modal content features of each modality to obtain a modal content feature set of the video content, input the modal content feature set into a Transformer model for encoding so as to calculate the association weights of the modal content features, weight the modal content features according to the association weights, fuse the weighted modal content features to obtain the video features of the video content, respectively calculate the feature similarity between the video features and the text features of the candidate text contents in the preset content set, and screen out the target text content corresponding to the video content from the candidate text contents according to the feature similarity.
Optionally, the content retrieval device may further include a training unit 305, as shown in fig. 7, specifically may be as follows:
the training unit 305 is configured to train the preset content retrieval model to obtain a trained content retrieval model.
For example, the training unit 305 may specifically be configured to obtain a content sample set, where the content sample set includes a video sample and a text sample, the text sample includes at least one text word, a preset content search model is used to perform multi-modal feature extraction on the video sample to obtain sample modal features of each mode, feature extraction is performed on the sample modal features of each mode to obtain sample modal content features of the video sample, the sample modal content features are fused to obtain sample video features of the video sample, feature extraction is performed on the text sample to obtain sample text features and text word features corresponding to each text word, and the preset content search model is converged according to the sample modal content features, the sample video features, the sample text features and the text word features to obtain a trained content search model.
Optionally, the content retrieval device may further include a video retrieval unit 306, as shown in fig. 8, and specifically may be as follows:
the video retrieving unit 306 is configured to perform feature extraction on the text content when the content to be retrieved is the text content, and retrieve, according to the extracted text feature, the target video content corresponding to the text content from the preset content set.
For example, the video retrieving unit 306 may be specifically configured to, when the content to be retrieved is text content, perform feature extraction on the text content by using a text feature extraction network of the trained content retrieval model, so as to obtain text features of the text content. And respectively calculating the feature similarity between the text feature and the video feature of the candidate video content in the preset content set, and screening out target video content corresponding to the text content from the candidate video content according to the feature similarity.
In the implementation, each unit may be implemented as an independent entity, or may be implemented as the same entity or several entities in any combination, and the implementation of each unit may be referred to the foregoing method embodiment, which is not described herein again.
As can be seen from the foregoing, in this embodiment, after the obtaining unit 301 obtains the content to be searched for the target content, when the content to be searched is the video content, the first extracting unit 302 performs multi-mode feature extraction on the video content to obtain the mode feature of each mode, the second extracting unit 303 performs feature extraction on the mode feature of each mode to obtain the mode content feature of each mode, the text retrieving unit 304 fuses the mode content features to obtain the video feature of the video content, and searches the target text content corresponding to the video content in the preset content set according to the video feature, and because the scheme performs multi-mode feature extraction on the video content first, then extracts the mode video feature from the mode feature corresponding to each mode, thereby improving the accuracy of the mode video feature in the video, and fuses the mode video feature to obtain the video feature of the video content, so that the extracted video feature can better express the information in the video, thereby improving the accuracy of content search.
The embodiment of the invention also provides an electronic device, as shown in fig. 9, which shows a schematic structural diagram of the electronic device according to the embodiment of the invention, specifically:
The electronic device may include a processor 401 with one or more processing cores, a memory 402 of one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. It will be appreciated by those skilled in the art that the electronic device structure shown in fig. 9 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components. Wherein:
The processor 401 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402, thereby controlling the electronic device as a whole. Optionally, the processor 401 may include one or more processing cores, and preferably the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, etc., and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by executing the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area that may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), etc., and a storage data area that may store data created according to the use of the electronic device, etc. In addition, memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components, preferably the power supply 403 may be logically connected to the processor 401 by a power management system, so that functions of managing charging, discharging, and power consumption are performed by the power management system. The power supply 403 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may further comprise an input unit 404, which input unit 404 may be used for receiving input digital or character information and generating keyboard, mouse, joystick, optical or trackball signal inputs in connection with user settings and function control.
Although not shown, the electronic device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the electronic device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions as follows:
When the content to be searched is video content, carrying out multi-mode feature extraction on the video content to obtain the mode feature of each mode, carrying out feature extraction on the mode feature of each mode to obtain the mode content feature of each mode, fusing the mode content features to obtain the video feature of the video content, and searching the target text content corresponding to the video content in a preset content set according to the video feature.
For example, the electronic device receives the content to be searched sent by the user through the terminal, or may acquire the content to be searched from the network or the third party database, or when the memory of the content to be searched is large or the number of the content to be searched is large, receives a content search request, where the content search request carries a memory address of the content to be searched, and acquires the content to be searched from the memory, the cache or the third party database according to the memory address. When the content to be searched is video content, performing multi-mode feature extraction on the video content by adopting a trained content search model to obtain initial mode features of each mode in the video content, extracting video frames from the video content, performing multi-mode feature extraction on the video frames by adopting the trained content search model to obtain basic mode features of each video frame, screening out target mode features corresponding to each mode from the basic mode features, and fusing the target mode features with the corresponding initial mode features to obtain mode features of each mode. And identifying a target video feature extraction network corresponding to each mode in the video feature extraction network of the trained content retrieval model according to the mode of the mode feature, and carrying out feature extraction on the mode feature by adopting the target video feature extraction network to obtain the mode content feature corresponding to each mode. Combining the modal content characteristics of each mode to obtain a sample modal content characteristic set of the video content, inputting the modal content characteristic set into a transducer model for encoding so as to calculate the association weight of the modal content characteristics, weighting the modal content characteristics according to the association weight, fusing the weighted modal content characteristics to obtain the video characteristics of the video content, respectively calculating the characteristic similarity between the video characteristics and the text characteristics of candidate text contents in the preset content set, and screening out target text contents corresponding to the video content from the candidate text contents according to the characteristic similarity.
The specific implementation of each operation may be referred to the previous embodiments, and will not be described herein.
As can be seen from the above, after obtaining the content to be searched for searching the target content, when the content to be searched is the video content, the embodiment of the invention performs multi-mode feature extraction on the video content to obtain the mode feature of each mode, performs feature extraction on the mode feature of each mode to obtain the mode content feature of each mode, fuses the mode content features to obtain the video feature of the video content, and searches the target text content corresponding to the video content in the preset content set according to the video feature.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present invention provide a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any of the content retrieval methods provided by embodiments of the present invention. For example, the instructions may perform the steps of:
When the content to be searched is video content, carrying out multi-mode feature extraction on the video content to obtain the mode feature of each mode, carrying out feature extraction on the mode feature of each mode to obtain the mode content feature of each mode, fusing the mode content features to obtain the video feature of the video content, and searching the target text content corresponding to the video content in a preset content set according to the video feature.
For example, the content to be searched sent by the user through the terminal is received, or the content to be searched can be obtained from a network or a third party database, or when the memory of the content to be searched is large or the number of the memory is large, a content search request is received, the content search request carries the memory address of the content to be searched, and the content to be searched is obtained from the memory, the cache or the third party database according to the memory address. When the content to be searched is video content, performing multi-mode feature extraction on the video content by adopting a trained content search model to obtain initial mode features of each mode in the video content, extracting video frames from the video content, performing multi-mode feature extraction on the video frames by adopting the trained content search model to obtain basic mode features of each video frame, screening out target mode features corresponding to each mode from the basic mode features, and fusing the target mode features with the corresponding initial mode features to obtain mode features of each mode. And identifying a target video feature extraction network corresponding to each mode in the video feature extraction network of the trained content retrieval model according to the mode of the mode feature, and carrying out feature extraction on the mode feature by adopting the target video feature extraction network to obtain the mode content feature corresponding to each mode. Combining the modal content characteristics of each mode to obtain a sample modal content characteristic set of the video content, inputting the modal content characteristic set into a transducer model for encoding so as to calculate the association weight of the modal content characteristics, weighting the modal content characteristics according to the association weight, fusing the weighted modal content characteristics to obtain the video characteristics of the video content, respectively calculating the characteristic similarity between the video characteristics and the text characteristics of candidate text contents in the preset content set, and screening out target text contents corresponding to the video content from the candidate text contents according to the characteristic similarity.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
The computer readable storage medium may include, among others, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disc, and the like.
Because the instructions stored in the computer readable storage medium can execute the steps in any content retrieval method provided by the embodiments of the present invention, the beneficial effects that any content retrieval method provided by the embodiments of the present invention can be achieved, and detailed descriptions of the foregoing embodiments are omitted herein.
Wherein according to an aspect of the application, a computer program product or a computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the methods provided in various alternative implementations of the content retrieval aspect or the video/text bi-directional retrieval aspect described above.
The foregoing describes in detail a content retrieval method, apparatus and computer readable storage medium according to embodiments of the present invention, and specific examples are set forth herein to illustrate the principles and implementations of the present invention, and the above examples are provided to assist in understanding the method and core ideas of the present invention, and meanwhile, to those skilled in the art, according to the ideas of the present invention, there are variations in the specific implementations and application scope, and in light of the above, the disclosure should not be construed as limiting the scope of the present invention.

Claims (16)

1. A content retrieval method, comprising:
obtaining content to be retrieved for retrieving target content;
when the content to be retrieved is video content, performing multimodal feature extraction on the video content to obtain modal features of each modality, the modal features being feature information corresponding to each modality in the video content, the video content containing multiple modalities, and the multiple modalities including actions, audio, scenes, faces, and/or entities;
performing feature extraction on the modal features of each modality respectively to obtain modal content features of each modality, the modal content features being the overall features of the content of each modality and being used to indicate the content features under each modality;
fusing the modal content features to obtain video features of the video content, and retrieving, according to the video features, the target text content corresponding to the video content from a preset content set;
wherein performing multimodal feature extraction on the video content to obtain the modal features of each modality comprises:
using a trained content retrieval model to perform multimodal feature extraction on the video content to obtain initial modal features of each modality in the video content;
extracting video frames from the video content, and performing multimodal feature extraction on the video frames using the trained content retrieval model to obtain basic modal features of each video frame;
screening out the target modal features corresponding to each modality from the basic modal features, and fusing the target modal features with the corresponding initial modal features to obtain the modal features corresponding to the video content of each modality.

2. The content retrieval method according to claim 1, wherein performing feature extraction on the modal features of each modality to obtain the modal video features of each modality comprises:
identifying the target video feature extraction network corresponding to each modality in the video feature extraction networks of the trained content retrieval model;
performing feature extraction on the modal features using the target video feature extraction network to obtain the modal video features of each modality.

3. The content retrieval method according to claim 1, wherein, before using the trained content retrieval model to perform multimodal feature extraction on the video content to obtain the initial modal features of each modality in the video content, the method further comprises:
acquiring a content sample set, the content sample set comprising video samples and text samples, each text sample comprising at least one text word;
using a preset content retrieval model to perform multimodal feature extraction on the video sample to obtain sample modal features of each modality;
performing feature extraction on the sample modal features of each modality respectively to obtain sample modal content features of the video sample, and fusing the sample modal content features to obtain sample video features of the video sample;
performing feature extraction on the text sample to obtain sample text features and text word features corresponding to each text word, and converging the preset content retrieval model according to the sample modal content features, sample video features, sample text features and text word features to obtain the trained content retrieval model.

4. The content retrieval method according to claim 3, wherein converging the preset content retrieval model according to the sample modal content features, sample video features, sample text features and text word features to obtain the trained content retrieval model comprises:
determining feature loss information of the content sample set according to the sample modal content features and the text word features;
determining content loss information of the content sample set based on the sample video features and the sample text features;
fusing the feature loss information and the content loss information, and converging the preset content retrieval model based on the fused loss information to obtain the trained content retrieval model.

5. The content retrieval method according to claim 4, wherein determining the feature loss information of the content sample set according to the sample modal content features and the text word features comprises:
calculating the feature similarity between the sample modal content features and the text word features to obtain a first feature similarity;
determining the sample similarity between the video sample and the text sample according to the first feature similarity;
based on the sample similarity, calculating the feature distance between the video sample and the text sample to obtain the feature loss information of the content sample set.

6. The content retrieval method according to claim 5, wherein determining the sample similarity between the video sample and the text sample according to the first feature similarity comprises:
according to the first feature similarity, performing feature interaction between the sample modal content features and the text word features to obtain post-interaction video features and post-interaction text word features;
calculating the feature similarity between the post-interaction video features and the post-interaction text word features to obtain a second feature similarity;
fusing the second feature similarities to obtain the sample similarity between the video sample and the text sample.

7. The content retrieval method according to claim 6, wherein performing feature interaction between the sample modal content features and the text word features according to the first feature similarity to obtain the post-interaction video features and post-interaction text word features comprises:
normalizing the first feature similarity to obtain a target feature similarity;
determining, according to the target feature similarity, association weights of the sample modal content features, the association weights being used to indicate the association relationship between the sample modal content features and the text word features;
weighting the sample modal content features based on the association weights, and updating the text word features based on the weighted sample modal content features to obtain the post-interaction video features and post-interaction text word features.

8. The content retrieval method according to claim 7, wherein updating the text word features based on the weighted sample modal content features to obtain the post-interaction video features and post-interaction text word features comprises:
taking the weighted sample modal content features as initial post-interaction video features, and updating the text word features based on the initial post-interaction video features to obtain initial post-interaction text word features;
calculating the feature similarity between the initial post-interaction video features and the initial post-interaction text word features to obtain a third feature similarity;
updating the initial post-interaction video features and the initial post-interaction text word features according to the third feature similarity to obtain the post-interaction video features and post-interaction text word features.

9. The content retrieval method according to claim 8, wherein updating the initial post-interaction video features and the initial post-interaction text word features according to the third feature similarity to obtain the post-interaction video features and post-interaction text word features comprises:
according to the third feature similarity, performing feature interaction between the initial post-interaction video features and the initial post-interaction text word features to obtain target post-interaction video features and target post-interaction text word features;
taking the target post-interaction video features as the initial post-interaction video features, and taking the target post-interaction text word features as the initial post-interaction text word features;
returning to the step of calculating the feature similarity between the initial post-interaction video features and the initial post-interaction text word features, until the number of feature interactions between the initial post-interaction video features and the initial post-interaction text word features reaches a preset number, to obtain the post-interaction video features and post-interaction text word features.

10. The content retrieval method according to claim 5, wherein calculating the feature distance between the video sample and the text sample based on the sample similarity to obtain the feature loss information of the content sample set comprises:
obtaining a preset feature boundary value corresponding to the content sample set;
according to the sample similarity, screening out from the content sample set first content sample pairs in which the video sample matches the text sample and second content sample pairs in which the video sample does not match the text sample;
based on the preset feature boundary value, calculating the feature distance between the first content sample pair and the second content sample pair to obtain the feature loss information of the content sample set.

11. The content retrieval method according to claim 10, wherein calculating the feature distance between the first content sample pair and the second content sample pair based on the preset feature boundary value to obtain the feature loss information of the content sample set comprises:
screening out, from the second content sample pairs, the content sample pair with the largest sample similarity to obtain a target content sample pair;
calculating the similarity difference between the sample similarity of the first content sample pair and the sample similarity of the target content sample pair to obtain a first similarity difference;
fusing the preset feature boundary value with the first similarity difference to obtain the feature loss information of the content sample set.

12. The content retrieval method according to claim 4, wherein determining the content loss information of the content sample set based on the sample video features and the sample text features comprises:
calculating the feature similarity between the sample video features and the sample text features to obtain the content similarity between the video sample and the text sample;
according to the content similarity, screening out from the content sample set third content sample pairs in which the video sample matches the text sample and fourth content sample pairs in which the video sample does not match the text sample;
obtaining a preset content boundary value corresponding to the content sample set, and calculating the content difference between the third content sample pair and the fourth content sample pair according to the preset content boundary value to obtain the content loss information of the content sample set.

13. The content retrieval method according to claim 12, wherein calculating the content difference between the third content sample pair and the fourth content sample pair according to the preset content boundary value to obtain the content loss information of the content sample set comprises:
calculating the similarity difference between the content similarity of the third content sample pair and the content similarity of the fourth content sample pair to obtain a second similarity difference;
fusing the second similarity difference with the preset content boundary value to obtain the content difference between the third content sample pair and the fourth content sample pair;
normalizing the content difference to obtain the content loss information of the content sample set.

14. The content retrieval method according to claim 1, further comprising:
when the content to be retrieved is text content, performing feature extraction on the text content to obtain text features of the text content;
retrieving, according to the text features, the target video content corresponding to the text content from the preset content set.

15. A content retrieval apparatus, comprising:
A content retrieval device, comprising:获取单元,用于获取用于检索目标内容的待检索内容;An acquisition unit, used for acquiring content to be retrieved for retrieving target content;第一提取单元,用于当所述待检索内容为视频内容时,对所述视频内容进行多模态特征提取,得到每一模态的模态特征,所述模态特征为所述视频内容中每一模态对应的特征信息,所述视频内容中包含多个模态,所述多个模态包括描述动作、音频、场景、人脸和/或实体;A first extraction unit is used for, when the content to be retrieved is video content, performing multimodal feature extraction on the video content to obtain modal features of each modality, wherein the modal features are feature information corresponding to each modality in the video content, and the video content contains multiple modalities, and the multiple modalities include descriptions of actions, audio, scenes, faces and/or entities;第二提取单元,用于分别对每一模态的模态特征进行特征提取,得到每一模态的模态内容特征,所述模态内容特征为每一模态内容的总体特征,用于指示每一模态下的内容特征;A second extraction unit is used to extract features from the modal features of each modality to obtain modal content features of each modality, where the modal content features are overall features of each modal content and are used to indicate content features under each modality;文本检索单元,用于将所述模态内容特征进行融合,得到所述视频内容的视频特征,并根据所述视频特征,在预设内容集合中检索出所述视频内容对应的目标文本内容;A text retrieval unit, configured to fuse the modal content features to obtain video features of the video content, and retrieve target text content corresponding to the video content from a preset content set based on the video features;其中,所述对所述视频内容进行多模态特征提取,得到每一模态的模态特征,包括:The step of extracting multimodal features from the video content to obtain modal features of each modality includes:采用训练后内容检索模型对所述视频内容进行多模态特征提取,得到所述视频内容中每一模态的初始模态特征;Using the trained content retrieval model to perform multimodal feature extraction on the video content to obtain initial modal features of each modality in the video content;在所述视频内容中提取出视频帧,并采用所述训练后内容检索模型对所述视频帧进行多模态特征提取,得到每一视频帧的基础模态特征;Extracting video frames from the video content, and performing multimodal feature extraction on the video frames using the trained content retrieval model to obtain basic modal features of each video frame;在所述基础模态特征中筛选出每一模态对应的目标模态特征,并将所述目标模态特征和对应的初始模态特征进行融合,得到每一模态的视频内容对应的模态特征。The target modal features corresponding to each modality are screened out from the basic modal features, and the target modal features are fused with the corresponding initial modal features to obtain the modal features corresponding to the video content of each modality.16.一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有多条指令,所述指令适于处理器进行加载,以执行权利要求1至14任一项所述的内容检索方法中的步骤。16. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a plurality of instructions, wherein the instructions are suitable for being loaded by a processor to execute the steps in the content retrieval method according to any one of claims 1 to 14.
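The retrieval flow recited in claims 1 and 15 (per-modality feature extraction, aggregation into modal content features, fusion into a single video feature, and similarity search over a preset content set) can be sketched as follows. This is a minimal illustration rather than the patented implementation: mean pooling for aggregation and fusion, L2 normalization, and cosine-similarity ranking are assumptions, and the per-modality extractor outputs are simply taken as given inputs.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Guard against all-zero vectors.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def modal_content_feature(frame_features):
    """Aggregate per-frame features of one modality (scene, audio, face, ...) into one modal content feature."""
    return l2_normalize(np.mean(frame_features, axis=0))

def video_feature(modal_content_features):
    """Fuse the modal content features of all modalities into a single video feature."""
    return l2_normalize(np.mean(np.stack(modal_content_features, axis=0), axis=0))

def retrieve_text(video_feat, candidate_text_feats, top_k=5):
    """Rank candidate text features from the preset content set by cosine similarity."""
    candidate_text_feats = l2_normalize(candidate_text_feats)
    sims = candidate_text_feats @ video_feat
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]

# Hypothetical usage: three modalities, 8 frames each, 128-dim features, 1000 candidate texts.
per_modality_frames = [np.random.rand(8, 128) for _ in range(3)]
v = video_feature([modal_content_feature(f) for f in per_modality_frames])
text_bank = np.random.rand(1000, 128)
indices, scores = retrieve_text(v, text_bank)
```

The same video feature can be reused for the text-to-video direction of claim 14 by ranking stored video features against a text feature instead.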
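The feature interaction of claims 6 to 9 reads as an iterative cross-attention between sample modal content features and text word features: similarities are normalized into association weights, the modal content features are weighted, the text word features are updated, and the exchange is repeated a preset number of times. The sketch below is one plausible instantiation under that reading; the averaging update rule and the fusion of per-word similarities by their mean are assumptions, not formulas taken from the claims.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def feature_interaction(modal_feats, word_feats, num_rounds=2):
    """modal_feats: (M, d) sample modal content features; word_feats: (T, d) text word features."""
    video_feats = modal_feats
    for _ in range(num_rounds):                       # preset number of feature interactions
        sim = word_feats @ video_feats.T              # feature similarity between words and video features
        attn = softmax(sim, axis=-1)                  # normalized similarities act as association weights
        attended_video = attn @ video_feats           # weighted (post-interaction) video features per word
        word_feats = 0.5 * (word_feats + attended_video)  # update text word features (assumed rule)
        video_feats = attended_video                  # carry interacted features into the next round
    return video_feats, word_feats

def sample_similarity(video_feats, word_feats):
    """Fuse per-word similarities into one sample similarity between a video sample and a text sample."""
    per_word = np.sum(video_feats * word_feats, axis=-1) / (
        np.linalg.norm(video_feats, axis=-1) * np.linalg.norm(word_feats, axis=-1) + 1e-12)
    return per_word.mean()
```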
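Claims 10 to 13 describe margin-style losses built from matched and non-matching sample pairs with preset boundary values, which claim 4 fuses into a single training objective. A hedged sketch follows, assuming a hinge form with the hardest non-matching pair for the feature loss and a summed hinge over non-matching pairs for the content loss; the exact loss formulas are not specified in the claims, and the weighting coefficients are placeholders.

```python
import numpy as np

def feature_loss(sim_matrix, margin=0.2):
    """sim_matrix[i, j]: sample similarity between video i and text j; diagonal pairs are the matched pairs."""
    n = sim_matrix.shape[0]
    pos = np.diag(sim_matrix)
    neg_mask = ~np.eye(n, dtype=bool)
    hardest_neg_t = np.max(np.where(neg_mask, sim_matrix, -np.inf), axis=1)  # hardest text negative per video
    hardest_neg_v = np.max(np.where(neg_mask, sim_matrix, -np.inf), axis=0)  # hardest video negative per text
    loss = np.maximum(0.0, margin - (pos - hardest_neg_t)) + \
           np.maximum(0.0, margin - (pos - hardest_neg_v))
    return loss.mean()

def content_loss(content_sim, margin=0.2):
    """content_sim[i, j]: content similarity between sample video feature i and sample text feature j."""
    n = content_sim.shape[0]
    pos = np.diag(content_sim)[:, None]
    hinge = np.maximum(0.0, margin - pos + content_sim)   # boundary value applied to similarity differences
    hinge[np.eye(n, dtype=bool)] = 0.0                    # matched pairs contribute no penalty
    return hinge.mean()                                   # mean acts as the normalization step

def total_loss(sim_matrix, content_sim, alpha=1.0, beta=1.0):
    """Fused loss used to converge the preset content retrieval model (weights alpha, beta are assumptions)."""
    return alpha * feature_loss(sim_matrix) + beta * content_loss(content_sim)
```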
CN202110733613.6A | 2021-06-30 | 2021-06-30 | A content retrieval method, device and computer-readable storage medium | Active | CN113821687B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110733613.6A (CN113821687B) | 2021-06-30 | 2021-06-30 | A content retrieval method, device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110733613.6A (CN113821687B) | 2021-06-30 | 2021-06-30 | A content retrieval method, device and computer-readable storage medium

Publications (2)

Publication Number | Publication Date
CN113821687A (en) | 2021-12-21
CN113821687B (en) | 2025-06-27

Family

ID=78924061

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110733613.6A (Active, CN113821687B) | A content retrieval method, device and computer-readable storage medium | 2021-06-30 | 2021-06-30

Country Status (1)

Country | Link
CN (1) | CN113821687B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114461987B (en) * | 2022-01-19 | 2025-06-27 | Tencent Technology (Shenzhen) Co., Ltd. | Content detection method, device and computer readable storage medium
CN116756676B (en) * | 2022-03-03 | 2025-01-21 | Tencent Technology (Shenzhen) Co., Ltd. | A method for generating a summary and a related device
CN117033720A (en) * | 2022-09-01 | 2023-11-10 | Tencent Technology (Shenzhen) Co., Ltd. | Model training method, device, computer equipment and storage medium
CN115964467B (en) * | 2023-01-02 | 2025-08-12 | Northwestern Polytechnical University | Visual context-fused rich-semantic dialogue generation method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111930999A (en) * | 2020-07-21 | 2020-11-13 | Shandong Artificial Intelligence Institute | Method for implementing text query and positioning video clip by frame-by-frame cross-modal similarity correlation

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US10417498B2 (en) * | 2016-12-30 | 2019-09-17 | Mitsubishi Electric Research Laboratories, Inc. | Method and system for multi-modal fusion model
CN109783657B (en) * | 2019-01-07 | 2022-12-30 | Peking University Shenzhen Graduate School | Multi-step self-attention cross-media retrieval method and system based on limited text space
CN110162669B (en) * | 2019-04-04 | 2021-07-02 | Tencent Technology (Shenzhen) Co., Ltd. | Video classification processing method and device, computer equipment and storage medium
CN111221984B (en) * | 2020-01-15 | 2024-03-01 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Multi-mode content processing method, device, equipment and storage medium
CN111309971B (en) * | 2020-01-19 | 2022-03-25 | Zhejiang Gongshang University | Multi-level coding-based text-to-video cross-modal retrieval method
CN112738556B (en) * | 2020-12-22 | 2023-03-31 | Shanghai Hode Information Technology Co., Ltd. | Video processing method and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111930999A (en) * | 2020-07-21 | 2020-11-13 | Shandong Artificial Intelligence Institute | Method for implementing text query and positioning video clip by frame-by-frame cross-modal similarity correlation

Also Published As

Publication number | Publication date
CN113821687A (en) | 2021-12-21

Similar Documents

Publication | Publication Date | Title
CN113762322B (en) | Video classification method, device and equipment based on multi-modal representation and storage medium
CN113821687B (en) | A content retrieval method, device and computer-readable storage medium
CN113761220B (en) | Information acquisition method, device, equipment and storage medium
CN114282059B (en) | Video retrieval method, device, equipment and storage medium
CN114741581A (en) | Image classification method and device, computer equipment and medium
CN114358109B (en) | Feature extraction model training, sample retrieval method, device and computer equipment
CN114329029B (en) | Object retrieval method, device, equipment and computer storage medium
CN113704528B (en) | Cluster center determining method, device and equipment and computer storage medium
CN113128526B (en) | Image recognition method and device, electronic equipment and computer-readable storage medium
CN115129908B (en) | A model optimization method, device, equipment, storage medium and program product
CN112765387A (en) | Image retrieval method, image retrieval device and electronic equipment
CN111783903A (en) | Text processing method, text model processing method and device and computer equipment
CN114328800B (en) | Text processing method, device, electronic device and computer-readable storage medium
CN117609418B (en) | Document processing method, device, electronic device and storage medium
CN117033720A (en) | Model training method, device, computer equipment and storage medium
CN116955707A (en) | Content tag determination methods, devices, equipment, media and program products
CN114329004A (en) | Digital fingerprint generation method, digital fingerprint generation device, data push method, data push device and storage medium
CN116910201A (en) | Dialog data generation method and related equipment
CN116955599B (en) | A method for determining a category, a related device, an apparatus, and a storage medium
CN118035945A (en) | Label recognition model processing method and related device
CN120296408A (en) | A data labeling and model training method and related device
CN115114545B (en) | Content processing method and related equipment
CN114708429B (en) | Image processing method, device, computer equipment and computer readable storage medium
CN116975451A (en) | Article recommendation method, device, equipment and storage medium
CN118779478A (en) | Video text retrieval method, device, electronic device and storage medium

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
