CN113806588A - Method and device for searching video - Google Patents

Method and device for searching video

Info

Publication number
CN113806588A
Authority
CN
China
Prior art keywords
video
candidate
features
user
videos
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111104970.2A
Other languages
Chinese (zh)
Other versions
CN113806588B (en)
Inventor
冯博豪
刘雨鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111104970.2A
Publication of CN113806588A
Application granted
Publication of CN113806588B
Legal status: Active (current)
Anticipated expiration

Abstract

Translated from Chinese


The present disclosure provides a method and an apparatus for searching video, which relate to the field of artificial intelligence, and in particular, to the technical fields of intelligent search and video. The specific implementation scheme is: acquiring the video segment to be searched; acquiring the video tag of the video segment; extracting the video feature from the video segment; selecting the target video from the candidate video set based on the video tag and the video feature for output. This implementation improves the speed and accuracy of video searches.


Description

Method and device for searching video
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to the field of intelligent search and video technologies, and in particular, to a method and an apparatus for searching a video.
Background
With the advent of the internet age, the amount of information on networks has grown explosively. With the rapid development of information technology, large amounts of data such as text, images, audio and video are published and transmitted on information networks every day. Visual data generally comes from various social websites and mobile phone applications; these social services have hundreds of millions of users, and people share and transmit images and videos socially. The shared visual data often has different subjects, types, labels and meanings. Such huge and complicated data brings rich content, but also poses great challenges to information retrieval.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, storage medium, and computer program product for searching for a video.
According to a first aspect of the present disclosure, there is provided a method of searching for a video, including: acquiring a video clip to be searched; acquiring a video label of the video clip; extracting video features from the video clips; and selecting a target video from the candidate video set for output based on the video label and the video characteristics.
According to a second aspect of the present disclosure, there is provided an apparatus for searching for a video, including: a first acquisition unit configured to acquire a video clip to be searched; a second acquisition unit configured to acquire a video tag of the video clip; a feature extraction unit configured to extract video features from the video segments; an output unit configured to select a target video from a candidate video set for output based on the video tag and the video feature.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect.
According to the method and the device for searching the video, the video tag and the video feature are extracted for matching search, the video search range is narrowed, and the search speed and accuracy are improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram for one embodiment of a method of searching videos, according to the present disclosure;
FIGS. 3a-3c are schematic diagrams of application scenarios of the method of searching for video according to the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of a method of searching for videos, according to the present disclosure;
FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for searching for video according to the present disclosure;
FIG. 6 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the method of searching for video or the apparatus for searching for video of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a search application, a video playing application, a web browser application, a shopping application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting video playing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III, MPEG compression standard Audio Layer 3), MP4 players (Moving Picture Experts Group Audio Layer IV, MPEG compression standard Audio Layer 4), laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. They may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module, and are not particularly limited herein.
The server 105 may be a server providing various services, such as a background search server providing support for videos displayed on the terminal devices 101, 102, 103. The background search server can analyze and process the received data, such as a video search request, and feed back the search result to the terminal device.
The server 105 is provided with a video search system, which comprises an application display layer, a core processing layer and a data storage layer.
The application display layer mainly provides a visual interface for the user and handles the interaction between the user and the system. The core function of the system is retrieval; videos can be retrieved using audio and video clips. The user uploads the content to be queried to the core processing layer. After the core processing layer finishes processing, the returned search results are displayed to the user in a list.
The middle layer is the core processing layer, which comprises functions such as multi-modal data feature extraction, feature transformation and similarity search. The core processing layer first receives the original information transmitted from the application display layer, extracts feature representations through a feature extraction algorithm, then calculates the similarity among the multi-modal data, retrieves data in the database similar to the content to be retrieved, and generates a ranked list according to the similarity.
The data storage layer stores the search data, the model file and the search record into a database.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein. The server may also be a server of a distributed system, or a server incorporating a blockchain. The server can also be a cloud server, or an intelligent cloud computing server or an intelligent cloud host with artificial intelligence technology.
It should be noted that the method for searching for a video provided by the embodiment of the present disclosure is generally performed by the server 105, and accordingly, the apparatus for searching for a video is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method of searching for video in accordance with the present disclosure is shown. The method for searching the video comprises the following steps:
step 201, obtaining a video clip to be searched.
In this embodiment, the execution subject of the method for searching for a video (for example, the server shown in fig. 1) may receive, through a wired or wireless connection, a search request containing the video clip to be searched from the terminal with which the user plays videos. Besides image frames, the video clip may also include subtitles and audio.
Step 202, obtaining a video label of the video clip.
In this embodiment, the video clip itself may carry a video tag, e.g., the name of the video clip. The video tag may also be input by the user. Or the video tags may be automatically generated by the search system.
And step 203, extracting video features from the video clips.
In the present embodiment, if the video clip contains only image frames, image features may be extracted from the video clip as the video features. If the video clip also includes subtitles (or other text content) or audio, text features or audio features can also be extracted from it. The image features can be fused with at least one of the text features and the audio features to obtain the video features: image and text features may be fused, image and audio features may be fused, or image, text and audio features may all be fused.
Image features may be extracted for every video frame. Alternatively, video frames may be sampled at a certain time interval and similar frames filtered out to obtain key frames; then only the image features and text features of the key frames, and the audio features of the audio clips between key frames, are extracted.
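As an illustration of this key-frame step, the following is a minimal sketch in Python (an assumption on our part: OpenCV frame differencing is used as the similar-frame filter; the disclosure does not name a specific filtering method, and the sampling interval and threshold are illustrative).

import cv2

def extract_key_frames(video_path, sample_every=30, diff_thresh=30.0):
    cap = cv2.VideoCapture(video_path)
    key_frames, prev_gray, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:                       # cut frames at a fixed interval
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # Keep the frame only if it differs enough from the previous key frame.
            if prev_gray is None or cv2.absdiff(gray, prev_gray).mean() > diff_thresh:
                key_frames.append(frame)
                prev_gray = gray
        idx += 1
    cap.release()
    return key_frames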
Image features of the video frames can be extracted through a deep neural network. For example, image feature extraction may employ a VGGNet network model. The network model has 5 convolutional layers and 3 fully connected layers. The first 7 layers use ReLU as the activation function, and the 8th layer uses the identity function as the activation function. The output of the network is the image feature vector. The specific calculation is as follows. For a convolutional layer, the output is y = F((W_c * X) + b_c), where X is the input of the layer, W_c is the convolution kernel, b_c is the bias, F is the activation function, and * denotes the convolution operation. For a fully connected layer, the output is y = G((W_f · X) + b_f), where X is the input of the layer, W_f is the weight matrix, b_f is the bias, and G is the activation function. An input image passes through the convolutional layers and the fully connected layers to obtain its image features.
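A minimal sketch of the per-frame feature extraction described above, assuming PyTorch and torchvision as the stack (the disclosure names VGGNet but no framework; the torchvision vgg16 checkpoint and the 4096-dimensional output are stand-ins, not prescribed by the text).

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.eval()
# Keep everything up to the penultimate fully connected layer as the feature extractor.
feature_extractor = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(), *list(vgg.classifier[:-1])
)

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_feature(frame: Image.Image) -> torch.Tensor:
    # Return a 4096-dimensional feature vector for one key frame.
    x = preprocess(frame).unsqueeze(0)          # (1, 3, 224, 224)
    with torch.no_grad():
        return feature_extractor(x).squeeze(0)  # (4096,)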
Audio feature extraction process: the audio can be transferred from the time domain to the frequency domain through Mel Frequency Cepstrum Coefficients (MFCC), and then denoised, smoothed and further represented through an audio network, so that effective audio features are extracted and the feature dimensionality is reduced. Let S denote the speech signal. The MFCC extraction process can be expressed as v_mfcc = MFCC(S), where v_mfcc represents the MFCC features of the speech. The MFCC features v_mfcc are then used as the input of the audio network. Audio features can be extracted with an AudioNet, which comprises 3 convolutional layers, 1 pooling layer and 1 fully connected layer; AudioNet further represents the Mel frequency cepstrum coefficients (MFCC).
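A minimal sketch of this audio pipeline, assuming librosa for the MFCC step and PyTorch for an AudioNet-like network (the layer widths and the 128-dimensional output are illustrative; the disclosure only states the 3 conv / 1 pool / 1 fully connected layout).

import librosa
import torch
import torch.nn as nn

class AudioNet(nn.Module):
    # 3 convolutional layers, 1 pooling layer, 1 fully connected layer.
    def __init__(self, out_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.fc = nn.Linear(64 * 4 * 4, out_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

def audio_feature(wav_path: str, sr: int = 16000) -> torch.Tensor:
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)   # v_mfcc = MFCC(S)
    x = torch.tensor(mfcc, dtype=torch.float32)[None, None]   # (1, 1, 40, T)
    with torch.no_grad():
        return AudioNet()(x).squeeze(0)                       # (128,)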
Text information is recognized from the video clip through a text recognition model, and text features are extracted from the text information through a pre-trained text feature extraction model. The text recognition model can use the FOTS (Fast Oriented Text Spotting) algorithm. FOTS is a fast end-to-end integrated detection and recognition framework, and is faster than two-stage methods. The overall structure of FOTS consists of four parts: a convolution sharing branch, a text detection branch, a RoIRotate (region-of-interest rotation) operation branch and a text recognition branch. The backbone of the convolution sharing network is ResNet-50, and the role of convolution sharing is to connect low-level feature maps with high-level semantic feature maps. The RoIRotate operation mainly converts a text block with an angular inclination into a horizontal text block through an affine transformation.
Compared with other text detection and recognition algorithms, FOTS has a small model, high speed and high precision, and supports multiple angles.
After the content or subtitles (text information) of the video are acquired through the text recognition model, text features are extracted from the text information through the text feature extraction model. The text feature extraction model may be a pre-trained language model, such as BERT (Bidirectional Encoder Representations from Transformers).
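A minimal sketch of the BERT-based text feature step, assuming the Hugging Face transformers library and the bert-base-chinese checkpoint (the disclosure names BERT but no specific checkpoint); the [CLS] embedding is used here as the sentence-level feature.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

def text_feature(subtitle_text: str) -> torch.Tensor:
    inputs = tokenizer(subtitle_text, return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert(**inputs)
    # Use the [CLS] token embedding as the text feature of the subtitle.
    return outputs.last_hidden_state[:, 0, :].squeeze(0)   # (768,)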
And step 204, selecting a target video from the candidate video set based on the video label and the video characteristics for outputting.
In the present embodiment, the candidate video set is stored in the database. Each candidate video is provided with a video tag and video features. The video features of the candidate videos are extracted in the same manner as in step 203. The video features of the candidate videos can be extracted and stored in the database in advance for later use.
The similarity between the video tag of the video clip and the video tag of each candidate video is calculated in turn, as is the matching degree between the video features of the video clip and the video features of each candidate video. Videos whose tag similarity is greater than a preset similarity threshold and whose feature matching degree is greater than a preset matching degree threshold are selected from the candidate video set and output to the user. Not only can similar videos be output; the starting point and the ending point of the video segment within the similar videos can also be located.
The video tags of the video segments and the video tags of the candidate videos can be converted into vectors respectively, and then cosine similarity (or distance based on other algorithms) between the vectors can be calculated.
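A minimal sketch of this tag-similarity computation, assuming the tags have already been embedded as dense vectors (for example by the BERT encoder sketched above).

import torch
import torch.nn.functional as F

def tag_similarity(query_tag_vec: torch.Tensor, candidate_tag_vec: torch.Tensor) -> float:
    # Cosine similarity between the clip's tag vector and a candidate's tag vector.
    return F.cosine_similarity(query_tag_vec.unsqueeze(0),
                               candidate_tag_vec.unsqueeze(0)).item()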
The degree of match between the video features of the video clip and the video features of a candidate video may be calculated by a matching model, which may also output the probabilities of the predicted starting and ending points.
The matching model may use a bidirectional LSTM (Long Short-Term Memory) network. As shown in fig. 3c, the specific implementation steps are as follows (a sketch follows the list):
1) Vector conversion. The video vector contains unimportant information, so the input vector can be transformed using an attention mechanism. After the transformation, the result represents the video better than the original video vector.
2) Bilinear matching. The short video vector (the video features of the video clip) and the video vectors in the candidate video library (the video features of the candidate videos) are input into the LSTM model for matching, obtaining the final video matching degree.
3) Positioning layer. According to the video matching result, the probability that each time point in the candidate video is a starting point or an ending point is predicted. In addition, the probability that a time point lies inside or outside the relevant video segment can be predicted. The prediction can be done using the LSTM plus a Softmax function.
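A minimal sketch of such a matching model, assuming PyTorch; the feature dimension, attention head count and output heads are illustrative, since the disclosure only specifies attention-based vector conversion, bidirectional LSTM matching and softmax-based start/end prediction.

import torch
import torch.nn as nn

class VideoMatcher(nn.Module):
    def __init__(self, feat_dim: int = 1024, hidden: int = 256):
        super().__init__()
        # 1) Vector conversion: attention re-weights the candidate timeline
        #    using the query clip, suppressing unimportant information.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        # 2) Matching through a bidirectional LSTM.
        self.lstm = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.match_head = nn.Linear(2 * hidden, 1)    # overall matching degree
        # 3) Positioning layer: start / end / inside logits per time point.
        self.locate_head = nn.Linear(2 * hidden, 3)

    def forward(self, clip_feats: torch.Tensor, cand_feats: torch.Tensor):
        # clip_feats: (B, Tq, feat_dim), cand_feats: (B, Tc, feat_dim)
        attended, _ = self.attn(cand_feats, clip_feats, clip_feats)
        out, _ = self.lstm(attended)                                    # (B, Tc, 2*hidden)
        match_score = torch.sigmoid(self.match_head(out.mean(dim=1)))   # (B, 1)
        locate_probs = torch.softmax(self.locate_head(out), dim=-1)     # (B, Tc, 3)
        return match_score, locate_probs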
The method provided by this embodiment of the disclosure allows the user to search for videos using audio and video clips. The search results integrate audio, text and image information, so the search accuracy is higher.
In some optional implementations of this embodiment, obtaining the video tag of the video clip includes: extracting at least one key frame from the video clip; for each key frame, outputting the description information of the key frame through a picture description model; extracting candidate video tags from the description information of each key frame; and determining a preset number of candidate video labels with the most repeated times in the candidate video labels as the video labels of the video clips.
For an input video clip, description information of the video clip can be generated, so that the subsequent text information matching is facilitated. The specific flow of generating the description label of the video clip is as follows:
(1) Video frame cutting. The goal of video frame cutting is to obtain the key frames of the video. Video key frames are a collection of pictures that reflect the characteristics of the video content. Specifically, the video is first cut into frames, similar frames are then filtered out, and effective key frames are finally obtained.
(2) Describing the content of the key frames. This is an image captioning task, which can be done using the NCPIC model (an image captioning network model based on a compositional paradigm). The NCPIC model divides the process of generating a picture description into a semantic analysis part and a syntactic analysis part, and adds the internal structural information of the sentence in the syntactic analysis, so that the sentence better conforms to semantic rules; its effect is better than that of similar models on the image captioning task. The specific process of generating a picture description with the NCPIC model is as follows:
a) First, the objects in the picture are extracted through an object detection algorithm and formed into simple phrases, such as "football", "grass", "husky", "rose".
b) A sentence describing the objects in the picture is generated using connecting words from the corpus and the object information in the picture, for example, "the puppy plays football on the grass".
c) Whether the generated sentence conforms to grammatical rules is judged. If it is a reasonable sentence, it is output directly; if not, step b) is repeated and the connecting words are updated until a reasonable sentence is output.
(3) Candidate video tag generation. Candidate video tags are extracted from the description of each video key frame using the EmbedRank algorithm. Since the tags are extracted from the content of the video, they can describe the video content effectively.
(4) Forming the final tag set. The occurrences of the candidate tags across the video key frames are counted, and the top-N most frequently repeated candidate tags are taken as the final tags of the video.
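A minimal sketch of the tag-voting step (4), counting candidate tags across key frames and keeping the top-N most frequent ones.

from collections import Counter

def final_tags(candidate_tags_per_frame, top_n=5):
    # Count every candidate tag across all key frames and keep the top-N
    # most frequently repeated ones as the clip's final tags.
    counts = Counter(tag for tags in candidate_tags_per_frame for tag in tags)
    return [tag for tag, _ in counts.most_common(top_n)]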
Through the above processing, description tags can be generated for an input video clip. This solves the problem of video clips that have no tags, as well as the problems of inaccurate or incomplete tags, thereby improving the speed and accuracy of video search.
In some optional implementations of this embodiment, extracting video features from the video segment includes: extracting image features from the video clips; and/or extracting audio features from the video clips; and/or identifying text information from the video clip, and extracting text features from the text information.
The algorithms in step 203 may be used to extract the image features, audio features and text features. This enables video search based on image features and addresses the low retrieval accuracy of searching by video tags alone.
In some optional implementations of this embodiment, the method further includes: and performing feature fusion on at least two items in the extracted features to obtain video features.
The text features and image features obtained above may be fused. The feature fusion process is shown in fig. 3b. The specific steps are as follows (a sketch follows the list):
1. A cross attention mechanism is first used to combine one high-dimensional input vector (e.g., the image features) with another high-dimensional vector (e.g., the text features) to generate a 1024-dimensional hidden vector. In this way, the high-dimensional input data can be mapped to a lower dimension by the attention mechanism and then fed into a deep Transformer.
2. The Transformer converts the 1024-dimensional hidden vector into another hidden vector of the same size.
3. A fixed step size t is set, and the above process is repeated with the remaining high-dimensional input vector (the audio features) to obtain the final fused vector. The step size t determines the time span of the audio and can be set to the time interval between key frames.
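A minimal sketch of this fusion network, assuming PyTorch; the projection sizes, head count and Transformer depth are illustrative, and only the 1024-dimensional hidden size comes from the text above.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, img_dim=4096, txt_dim=768, audio_dim=128, hidden=1024):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def fuse(self, a, b):
        # Cross attention combines the two inputs into a 1024-dimensional
        # hidden sequence, which is then refined by the deep Transformer.
        attended, _ = self.cross_attn(a, b, b)
        return self.transformer(attended)

    def forward(self, img_feats, txt_feats, audio_feats):
        # Each input is a (batch, steps, dim) sequence sampled at step size t.
        hidden = self.fuse(self.img_proj(img_feats), self.txt_proj(txt_feats))
        fused = self.fuse(hidden, self.audio_proj(audio_feats))
        return fused.mean(dim=1)    # final fused video feature, 1024-dimensional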
The search is carried out through the fused video characteristics, and the search results integrate audio, text and image information, so that the search accuracy is higher.
In some optional implementations of this embodiment, after the video features are extracted from the video clip, the method further includes: inputting the video tag into a text classification model to obtain a first category probability; inputting the video features into a video classification model to obtain a second category probability; and determining the category of the video clip based on the first category probability and the second category probability. Optionally, the category of the video clip is determined based on a weighted sum of the first category probability and the second category probability. Optionally, if the category is a violation category, the search is ended and warning information is output.
Classifying videos into different categories can improve the accuracy of subsequent video retrieval, and can also intercept inappropriate videos to prevent them from being uploaded, realizing video quality inspection. Based on artificial intelligence technology, various types of undesirable content in videos, such as politics, pornography, vulgarity, violence and terrorism, military and police content, advertisements, nightclub scenes and the like, can be found accurately and efficiently.
The classification process comprises the following steps:
1. The video tags obtained in step 203 are classified with a text classification model (such as a TextCNN model) to obtain the first category probability P1.
2. The fused video features are classified with a video classification model (such as a fully connected layer) to obtain the second category probability P2 of the video.
3. The two probabilities are weighted to obtain the final probability of the video category, completing the classification of the video clip. If the classification result is an illegal video involving politics, pornography, vulgarity or the like, the search is ended and warning information is output.
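A minimal sketch of the weighted combination, assuming equal weights (the disclosure specifies a weighted sum but not the weight values).

import torch

def classify_clip(text_probs: torch.Tensor, video_probs: torch.Tensor,
                  w_text: float = 0.5, w_video: float = 0.5) -> int:
    # Weighted sum of P1 (text classifier) and P2 (video classifier); the
    # weight values are an assumption, the disclosure does not give them.
    final_probs = w_text * text_probs + w_video * video_probs
    return int(final_probs.argmax())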
In some optional implementations of this embodiment, selecting a target video from the candidate video set for output based on the video tag and the video features includes: filtering out candidate videos that do not match the category from the candidate video set to obtain a first target sub-candidate video set; calculating the text similarity between the video tag of each candidate video in the first target sub-candidate video set and the video tag of the video clip; filtering out candidate videos whose text similarity is less than a preset similarity threshold from the first target sub-candidate video set to obtain a second target sub-candidate video set; calculating the matching degree between the video features of each candidate video in the second target sub-candidate video set and the video features of the video clip; and determining candidate videos whose matching degree is greater than a preset matching degree threshold as the target video, and outputting the target video.
The video search process may incorporate video tags, video features, and video clip categories. The method comprises the following concrete steps:
and narrowing the range of the candidate videos through the categories of the videos.
And performing text similarity matching by using the video label and the video label description of the candidate video. Further narrowing the search.
And performing multi-mode matching on the video by utilizing the video characteristics. And obtaining a final video, wherein the video segments correspond to the starting point and the ending point in the video.
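A minimal sketch of this three-stage narrowing, assuming the tag_similarity function and VideoMatcher model sketched earlier, and hypothetical clip/candidate objects carrying category, tag_vec and features attributes (none of these names appear in the disclosure; the thresholds are illustrative).

def search(clip, candidates, matcher, sim_thresh=0.8, match_thresh=0.7):
    # 1) Narrow the candidate set by video category.
    stage1 = [c for c in candidates if c.category == clip.category]
    # 2) Narrow further by video-tag text similarity.
    stage2 = [c for c in stage1
              if tag_similarity(clip.tag_vec, c.tag_vec) >= sim_thresh]
    # 3) Multi-modal feature matching; keep candidates above the matching
    #    threshold and return the per-time-step start/end probabilities
    #    that locate the clip inside each matched video.
    results = []
    for c in stage2:
        score, locate_probs = matcher(clip.features, c.features)
        if score.item() > match_thresh:
            results.append((c, locate_probs))
    return results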
With continuing reference to figs. 3a-3c, which are schematic diagrams of application scenarios of the method of searching for video according to the present embodiment: in the application scenario of fig. 3a, a user inputs a video clip to a search engine (server) via a terminal device. The search engine cuts the video clip into frames and extracts key frames; a key frame of the video clip is shown in the figure. Subtitle content is extracted from the key frames. An audio clip covering the time period between the key frame and its preceding key frame (step size t) is also extracted. Image features are extracted from the key frames by the convolutional layers. Text features are extracted from the subtitle content by BERT. Audio features are extracted from the audio clip by AudioNet. Feature fusion is then performed through the network structure shown in fig. 3b. Feature one (the image features) and feature two (the text features) are converted into low-dimensional vectors through attention calculation and then fed into a deep Transformer to obtain intermediate features. The intermediate features and feature three (the audio features) are then converted into low-dimensional vectors through attention calculation and fed into the deep Transformer to obtain the final fused features, namely the video features. The video features of the candidate videos extracted in this manner are stored in the database. Candidate videos can be filtered in advance by video tag and category, narrowing the search range. The video features of the video clip are matched with the filtered candidate videos one by one. As shown in fig. 3c, the video features are vector-converted and then matched using the LSTM. The matched videos are found, and the starting point and ending point of the video segment in the candidate video are determined.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method of searching for videos is shown. The process 400 of the method for searching for a video includes the following steps:
step 401, obtaining a video clip to be searched.
Step 402, obtaining a video label of a video clip.
And step 403, extracting video features from the video clips.
And step 404, selecting a target video from the candidate video set to output based on the video label and the video characteristics.
Steps 401-404 are substantially the same as steps 201-204, and therefore their description is omitted.
Step 405, analyzing the user's preference according to the user's search record, and storing the video of the user's preference.
In this embodiment, the search record includes the video tags and video features of the video clips searched by the user. The video tags of the video clips searched by the user, and the video tags of videos whose video features match them closely, can be analyzed from the search records to determine the user's preference, such as pet videos. Videos whose video tags are similar to, and whose video features closely match, those of the videos searched by the user are saved.
Step 406, selecting a first predetermined number of videos from the videos preferred by the user for recommendation.
In the embodiment, a part of videos preferred by the user is selected and recommended to the user.
Step 407, analyzing the similar users of the user according to the search records of the user, and saving the videos watched by the similar users.
In this embodiment, the search record includes the video tags and video features of the video clips searched by the user. Users with interest preferences similar to the target user are found through the search records (for example, users who search for the same comedy short videos, found through video feature analysis, or users whose searched videos contain the same actors, found through video tag analysis), and all videos watched by those users are saved.
And step 408, selecting a second preset number of videos from the videos watched by the similar users for recommendation.
In this embodiment, videos that were not selected in step 406 are selected from the videos watched by similar users for recommendation. That is, the final set of recommended videos is the union of the videos saved in step 405 and the videos obtained in step 407; this union is the set of videos finally pushed to the user. The recommendation set may be generated and sent to the user after both selection modes have been applied.
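A minimal sketch of combining the two recommendation sources, assuming hypothetical helpers videos_liked_by and videos_watched_by_similar_users that stand in for the search-record analysis of steps 405 and 407 (these names do not appear in the disclosure; the counts are illustrative).

def recommend(user, n_preferred=10, n_similar=10):
    preferred = videos_liked_by(user)                   # analysis of step 405
    similar = videos_watched_by_similar_users(user)     # analysis of step 407
    picks = preferred[:n_preferred]                     # selection of step 406
    # Step 408: add videos from similar users that were not already selected.
    picks += [v for v in similar if v not in picks][:n_similar]
    return picks                                        # union finally pushed to the user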
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for searching for a video in the present embodiment embodies the steps of video recommendation. Therefore, the scheme described in the embodiment not only includes a video search function, but also provides a video recommendation function, and can recommend the same type of video to the user.
In some optional implementations of this embodiment, the method further includes: receiving feedback information from the user; and recommending videos again according to the feedback information. The process of user feedback is effectively a labeling process. The functions of all modules in the system can be optimized through the user's labels. Through user feedback, the system can judge the user's search intent more accurately, and the recommendations match better. The feedback information may be metrics such as the user's click-through count (or rate), browsing duration, favorite count (or rate), and like count (or rate).
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of an apparatus for searching for a video, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the apparatus 500 for searching for a video of the present embodiment includes: a first acquisition unit 501, a second acquisition unit 502, a feature extraction unit 503, and an output unit 504. The first acquisition unit 501 is configured to acquire a video clip to be searched; the second acquisition unit 502 is configured to acquire a video tag of the video clip; the feature extraction unit 503 is configured to extract video features from the video clip; and the output unit 504 is configured to select a target video from the candidate video set for output based on the video tag and the video features.
In this embodiment, the specific processing of the first acquisition unit 501, the second acquisition unit 502, the feature extraction unit 503 and the output unit 504 of the apparatus 500 for searching for a video may refer to step 201, step 202, step 203 and step 204 in the corresponding embodiment of fig. 2.
In some optional implementations of the present embodiment, the second acquisition unit 502 is further configured to: extract at least one key frame from the video clip; for each key frame, output the description information of the key frame through a picture description model; extract candidate video tags from the description information of each key frame; and determine a preset number of candidate video tags with the most repetitions among the candidate video tags as the video tags of the video clip.
In some optional implementations of this embodiment, the feature extraction unit 503 is further configured to: extract image features from the video clip; and/or extract audio features from the video clip; and/or identify text information from the video clip and extract text features from the text information.
In some optional implementations of this embodiment, the apparatus 500 further comprises a fusion unit (not shown in the drawings) configured to: perform feature fusion on at least two of the extracted features to obtain the video features.
In some optional implementations of this embodiment, the apparatus 500 further comprises a classification unit (not shown in the drawings) configured to: after the video features are extracted from the video clip, input the video tag into a text classification model to obtain a first category probability; input the video features into a video classification model to obtain a second category probability; and determine a category of the video clip based on the first category probability and the second category probability.
In some optional implementations of this embodiment, the output unit 504 is further configured to: filter out candidate videos that do not match the category from the candidate video set to obtain a first target sub-candidate video set; calculate the text similarity between the video tag of each candidate video in the first target sub-candidate video set and the video tag of the video clip; filter out candidate videos whose text similarity is less than a preset similarity threshold from the first target sub-candidate video set to obtain a second target sub-candidate video set; calculate the matching degree between the video features of each candidate video in the second target sub-candidate video set and the video features of the video clip; and determine candidate videos whose matching degree is greater than a preset matching degree threshold as the target video, and output the target video.
In some optional implementations of this embodiment, the apparatus 500 further comprises a first recommending unit (not shown in the drawings) configured to: analyze the user's preference according to the user's search record, and save the videos preferred by the user, wherein the search record comprises video tags and video features of the video clips searched by the user; and select a first preset number of videos from the videos preferred by the user for recommendation.
In some optional implementations of this embodiment, the apparatus 500 further comprises a second recommending unit (not shown in the drawings) configured to: analyze similar users of the user according to the search records, and save the videos watched by the similar users, wherein the search records comprise video tags and video features of the video clips searched by the user; and select a second preset number of videos from the videos watched by the similar users for recommendation.
In some optional implementations of the present embodiment, the apparatus 500 further comprises a feedback modification unit (not shown in the drawings) configured to: receive feedback information from the user; and recommend videos again according to the feedback information.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of flow 200 or 400.
A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of flow 200 or 400.
A computer program product comprising a computer program which, when executed by a processor, implements the method of flow 200 or 400.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the respective methods and processes described above, such as the method of searching for a video. For example, in some embodiments, the method of searching for videos may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the method of searching for video described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured by any other suitable means (e.g., by means of firmware) to perform the method of searching for video.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (21)

Translated fromChinese
1.一种搜索视频的方法,包括:1. A method of searching for a video, comprising:获取待搜索的视频片段;Get the video clip to be searched;获取所述视频片段的视频标签;obtaining the video tag of the video clip;从所述视频片段中提取出视频特征;extracting video features from the video segment;基于所述视频标签和所述视频特征从候选视频集合选择目标视频进行输出。The target video is selected from the candidate video set for output based on the video tag and the video feature.2.根据权利要求1所述的方法,其中,所述获取所述视频片段的视频标签,包括:2. The method according to claim 1, wherein the obtaining the video tag of the video clip comprises:从所述视频片段中提取出至少一个关键帧;extracting at least one key frame from the video clip;对于每个关键帧,通过图片描述模型输出该关键帧的描述信息;For each key frame, output the description information of the key frame through the picture description model;从每个关键帧的描述信息中提取出候选视频标签;Extract candidate video tags from the description information of each key frame;将候选视频标签中重复次数最多的预定数目个候选视频标签确定为所述视频片段的视频标签。A predetermined number of candidate video tags with the largest number of repetitions among the candidate video tags are determined as the video tags of the video segment.3.根据权利要求1所述的方法,其中,所述从所述视频片段中提取出视频特征,包括:3. The method according to claim 1, wherein the extracting video features from the video segment comprises:从所述视频片段中提取出图像特征;和/或extracting image features from the video clip; and/or从所述视频片段中提取出音频特征;和/或extract audio features from the video clip; and/or从所述视频片段中识别出文本信息,并从所述文本信息中提取出文本特征。Text information is identified from the video clip, and text features are extracted from the text information.4.根据权利要求3所述的方法,其中,所述方法还包括:4. The method of claim 3, wherein the method further comprises:将提取出的特征中的至少两项进行特征融合,得到所述视频特征。Perform feature fusion on at least two of the extracted features to obtain the video features.5.根据权利要求1所述的方法,其中,在所述从所述视频片段中提取出视频特征之后,所述方法还包括:5. The method of claim 1, wherein, after the extracting video features from the video segment, the method further comprises:将所述视频标签输入文本分类模型,得到第一类别概率;Inputting the video tag into a text classification model to obtain the first category probability;将所述视频特征输入视频分类模型,得到第二类别概率;Inputting the video features into a video classification model to obtain a second category probability;基于所述第一类别概率和所述第二类别概率确定所述视频片段的类别。The category of the video segment is determined based on the first category probability and the second category probability.6.根据权利要求5所述的方法,其中,所述基于所述视频标签和所述视频特征从候选视频集合选择目标视频进行输出,包括:6. The method according to claim 5, wherein the selecting a target video from a candidate video set based on the video tag and the video feature for output, comprising:从候选视频集合中过滤掉与所述类别不匹配的候选视频,得到第一目标子候选视频集合;Filter out candidate videos that do not match the category from the candidate video set to obtain a first target sub-candidate video set;计算所述第一目标子候选视频集合中每个候选视频的视频标签与所述视频片段的视频标签的文本相似度;Calculate the text similarity between the video tag of each candidate video in the first target sub-candidate video set and the video tag of the video segment;从所述第一目标子候选视频集合中过滤掉文本相似度小于预定相似度阈值的候选视频,得到第二目标子候选视频集合;Filter out candidate videos whose text similarity is less than a predetermined similarity threshold from the first target sub-candidate video set to obtain a second target sub-candidate video set;计算所述第二目标子候选视频集合中每个候选视频的视频特征与所述视频片段的视频特征的匹配度;Calculate the degree of matching between the video feature of each candidate video in the second target sub-candidate video set and the video feature of the video segment;确定匹配度大于预定匹配度阈值的候选视频为所述目标视频,并进行输出。A candidate video whose matching degree is greater than a predetermined matching degree threshold is determined as the target video, and is output.7.根据权利要求1所述的方法,其中,所述方法还包括:7. 
The method of claim 1, wherein the method further comprises:根据用户的搜索记录分析所述用户的喜好,保存所述用户喜好的视频,其中,所述搜索记录包括所述用户搜索过的视频片段的视频标签和视频特征;Analyze the user's preference according to the user's search record, and save the video that the user prefers, wherein the search record includes video tags and video features of the video clips searched by the user;从所述用户喜好的视频中选择第一预定数目个视频进行推荐。A first predetermined number of videos are selected from the videos preferred by the user for recommendation.8.根据权利要求1所述的方法,其中,所述方法还包括:8. The method of claim 1, wherein the method further comprises:根据用户的搜索记录分析出所述用户的相似用户,保存所述相似用户看过的视频,其中,所述搜索记录包括所述用户搜索过的视频片段的视频标签和视频特征;Analyze similar users of the user according to the user's search records, and save the videos watched by the similar users, wherein the search records include video tags and video features of the video clips searched by the user;从所述相似用户看过的视频中选择第二预定数目个视频进行推荐。A second predetermined number of videos are selected from the videos watched by the similar users for recommendation.9.根据权利要求7或8所述的方法,其中,所述方法还包括:9. The method of claim 7 or 8, wherein the method further comprises:接收用户的反馈信息;Receive feedback from users;根据所述反馈信息重新推荐视频。The video is re-recommended according to the feedback information.10.一种搜索视频的装置,包括:10. A device for searching video, comprising:第一获取单元,被配置成获取待搜索的视频片段;a first acquiring unit, configured to acquire the video clip to be searched;第二获取单元,被配置成获取所述视频片段的视频标签;a second acquiring unit, configured to acquire the video tag of the video clip;特征提取单元,被配置成从所述视频片段中提取出视频特征;a feature extraction unit configured to extract video features from the video segment;输出单元,被配置成基于所述视频标签和所述视频特征从候选视频集合选择目标视频进行输出。An output unit configured to select a target video from a candidate video set for outputting based on the video tag and the video feature.11.根据权利要求10所述的装置,其中,所述第二获取单元进一步被配置成:11. The apparatus of claim 10, wherein the second obtaining unit is further configured to:从所述视频片段中提取出至少一个关键帧;extracting at least one key frame from the video clip;对于每个关键帧,通过图片描述模型输出该关键帧的描述信息;For each key frame, output the description information of the key frame through the picture description model;从每个关键帧的描述信息中提取出候选视频标签;Extract candidate video tags from the description information of each key frame;将候选视频标签中重复次数最多的预定数目个候选视频标签确定为所述视频片段的视频标签。A predetermined number of candidate video tags with the largest number of repetitions among the candidate video tags are determined as the video tags of the video segment.12.根据权利要求10所述的装置,其中,所述特征提取单元进一步被配置成:12. The apparatus of claim 10, wherein the feature extraction unit is further configured to:从所述视频片段中提取出图像特征;和/或extracting image features from the video clip; and/or从所述视频片段中提取出音频特征;和/或extract audio features from the video clip; and/or从所述视频片段中识别出文本信息,并从所述文本信息中提取出文本特征。Text information is identified from the video clip, and text features are extracted from the text information.13.根据权利要求12所述的装置,其中,所述装置还包括融合单元,被配置成:13. The apparatus of claim 12, wherein the apparatus further comprises a fusion unit configured to:将提取出的特征中的至少两项进行特征融合,得到所述视频特征。Perform feature fusion on at least two of the extracted features to obtain the video features.14.根据权利要求10所述的装置,其中,所述装置还包括分类单元,被配置成:14. 
The apparatus of claim 10, wherein the apparatus further comprises a classification unit configured to:在所述从所述视频片段中提取出视频特征之后,将所述视频标签输入文本分类模型,得到第一类别概率;After the video feature is extracted from the video segment, the video tag is input into a text classification model to obtain a first class probability;将所述视频特征输入视频分类模型,得到第二类别概率;Inputting the video features into a video classification model to obtain a second category probability;基于所述第一类别概率和所述第二类别概率确定所述视频片段的类别。The category of the video segment is determined based on the first category probability and the second category probability.15.根据权利要求14所述的装置,其中,所述输出单元进一步被配置成:15. The apparatus of claim 14, wherein the output unit is further configured to:从候选视频集合中过滤掉与所述类别不匹配的候选视频,得到第一目标子候选视频集合;Filter out candidate videos that do not match the category from the candidate video set to obtain a first target sub-candidate video set;计算所述第一目标子候选视频集合中每个候选视频的视频标签与所述视频片段的视频标签的文本相似度;Calculate the text similarity between the video tag of each candidate video in the first target sub-candidate video set and the video tag of the video segment;从所述第一目标子候选视频集合中过滤掉文本相似度小于预定相似度阈值的候选视频,得到第二目标子候选视频集合;Filter out candidate videos whose text similarity is less than a predetermined similarity threshold from the first target sub-candidate video set to obtain a second target sub-candidate video set;计算所述第二目标子候选视频集合中每个候选视频的视频特征与所述视频片段的视频特征的匹配度;Calculate the degree of matching between the video feature of each candidate video in the second target sub-candidate video set and the video feature of the video segment;确定匹配度大于预定匹配度阈值的候选视频为所述目标视频,并进行输出。A candidate video whose matching degree is greater than a predetermined matching degree threshold is determined as the target video, and is output.16.根据权利要求10所述的装置,其中,所述装置还包括第一推荐单元,被配置成:16. The apparatus of claim 10, wherein the apparatus further comprises a first recommendation unit configured to:根据用户的搜索记录分析所述用户的喜好,保存所述用户喜好的视频,其中,所述搜索记录包括所述用户搜索过的视频片段的视频标签和视频特征;Analyze the user's preference according to the user's search record, and save the video that the user prefers, wherein the search record includes video tags and video features of the video clips searched by the user;从所述用户喜好的视频中选择第一预定数目个视频进行推荐。A first predetermined number of videos are selected from the videos preferred by the user for recommendation.17.根据权利要求10所述的装置,其中,所述装置还包括第二推荐单元,被配置成:17. The apparatus of claim 10, wherein the apparatus further comprises a second recommendation unit configured to:根据用户的搜索记录分析出所述用户的相似用户,保存所述相似用户看过的视频;Analyze similar users of the user according to the user's search records, and save the videos watched by the similar users;从所述相似用户看过的视频中选择第二预定数目个视频进行推荐。A second predetermined number of videos are selected from the videos watched by the similar users for recommendation.18.根据权利要求16或17所述的装置,其中,所述装置还包括反馈修改单元,被配置成:18. The apparatus of claim 16 or 17, wherein the apparatus further comprises a feedback modification unit configured to:接收用户的反馈信息;Receive feedback from users;根据所述反馈信息重新推荐视频。The video is re-recommended according to the feedback information.19.一种电子设备,包括:19. 
19. An electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.

20. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method of any one of claims 1-9.

21. A computer program product, comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-9.
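To make the coarse-to-fine retrieval described in claims 14 and 15 easier to follow (category filtering, then tag-based text similarity, then feature matching), the following is a minimal illustrative sketch in Python. It is not the claimed implementation: the names (CandidateVideo, search_videos, fuse_category), the Jaccard and cosine similarity measures, the weighted-average fusion rule, and the threshold values are all assumptions introduced for demonstration only.

```python
from dataclasses import dataclass
from typing import List, Set

import numpy as np


@dataclass
class CandidateVideo:
    """A candidate video with a pre-computed category, tag set, and fused feature vector."""
    title: str
    category: str
    tags: Set[str]
    features: np.ndarray


def fuse_category(p_text: np.ndarray, p_video: np.ndarray, alpha: float = 0.5) -> int:
    """Combine the first (text) and second (video) category probabilities (claim 14).

    The weighted-average rule and the value of alpha are assumptions, not the patented method.
    """
    return int(np.argmax(alpha * p_text + (1.0 - alpha) * p_video))


def jaccard_similarity(tags_a: Set[str], tags_b: Set[str]) -> float:
    """Toy stand-in for the text similarity between two tag sets."""
    if not tags_a or not tags_b:
        return 0.0
    return len(tags_a & tags_b) / len(tags_a | tags_b)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Toy stand-in for the matching degree between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def search_videos(query_category: str,
                  query_tags: Set[str],
                  query_features: np.ndarray,
                  candidates: List[CandidateVideo],
                  similarity_threshold: float = 0.3,
                  matching_threshold: float = 0.8) -> List[CandidateVideo]:
    """Three-stage filtering in the spirit of claim 15."""
    # Stage 1: drop candidates whose category does not match the query clip's category.
    first_subset = [c for c in candidates if c.category == query_category]

    # Stage 2: drop candidates whose tag similarity falls below the similarity threshold.
    second_subset = [c for c in first_subset
                     if jaccard_similarity(c.tags, query_tags) >= similarity_threshold]

    # Stage 3: keep candidates whose feature matching degree exceeds the matching threshold.
    return [c for c in second_subset
            if cosine_similarity(c.features, query_features) > matching_threshold]
```

In the claimed apparatus, the query category would come from the classification unit of claim 14, and the tags and features from the second acquiring unit and the feature extraction unit; the toy similarity measures above merely stand in for whatever learned similarity an implementation might use.
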
Application CN202111104970.2A, priority date 2021-09-22, filing date 2021-09-22: Method and device for searching videos. Status: Active. Granted publication: CN113806588B (en).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111104970.2A (CN113806588B (en)) | 2021-09-22 | 2021-09-22 | Method and device for searching videos

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111104970.2A (CN113806588B (en)) | 2021-09-22 | 2021-09-22 | Method and device for searching videos

Publications (2)

Publication Number | Publication Date
CN113806588A | 2021-12-17
CN113806588B (en) | 2024-04-12

Family

ID=78896129

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202111104970.2A (Active, CN113806588B (en)) | Method and device for searching videos | 2021-09-22 | 2021-09-22

Country Status (1)

Country | Link
CN (1) | CN113806588B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2017114388A1 (en)* | 2015-12-30 | 2017-07-06 | 腾讯科技(深圳)有限公司 | Video search method and device
CN110121093A (en)* | 2018-02-06 | 2019-08-13 | 优酷网络技术(北京)有限公司 | The searching method and device of target object in video
CN109117777A (en)* | 2018-08-03 | 2019-01-01 | 百度在线网络技术(北京)有限公司 | The method and apparatus for generating information
US20210211784A1 (en)* | 2020-04-10 | 2021-07-08 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for retrieving teleplay content
CN112115299A (en)* | 2020-09-17 | 2020-12-22 | 北京百度网讯科技有限公司 | Video searching method and device, recommendation method, electronic device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang; Shen Binghu; Wang Lidong: "Research on Hierarchical Search and Recommendation Method of Video Programs", Computer Technology and Development, No. 07*

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114625918A (en)* | 2022-03-18 | 2022-06-14 | 腾讯科技(深圳)有限公司 | Video recommendation method, device, equipment, storage medium and program product
CN114625918B (en)* | 2022-03-18 | 2025-01-17 | 腾讯科技(深圳)有限公司 | Video recommendation method, device, equipment, storage medium and program product
CN115460459A (en)* | 2022-09-02 | 2022-12-09 | 百度时代网络技术(北京)有限公司 | Video generation method and device based on AI (Artificial Intelligence) and electronic equipment
CN115460459B (en)* | 2022-09-02 | 2024-02-27 | 百度时代网络技术(北京)有限公司 | Video generation method and device based on AI and electronic equipment
CN116775937A (en)* | 2023-05-19 | 2023-09-19 | 江西财经大学 | Video recommendation method and device based on micro-doctor big data and storage medium
CN116775937B (en)* | 2023-05-19 | 2024-04-26 | 厦门市美亚柏科信息股份有限公司 | Video recommendation method and device based on micro-doctor big data and storage medium
CN116628257A (en)* | 2023-07-25 | 2023-08-22 | 北京欣博电子科技有限公司 | Video retrieval method, device, computer equipment and storage medium
CN116628257B (en)* | 2023-07-25 | 2023-12-01 | 北京欣博电子科技有限公司 | Video retrieval method, device, computer equipment and storage medium
CN116775938A (en)* | 2023-08-15 | 2023-09-19 | 腾讯科技(深圳)有限公司 | Method, device, electronic equipment and storage medium for retrieving comment video
CN116775938B (en)* | 2023-08-15 | 2024-05-17 | 腾讯科技(深圳)有限公司 | Method, device, electronic equipment and storage medium for retrieving comment video

Also Published As

Publication number | Publication date
CN113806588B (en) | 2024-04-12

Similar Documents

Publication | Title
CN112131350B (en) | Text label determining method, device, terminal and readable storage medium
CN110232152B (en) | Content recommendation method, device, server and storage medium
CN107133345B (en) | Artificial intelligence-based interaction method and device
WO2023065211A1 (en) | Information acquisition method and apparatus
CN113806588B (en) | Method and device for searching videos
JP6361351B2 (en) | Method, program and computing system for ranking spoken words
CN107832338B (en) | Method and system for recognizing core product words
US9639633B2 (en) | Providing information services related to multimodal inputs
CN111279334A (en) | Search query enhancement with contextual analysis
KR20210091076A (en) | Method and apparatus for processing video, electronic device, medium and computer program
CN112989212B (en) | Media content recommendation method, device and equipment and computer storage medium
US10915756B2 (en) | Method and apparatus for determining (raw) video materials for news
CN113051911B (en) | Methods, devices, equipment, media and program products for extracting sensitive words
CN116977701A (en) | Video classification model training method, video classification method and device
CN110990598B (en) | Resource retrieval method and device, electronic equipment and computer-readable storage medium
CN114661951B (en) | Video processing method, device, computer equipment and storage medium
WO2023168997A1 (en) | Cross-modal retrieval method and related device
CN120014648A (en) | Video resource representation method, coding model training method and device
KR20210120203A (en) | Method for generating metadata based on web page
CN115017325B (en) | Text-based entity linking, recognition method, electronic device, and storage medium
CN117290544A (en) | Cross-mode short video recommendation method, system, terminal and storage medium
CN115114460B (en) | Method and device for pushing multimedia content
CN117390219A (en) | Video searching method, device, computer equipment and storage medium
CN117009170A (en) | Training sample generation method, device, equipment and storage medium
CN113849688A (en) | Resource processing method, resource processing device, electronic device, and storage medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
