Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the method of searching for a video or the apparatus for searching for a video of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a search application, a video playing application, a web browser application, a shopping application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting video playing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module, which is not particularly limited herein.
The server 105 may be a server providing various services, such as a background search server providing support for videos displayed on the terminal devices 101, 102, 103. The background search server can analyze and process the received data, such as a video search request, and feed back the search result to the terminal devices.
The server 105 is provided with a video search system, which comprises an application display layer, a core processing layer, and a data storage layer.
The application display layer mainly provides a visual interface through which the user interacts with the system. The core function of the system is retrieval, and searching for videos by audio and video clips is supported. The user uploads the content to be queried to the core processing layer. After the core processing layer finishes processing, the returned search results are displayed to the user in the form of a list.
The middle layer is the core processing layer, which includes functions such as multi-modal data feature extraction, feature transformation, and similarity search. The core processing layer first receives the original information transmitted from the application display layer and extracts feature representations through a feature extraction algorithm; it then calculates the similarity among the multi-modal data, retrieves data in the database similar to the content to be retrieved, and generates a ranked list according to the similarity.
The data storage layer stores the search data, the model file and the search record into a database.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module, which is not particularly limited herein. The server may also be a server of a distributed system, or a server incorporating a blockchain. The server can also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology.
It should be noted that the method for searching for a video provided by the embodiments of the present disclosure is generally performed by the server 105, and accordingly, the apparatus for searching for a video is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method of searching for a video in accordance with the present disclosure is shown. The method for searching for a video comprises the following steps:
Step 201, obtaining a video clip to be searched.
In this embodiment, an execution subject of the method for searching for a video (for example, the server shown in fig. 1) may receive, through a wired or wireless connection, a search request including the video clip to be searched from a terminal on which a user plays videos. Besides image frames, the video clip may also include subtitles and audio.
Step 202, obtaining a video label of the video clip.
In this embodiment, the video clip itself may carry a video tag, e.g., the name of the video clip. The video tag may also be input by the user. Alternatively, the video tag may be automatically generated by the search system.
Step 203, extracting video features from the video clip.
In the present embodiment, if the video clip includes only image frames, image features may be extracted from the video clip as the video features. If the video clip also includes subtitles (or other text content) or audio, text features or audio features can also be extracted from the video clip. The image feature can then be fused with at least one of the text feature and the audio feature to obtain the video feature: the image feature and the text feature can be fused, the image feature and the audio feature can be fused, or the image feature, the text feature, and the audio feature can all be fused to obtain the video feature.
Image features may be extracted for every video frame. Alternatively, video frames may be sampled at a certain time interval and similar frames filtered out to obtain key frames; then only the image features and text features of the key frames, and the audio features of the audio segments between key frames, are extracted.
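As a hedged illustration of this sampling-and-filtering step, the following sketch samples frames at a fixed interval and keeps only frames that differ sufficiently from the last kept key frame; it assumes OpenCV is available, and the interval and difference threshold are illustrative values not specified in this disclosure.

```python
import cv2
import numpy as np

def extract_key_frames(video_path, interval_sec=1.0, diff_threshold=10.0):
    """Sample frames at a fixed time interval and keep only frames that
    differ sufficiently from the previously kept key frame."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(round(fps * interval_sec)))
    key_frames, last_kept, index = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if last_kept is None or np.mean(cv2.absdiff(gray, last_kept)) > diff_threshold:
                key_frames.append(frame)   # dissimilar frame kept as a key frame
                last_kept = gray
        index += 1
    cap.release()
    return key_frames
```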
Image features of a video frame can be extracted through a deep neural network. For example, image feature extraction may employ a VGGNet network model. The network model has 5 convolutional layers and 3 fully-connected layers. The first 7 layers all use ReLU as the activation function, and the 8th layer uses the identity function as the activation function. The output of the network is the image feature vector. The specific calculation is as follows. For a convolutional layer, the output is computed as y = F((W_c * X) + b_c), where X is the input of the layer, W_c is the convolution kernel, b_c is the bias, F is the activation function, and * is the convolution operation. For a fully connected layer, the output is computed as y = G((W_f · X) + b_f), where X is the input of the layer, W_f is the weight vector, b_f is the bias, and G is the activation function. The input image passes through the convolutional layers and the fully-connected layers to obtain the image features.
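A minimal sketch of extracting an image feature vector with a VGG-style network follows, assuming a recent PyTorch and torchvision. The disclosure describes a VGGNet variant with 5 convolutional layers; this example reuses the stock VGG16 backbone purely for illustration, not as the exact network described above.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Stock VGG16 used as a stand-in for the VGGNet feature extractor described above.
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
vgg.classifier = vgg.classifier[:-1]   # drop the final classification layer, keep 4096-d features
vgg.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_feature(image_path: str) -> torch.Tensor:
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return vgg(x).squeeze(0)       # image feature vector
```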
Audio feature extraction process: the audio can be transferred from the time domain to the frequency domain through Mel Frequency Cepstrum Coefficients (MFCC), and then denoised, smoothed, and further represented by an audio network, so that effective audio features are extracted and feature dimensionality reduction is achieved. Let S denote the speech. The extraction of MFCC features can be expressed as v_mfcc = MFCC(S), where v_mfcc represents the MFCC features of the speech. Then, the MFCC features v_mfcc are used as the input of the audio network. Audio features can be extracted by an AudioNet, which comprises 3 convolutional layers, 1 pooling layer, and 1 fully connected layer. The MFCC features are thus further represented by the AudioNet.
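A hedged sketch of this audio branch, assuming librosa for MFCC extraction and PyTorch for the network; the exact AudioNet layer sizes are not given in this disclosure, so the dimensions below are illustrative.

```python
import librosa
import torch
import torch.nn as nn

def mfcc_features(wav_path: str, n_mfcc: int = 40) -> torch.Tensor:
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)       # v_mfcc = MFCC(S)
    return torch.from_numpy(mfcc).unsqueeze(0).unsqueeze(0).float()   # (1, 1, n_mfcc, frames)

# Illustrative AudioNet: 3 convolutional layers, 1 pooling layer, 1 fully connected layer.
class AudioNet(nn.Module):
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),         # pooling layer
        )
        self.fc = nn.Linear(64 * 4 * 4, out_dim)  # fully connected layer

    def forward(self, x):
        return self.fc(self.convs(x).flatten(1))  # reduced-dimension audio feature
```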
Text information can be recognized from the video clip through a text recognition model, and text features can be extracted from the text information through a pre-trained text feature extraction model. The text recognition model can use the FOTS (Fast Oriented Text Spotting) algorithm. FOTS is a fast end-to-end integrated detection and recognition framework, and it is faster than two-stage methods. The overall structure of FOTS is composed of four parts: a convolution sharing branch, a text detection branch, a RoIRotate (region of interest rotation) operation branch, and a text recognition branch. The backbone of the convolution sharing network is ResNet-50, and the role of convolution sharing is to connect low-level feature maps and high-level semantic feature maps. The RoIRotate operation mainly converts an angled text block into a horizontal text block after affine transformation.
Compared with other text detection and recognition algorithms, FOTS has the characteristics of a small model, high speed, high precision, and support for multiple angles.
After the content or subtitles (text information) of the video are acquired through the text recognition model, the text features in the text information are extracted through the text feature extraction model. The text feature extraction model may be a pre-trained language model, such as BERT (Bidirectional Encoder Representations from Transformers).
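A minimal sketch of extracting a text feature from recognized subtitle text with a pre-trained BERT model, assuming the Hugging Face transformers library; the checkpoint name and the mean-pooling strategy are illustrative choices not fixed by this disclosure.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def text_feature(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state    # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)             # 768-dimensional text feature
```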
Step 204, selecting a target video from the candidate video set based on the video tag and the video features, and outputting the target video.
In the present embodiment, the candidate video set is stored in the database. Each candidate video is provided with a video tag and video features. The video features of the candidate videos are extracted in the same manner as in step 203, and can be extracted and stored in the database in advance for later use.
The similarity between the video tag of the video clip and the video tag of each candidate video is calculated in turn, as is the matching degree between the video features of the video clip and the video features of each candidate video. Videos whose tag similarity is greater than a preset similarity threshold and whose feature matching degree is greater than a preset matching degree threshold are selected from the candidate video set and output for display to the user. Not only can similar videos be output, but the starting point and ending point of the video clip within the similar videos can also be located.
The video tag of the video clip and the video tag of each candidate video can be converted into vectors, and then the cosine similarity (or a distance under another metric) between the vectors can be calculated.
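For instance, a minimal sketch of this tag-similarity computation, assuming the tags have already been embedded as vectors (the embedding model is not fixed here), and aggregating per-tag best matches, which is an illustrative choice:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two tag vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def tag_similarity(query_tag_vecs, candidate_tag_vecs) -> float:
    """Average, over the clip's tag vectors, of the best match against the candidate's tag vectors."""
    scores = [max(cosine_similarity(q, c) for c in candidate_tag_vecs) for q in query_tag_vecs]
    return sum(scores) / len(scores)
```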
The degree of match between the video features of the video clip and the video features of the candidate video may be calculated by a matching model, which may also output the predicted probabilities of the starting point and ending point.
The matching model may use a bidirectional LSTM (Long Short-Term Memory) network. As shown in fig. 3c, the specific implementation steps are as follows:
1) Vector conversion. There is unimportant information in the video vector, so the input vector can first be converted using an attention mechanism. The transformed vector is more representative of the video than the original video vector.
2) Bilinear matching. The short video vector (the video features of the video clip) and the video vectors of the candidate video library (the video features of the candidate videos) are input into the LSTM model for matching, to obtain the final matching degree.
3) Positioning layer. The probability that each time point in the candidate video is a starting point or an ending point is predicted from the video matching result. In addition, the probability that a time point lies inside or outside the relevant video segment can be predicted. The prediction can be done using the LSTM followed by a Softmax function, as illustrated in the sketch after these steps.
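The following sketch, assuming PyTorch, illustrates the bidirectional-LSTM matching and start/end prediction described above; the hidden sizes, the pooling, and the bilinear form are illustrative choices, not a released implementation of this disclosure.

```python
import torch
import torch.nn as nn

class VideoMatcher(nn.Module):
    """Illustrative bidirectional-LSTM matcher with start/end point prediction."""
    def __init__(self, feat_dim: int = 1024, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.bilinear = nn.Bilinear(2 * hidden, 2 * hidden, 1)     # bilinear matching score
        self.point_head = nn.Linear(2 * hidden, 3)                 # start / end / inside logits

    def forward(self, query_feats, candidate_feats):
        # query_feats: (1, Tq, feat_dim); candidate_feats: (1, Tc, feat_dim)
        q, _ = self.lstm(query_feats)
        c, _ = self.lstm(candidate_feats)
        match = torch.sigmoid(self.bilinear(q.mean(dim=1), c.mean(dim=1)))  # overall matching degree
        point_probs = torch.softmax(self.point_head(c), dim=-1)             # per-time-point probabilities
        return match, point_probs
```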
The method provided by the embodiments of the disclosure supports the user in searching for videos by audio and video clips. The search results integrate audio, text, and image information, so the search accuracy is higher.
In some optional implementations of this embodiment, obtaining the video tag of the video clip includes: extracting at least one key frame from the video clip; for each key frame, outputting the description information of the key frame through a picture description model; extracting candidate video tags from the description information of each key frame; and determining a preset number of candidate video tags that are repeated most frequently among the candidate video tags as the video tags of the video clip.
For an input video clip, description information of the video clip can be generated, facilitating the subsequent matching of text information. The specific flow of generating the description tags of the video clip is as follows:
(1) Video frame cutting. The purpose of video frame cutting is to obtain the key frames of the video. Video key frames are defined as a collection of pictures that reflect the characteristics of the video content. The specific implementation is to first cut the video into frames, then filter out similar video frames, and finally obtain the effective key frames.
(2) Describing the content of the key frame. This is an image captioning task, which can be done using the NCPIC (an image-captioning network model under a compositional paradigm) model. The NCPIC model divides the process of generating the picture description into a semantic analysis part and a syntactic analysis part, and adds the internal structure information of the sentence in the syntactic analysis, so that the generated sentence better conforms to semantic rules; its effect is better than that of similar models on the image captioning task. The specific process of generating the picture description with the NCPIC model is as follows:
a. First, the objects in the picture are extracted by an object detection algorithm and formed into simple phrases, such as "football", "grass", "husky", "rose".
b. A sentence describing the objects in the picture is generated using connecting words from the corpus and the object information in the picture, for example, "puppies play soccer on the grass".
c. Whether the generated sentence conforms to grammatical rules is judged. If it is a reasonable sentence, the sentence is output directly; if not, step b is repeated with updated connecting words until a reasonable sentence is output.
(3) Generating candidate video tags. Candidate video tags are extracted from the description of each video key frame using the EmbedRank algorithm. Since the tags are extracted from the content of the video, they can effectively describe that content.
(4) Forming the final set of tags. The candidate tags of the video key frames are counted, and the tags whose repetition counts rank in the top N among the candidate video tags are taken as the final tags of the video, as in the sketch below.
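A minimal sketch of step (4), counting candidate tags across key frames and keeping the top-N most repeated ones; the example tag lists and the value of N are illustrative.

```python
from collections import Counter

def final_video_tags(per_frame_candidate_tags, top_n=5):
    """per_frame_candidate_tags: list of tag lists, one list per key frame."""
    counts = Counter(tag for tags in per_frame_candidate_tags for tag in tags)
    return [tag for tag, _ in counts.most_common(top_n)]

# Example: candidate tags extracted from three key-frame descriptions.
frames = [["puppy", "football", "grass"], ["puppy", "grass"], ["puppy", "football"]]
print(final_video_tags(frames, top_n=2))   # ['puppy', 'football']
```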
Through the above processing, description tags for an input video clip can be generated. This solves the problem of a video clip having no tag, as well as the problems of inaccurate and incomplete tags, thereby improving the speed and accuracy of video search.
In some optional implementations of this embodiment, extracting video features from the video segment includes: extracting image features from the video clips; and/or extracting audio features from the video clips; and/or identifying text information from the video clip, and extracting text features from the text information.
The algorithms in step 203 may be used to extract the image features, audio features, and text features. Video search based on image features can thus be realized, which solves the problem of low retrieval accuracy when video search relies on video tags alone.
In some optional implementations of this embodiment, the method further includes: and performing feature fusion on at least two items in the extracted features to obtain video features.
The text content and image features of the video frames obtained above may be fused. The process of feature fusion can refer to fig. 3b. The concrete steps are as follows:
1. A cross attention mechanism is first used to combine one high-dimensional input vector (e.g., the image features) with another high-dimensional vector (e.g., the text features) to generate a 1024-dimensional hidden vector. In this way, the high-dimensional input data can be mapped to a lower dimension by the attention mechanism and then fed into a deep Transformer.
2. A Transformer is used to convert the 1024-dimensional hidden vector into another hidden vector of the same size.
3. A fixed step size t is set, and the above process is repeated with the remaining high-dimensional input vector (the audio features) to obtain the final fused vector, as in the sketch below. The step size t is used to determine the time period of the audio and can be set to the time interval between key frames.
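A hedged sketch of these fusion steps, assuming a recent PyTorch; the 1024-dimensional hidden size follows the description above, while the number of heads, the Transformer depth, and the pre-projected input shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Cross-attention from one modality onto another, followed by a deep Transformer."""
    def __init__(self, dim: int = 1024, heads: int = 8, depth: int = 2):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, query_feats, context_feats):
        hidden, _ = self.cross_attn(query_feats, context_feats, context_feats)
        return self.transformer(hidden)            # hidden vector of the same size (1024-d)

# Steps 1-2: fuse image and text features; step 3: fuse the result with audio features.
fuse_img_text = FusionBlock()
fuse_with_audio = FusionBlock()

image_feats = torch.randn(1, 16, 1024)   # (batch, key frames, dim) -- assumed projected beforehand
text_feats = torch.randn(1, 16, 1024)
audio_feats = torch.randn(1, 16, 1024)   # one audio segment per step-size-t interval

video_feature = fuse_with_audio(fuse_img_text(image_feats, text_feats), audio_feats)
```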
The search is carried out through the fused video characteristics, and the search results integrate audio, text and image information, so that the search accuracy is higher.
In some optional implementations of this embodiment, after extracting the video features from the video clip, the method further includes: inputting the video tag into a text classification model to obtain a first class probability; inputting the video features into a video classification model to obtain a second class probability; and determining the category of the video clip based on the first class probability and the second class probability. Optionally, the category of the video clip is determined based on a weighted sum of the first class probability and the second class probability. Optionally, if the category is a violation category, the search is ended and warning information is output.
Classifying videos into different categories can improve the accuracy of subsequent video retrieval, and bad videos can be intercepted and prevented from being uploaded, thereby achieving video quality inspection. Based on artificial intelligence technology, various types of junk content in videos, such as politics-related, pornographic, vulgar, violent or terrorist, military or police, advertising, and nightclub content, can be found accurately and efficiently.
The classification process comprises the following steps:
1. The video tags obtained in step 202 are classified with a text classification model (such as a TextCNN model) to obtain a first class probability P1.
2. The fused video features are classified by a video classification model (such as a fully connected layer), and finally the second class probability P2 of the video is output.
3. The two probabilities are weighted and summed to obtain the final probability of each video category, thereby completing the classification of the video clip, as in the sketch below. If the classification result is an illegal video related to politics, pornography, vulgarity, or the like, the search is ended and warning information is output.
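A minimal sketch of the weighted combination of the two class probabilities; the label set, the equal weights, and the warning message are illustrative assumptions, and the two classifiers are assumed to share the same class vocabulary.

```python
import numpy as np

CLASSES = ["normal", "politics", "pornography", "vulgar"]   # illustrative label set

def classify_clip(p1, p2, w1=0.5, w2=0.5):
    """Weighted sum of text-classifier probabilities p1 and video-classifier probabilities p2."""
    fused = w1 * np.asarray(p1) + w2 * np.asarray(p2)
    category = CLASSES[int(np.argmax(fused))]
    if category != "normal":                     # violation category: stop the search and warn
        return category, "warning: violation detected, search terminated"
    return category, None
```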
In some optional implementations of this embodiment, selecting a target video from the candidate video set for output based on the video tag and the video features includes: filtering out candidate videos whose category does not match from the candidate video set to obtain a first target sub-candidate video set; calculating the text similarity between the video tag of each candidate video in the first target sub-candidate video set and the video tag of the video clip; filtering out candidate videos whose text similarity is smaller than a preset similarity threshold from the first target sub-candidate video set to obtain a second target sub-candidate video set; calculating the matching degree between the video features of each candidate video in the second target sub-candidate video set and the video features of the video clip; and determining candidate videos whose matching degree is greater than a preset matching degree threshold as the target video, and outputting the target video.
The video search process may incorporate video tags, video features, and video clip categories. The method comprises the following concrete steps:
and narrowing the range of the candidate videos through the categories of the videos.
Then, text similarity matching is performed between the video tag of the clip and the video tag descriptions of the candidate videos, further narrowing the search.
Finally, multi-modal matching is performed on the videos using the video features, to obtain the final videos together with the starting point and ending point in each video that correspond to the video clip, as in the sketch below.
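A hedged end-to-end sketch of this three-stage narrowing process; the candidate-video records, the thresholds, and the injected tag_sim and feat_match functions are placeholders standing in for the components sketched earlier, not a definitive implementation.

```python
def search(clip, candidates, category, tag_sim, feat_match,
           sim_threshold=0.7, match_threshold=0.8):
    """clip: dict with 'tags' and 'features'; candidates: list of dicts with
    'category', 'tags', 'features'. tag_sim / feat_match are the tag-similarity
    and multi-modal matching functions sketched above."""
    # Stage 1: keep only candidates whose category matches the clip's category.
    stage1 = [c for c in candidates if c["category"] == category]
    # Stage 2: keep candidates whose tag similarity exceeds the threshold.
    stage2 = [c for c in stage1 if tag_sim(clip["tags"], c["tags"]) >= sim_threshold]
    # Stage 3: multi-modal feature matching; keep matches above the threshold,
    # each with the predicted start and end points of the clip in the candidate.
    results = []
    for c in stage2:
        degree, start, end = feat_match(clip["features"], c["features"])
        if degree >= match_threshold:
            results.append({"video": c, "degree": degree, "start": start, "end": end})
    return sorted(results, key=lambda r: r["degree"], reverse=True)
```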
With continuing reference to fig. 3a-3c, fig. 3a-3c are schematic diagrams of application scenarios of the method of searching for a video according to the present embodiment. In the application scenario of fig. 3a, a user inputs a video clip to a search engine (server) via a terminal device. The search engine cuts the video clip into frames and extracts key frames; a key frame of the video clip is shown in the figure. Subtitle content is extracted from the key frame. An audio segment covering the time period (step size t) between the key frame and its preceding key frame is also cut out. Image features are extracted from the key frames by the convolutional layers. Text features are extracted from the subtitle content by BERT. Audio features are extracted from the audio segment by AudioNet. Feature fusion is then performed through the network structure shown in fig. 3b. Feature one (the image features) and feature two (the text features) are converted into low-dimensional vectors through attention calculation and then fed into a deep Transformer to obtain intermediate features. Then, the intermediate features and feature three (the audio features) are converted into low-dimensional vectors through attention calculation and fed into a deep Transformer to obtain the final fused features, namely the video features. The video features of the candidate videos, extracted by the same method, are stored in the database. Candidate videos can be filtered in advance through video tags and categories, narrowing the search range. The video features of the video clip are then matched against the filtered candidate videos one by one. As shown in fig. 3c, the video features are vector-converted and then matched using the LSTM, finding the matched videos and determining the starting point and ending point of the video clip in each candidate video.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method of searching for videos is shown. The flow 400 of the method for searching for a video includes the following steps:
step 401, obtaining a video clip to be searched.
Step 402, obtaining a video label of a video clip.
Step 403, extracting video features from the video clip.
Step 404, selecting a target video from the candidate video set based on the video tag and the video features, and outputting the target video.
Steps 401 to 404 are substantially the same as steps 201 to 204, and therefore a description thereof is omitted.
Step 405, analyzing the user's preference according to the user's search records, and storing videos preferred by the user.
In this embodiment, the search records include the video tags and video features of the video clips searched by the user. From the search records, the video tags of the searched video clips, and of videos whose video features match them to a high degree, can be analyzed to determine the user's preference, such as pet videos. Videos whose video tags are similar to, and whose video features match well with, the videos searched by the user are stored.
Step 406, selecting a first predetermined number of videos from the videos preferred by the user for recommendation.
In this embodiment, a portion of the videos preferred by the user is selected and recommended to the user.
Step 407, analyzing the similar users of the user according to the search records of the user, and saving the videos watched by the similar users.
In this embodiment, the search records include the video tags and video features of the video clips searched by the user. Users with interest preferences similar to those of the target user are found through the search records (for example, users who have searched for the same comedy short videos are found to be similar through video feature analysis, or users whose searched videos contain the same actors are found to be similar through video tag analysis), and then all videos watched by these users are stored.
Step 408, selecting a second preset number of videos from the videos watched by the similar users for recommendation.
In this embodiment, videos that were not selected in step 406 are selected from the videos watched by similar users for recommendation. That is, the final set of recommended videos is the union of the videos saved in step 405 and the videos obtained in step 407; this union is the set of videos finally pushed to the user. The recommendation set may be generated and sent to the user after both selection modes have been applied.
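A minimal sketch of forming the final recommendation set as the union of the two selections; the selection counts and the list-based deduplication are illustrative assumptions.

```python
def build_recommendations(preferred_videos, similar_user_videos, first_n=10, second_n=10):
    """Union of top videos preferred by the user (step 406) and videos watched by
    similar users that were not already selected (step 408)."""
    first = list(preferred_videos)[:first_n]
    remaining = [v for v in similar_user_videos if v not in first]
    second = remaining[:second_n]
    return first + second   # final set pushed to the user
```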
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for searching for a video in the present embodiment embodies the steps of video recommendation. Therefore, the scheme described in this embodiment not only includes a video search function, but also provides a video recommendation function, and can recommend the same type of video to the user.
In some optional implementations of this embodiment, the method further includes: receiving feedback information from the user; and recommending videos again according to the feedback information. The process of user feedback is in effect a labeling process: the functions of all the modules in the system can be optimized through the user's labels. Through the feedback of the user, the system can judge the user's search intent more accurately, and the recommendations match better. The feedback information may be indexes such as the user's click count (or rate), browsing duration, favorite count (or rate), and like count (or rate).
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of an apparatus for searching for a video, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the apparatus 500 for searching for a video of the present embodiment includes: a first obtaining unit 501, a second obtaining unit 502, a feature extraction unit 503, and an output unit 504. The first obtaining unit 501 is configured to obtain a video clip to be searched; the second obtaining unit 502 is configured to obtain a video tag of the video clip; the feature extraction unit 503 is configured to extract video features from the video clip; and the output unit 504 is configured to select a target video from the candidate video set for output based on the video tag and the video features.
In this embodiment, the specific processing of the first obtaining unit 501, the second obtaining unit 502, the feature extraction unit 503, and the output unit 504 of the apparatus 500 for searching for a video may refer to step 201, step 202, step 203, and step 204 in the embodiment corresponding to fig. 2.
In some optional implementations of the present embodiment, the second obtaining unit 502 is further configured to: extract at least one key frame from the video clip; for each key frame, output the description information of the key frame through a picture description model; extract candidate video tags from the description information of each key frame; and determine a preset number of candidate video tags that are repeated most frequently among the candidate video tags as the video tags of the video clip.
In some optional implementations of this embodiment, the feature extraction unit 503 is further configured to: extract image features from the video clip; and/or extract audio features from the video clip; and/or identify text information from the video clip and extract text features from the text information.
In some optional implementations of this embodiment, the apparatus 500 further comprises a fusion unit (not shown in the drawings) configured to: perform feature fusion on at least two of the extracted features to obtain the video features.
In some optional implementations of this embodiment, the apparatus 500 further comprises a classification unit (not shown in the drawings) configured to: after video features are extracted from the video clip, input the video tag into a text classification model to obtain a first class probability; input the video features into a video classification model to obtain a second class probability; and determine the category of the video clip based on the first class probability and the second class probability.
In some optional implementations of this embodiment, the output unit 504 is further configured to: filter candidate videos which do not match the category out of the candidate video set to obtain a first target sub-candidate video set; calculate the text similarity between the video tag of each candidate video in the first target sub-candidate video set and the video tag of the video clip; filter candidate videos with a text similarity smaller than a preset similarity threshold out of the first target sub-candidate video set to obtain a second target sub-candidate video set; calculate the matching degree between the video features of each candidate video in the second target sub-candidate video set and the video features of the video clip; and determine candidate videos with a matching degree greater than a preset matching degree threshold as the target video, and output the target video.
In some optional implementations of this embodiment, the apparatus 500 further comprises a first recommending unit (not shown in the drawings) configured to: analyze the user's preference according to the user's search records, and store videos preferred by the user, wherein the search records comprise video tags and video features of video clips searched by the user; and select a first preset number of videos from the videos preferred by the user for recommendation.
In some optional implementations of this embodiment, the apparatus 500 further comprises a second recommending unit (not shown in the drawings) configured to: analyze similar users of the user according to the search records, and store videos watched by the similar users, wherein the search records comprise video tags and video features of video clips searched by the user; and select a second preset number of videos from the videos watched by the similar users for recommendation.
In some optional implementations of the present embodiment, the apparatus 500 further comprises a feedback modification unit (not shown in the drawings) configured to: receive feedback information from the user; and recommend videos again according to the feedback information.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other processing of the personal information of the users involved all comply with the provisions of relevant laws and regulations, and do not violate public order and good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of flow 200 or 400.
A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of flow 200 or 400.
A computer program product comprising a computer program which, when executed by a processor, implements the method of flow 200 or 400.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the respective methods and processes described above, such as the method of searching for a video. For example, in some embodiments, the method of searching for a video may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the method of searching for a video described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured by any other suitable means (e.g., by means of firmware) to perform the method of searching for a video.
Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.