Disclosure of Invention
The video searching method, electronic device, and storage medium provided by this application aim to improve the accuracy of video search results, enable the electronic device to accurately search for and display videos corresponding to user needs, reduce user operations, and improve the user experience.
In order to achieve the above purpose, the application adopts the following technical solutions:
In a first aspect, the application provides a video searching method applied to an electronic device, where the electronic device may be a mobile phone, a tablet computer, a notebook computer, or another device that includes a gallery application. A first interface of the gallery application includes an input box that supports the function of searching for pictures/videos. The input box of the first interface includes a first text, which is a search text input by the user, for example, "girl doing yoga" or "boy standing near the bridge". The first interface includes a first thumbnail of a first video, the first thumbnail corresponding to one of the video frames in the first video. The first thumbnail includes a first point in time, the first point in time being a timestamp within the first video. When the user performs a triggering operation on the first thumbnail, for example a tap, the electronic device plays the first video from the first point in time.
The electronic device displays a second interface of the gallery application, the second interface including an input box, the input box including a second text. The second interface includes a second thumbnail of the first video, the second thumbnail including a second point in time, the second point in time being later than the first point in time. After the user performs a triggering operation on the second thumbnail, the electronic device starts to play the first video from the second point in time. That is, after the user triggers either the first thumbnail or the second thumbnail, the electronic device plays the same first video, but the starting points of playback differ, and therefore the video frames displayed to the user differ.
Based on the second text input by the user, the electronic device can find the first video. After the user triggers the second thumbnail, the electronic device plays the first video from the second point in time, which is later than the first point in time, and the played picture content meets the user's need. That is, after the user inputs a text, the video search result displayed by the electronic device can meet the user's need. Therefore, the electronic device can accurately display the video corresponding to the user's need, reduce user operations, and improve the user experience.
In one possible implementation, the first text entered by the user within the search box is different from the second text; for example, the first text may be "a boy standing on the river side" and the second text may be "a boy standing near the bridge". The first video matches the first text, for example, the first video includes picture content about "a boy standing on the river side". The first video also matches the second text, for example, the first video includes picture content about "a boy standing near the bridge". Therefore, based on the search text input by the user, the electronic device can display a video matching that text, which meets the user's need, spares the user from manually looking for the video, and improves the use experience of the electronic device.
In one possible implementation, the first video includes a first video frame and a second video frame, and the second video frame is played after the first video frame; that is, when the first video is played, the first video frame is played before the second video frame. The first thumbnail corresponds to the first video frame, i.e., the first thumbnail displays the first video frame of the first video. The second thumbnail corresponds to the second video frame, i.e., the second thumbnail displays the second video frame of the first video. The matching of the first text with the first video further includes matching the first text with the first video frame; for example, the first text is "a boy standing on the river side", and the picture content of the first video frame shows a boy standing on the river side. The matching of the second text with the first video further includes matching the second text with the second video frame; for example, the second text is "a boy standing near the bridge", and the picture content of the second video frame shows a boy standing near a bridge.
In this way, the search text input by the user matches the picture content displayed by the video frame corresponding to the thumbnail, and the search text reflects the user's need, so the electronic device can find the video that accurately corresponds to that need, which improves video search accuracy and further improves the user experience.
In one possible implementation, when the user triggers the first thumbnail of the first video and the electronic device starts playing the first video from the first point in time, the electronic device may start playing from the first video frame, or from a first start frame that precedes the first video frame. That is, the electronic device may start playing directly from the first video frame matched with the first text, or may start playing from the first start frame and then play on to the first video frame matched with the first text.
Similarly, when the user triggers the second thumbnail of the first video and the electronic device starts playing the first video from the second point in time, the electronic device may start playing from the second video frame matched with the second text, or may start playing from a second start frame that precedes the second video frame and then play on to the second video frame, where the second start frame is after the first video frame.
After the user triggers a thumbnail, the electronic device may start playing from the video frame matching the search text, or from an earlier video frame. At the same time, the second start frame precedes the second video frame and follows the first video frame, and the first video frame and the second video frame match the different first text and second text, respectively. This means that if the user triggers the second thumbnail, the user is not shown the first video frame matching the first text; instead, playback may begin from a second start frame that is closer to the second video frame. Compared with the first video frame, the picture content displayed by the second start frame is more strongly correlated with the picture content displayed by the second video frame. Therefore, the picture content of the video played for the user is more coherent and corresponds more accurately to the user's need, which further improves the user experience.
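As a purely illustrative aid (not part of the claimed method), the following minimal Python sketch shows one way the playback start point described above could be chosen: playback begins either at the video frame matched with the search text or at a slightly earlier start frame that still lies after the previously matched frame. The function name, the lead-in duration, and the time values are assumptions.

```python
from typing import Optional

def playback_start(matched_frame_s: float,
                   previous_matched_frame_s: Optional[float] = None,
                   lead_in_s: float = 5.0,
                   from_matched_frame: bool = False) -> float:
    """Return the point in time (in seconds) from which the video is played."""
    if from_matched_frame:
        # Play directly from the frame matched with the search text.
        return matched_frame_s
    # Otherwise rewind a little so the displayed content stays coherent...
    start = max(0.0, matched_frame_s - lead_in_s)
    if previous_matched_frame_s is not None:
        # ...but never rewind past the frame matched by the other search text.
        start = max(start, previous_matched_frame_s)
    return start

# Example: the second video frame sits at 138 s and the first at 110 s;
# playback begins a few seconds before 138 s and never before 110 s.
print(playback_start(138.0, previous_matched_frame_s=110.0))  # 133.0
```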
In one possible implementation, the video searching method further includes the electronic device displaying a third interface of the gallery application, the third interface including an input box, the input box including a third text. The third text is different from the first text. For example, the first text may be "a boy standing on the river side", and the third text may be "a boy standing outdoors". The third interface includes the first thumbnail of the first video. After the user performs a triggering operation on the first thumbnail, the electronic device plays the first video from the first point in time.
In practical applications, different users may describe the same video with different text, and the same user may also change how they describe a video; that is, users may input different search texts yet expect the same video search result. In this application, both the first text and the third text can find the first video, and the electronic device displays the first thumbnail in each case; that is, the electronic device can return the same video search result based on different search texts. Therefore, the electronic device can find the same video that accurately corresponds to the user's need based on different search texts, which improves video search accuracy and the user experience.
In one possible implementation, the first interface further includes a third thumbnail of the first video, the third thumbnail including a third point in time. The video searching method further includes: after the user triggers the third thumbnail, the electronic device starts to play the first video from the third point in time. The first text matches a third video frame corresponding to the third thumbnail, and the third point in time is later than the second point in time. The first text may also match the first video frame corresponding to the first thumbnail; that is, the same search text may match different video frames of the same video. Therefore, different video frames of a video can be found based on the same search text, so the application can display search results that accurately correspond to the search text, improve the accuracy of video search, and further improve the user experience.
In one possible implementation, when the user triggers the third thumbnail of the first video and the electronic device starts playing the first video from the third point in time, the electronic device may start playing from the third video frame matched with the first text, or from a third start frame that precedes the third video frame and then play on to the third video frame, where the third start frame is after the second video frame.
Therefore, the first video may be played from the third start frame, whose picture content is strongly correlated with the third video frame, or directly from the third video frame, so that the picture content matched with the first text is displayed to the user, meeting the user's need and improving the user experience.
In one possible implementation, the first interface further includes a fourth thumbnail of a second video, the fourth thumbnail including a fourth point in time, and the video searching method further includes the electronic device playing the second video from the fourth point in time when the user triggers the fourth thumbnail. The first text matches a fourth video frame corresponding to the fourth thumbnail. Therefore, based on the same search text input by the user, different video frames of different videos matched with the search text can be found, which improves the accuracy of video search and further improves the user experience.
In one possible implementation, the video search method further includes the electronic device displaying a negative one-screen interface, the negative one-screen interface including a search box that supports online searches and searches of local files of the electronic device. The search box includes a first text. The negative one-screen interface includes a first thumbnail of the first video, the first thumbnail including a first point in time. When the user triggers the first thumbnail, the electronic device may begin playing the first video from the first point in time.
In this way, the application supports the user in searching for videos through the search box of the gallery application and also through the search box of the negative one-screen interface, which further improves the user experience.
In a second aspect, the application provides a video searching method applied to an electronic device, where the electronic device may be a mobile phone, a tablet computer, a notebook computer, or another device that includes a gallery application. A first interface of the gallery application includes an input box that supports the function of searching for pictures/videos. The input box of the first interface includes a first text, which is a search text input by the user, such as "boy standing outdoors" or "girl dancing indoors". The first interface includes a first thumbnail of a first video. The first thumbnail includes a first point in time, the first point in time being a timestamp within the first video. The first text is matched with the first video frame corresponding to the first thumbnail through a CLIP model; that is, the text semantics of the first text are matched with the visual semantics of the first video frame. The CLIP model is a pre-trained neural network model for matching images and text, here the first text input by the user and the first video frame. When the user performs a triggering operation on the first thumbnail, for example a tap, the electronic device plays the first video from the first point in time.
The electronic device displays a second interface of the gallery application including an input box, the input box including a second text. The second interface includes a second thumbnail of the first video, the second thumbnail including a second point in time, the second point in time being later than the first point in time. The second text is matched with the second video frame corresponding to the second thumbnail through the CLIP model; that is, the text semantics of the second text are matched with the visual semantics of the second video frame. After the user performs a triggering operation on the second thumbnail, the electronic device starts to play the first video from the second point in time. That is, after the user triggers either the first thumbnail or the second thumbnail, the electronic device plays the same first video, but the starting points of playback differ, and therefore the video frames displayed to the user differ.
In this way, based on the search text input by the user, a video frame matching the search text can be searched. The text semantics of the search text are fully associated with the visual semantics of the video frame, so that the fusion interaction of the search text and the picture content displayed by the video frame can be realized, the video search accuracy is further improved, and the use experience of a user is improved.
In one possible implementation, dividing a video yields a number of video segments. The first video frame is in a first video segment of the first video, and the second video frame is in a second video segment of the first video. The first video frame and the second video frame may be determined as follows. The electronic device performs first processing on the first video, for example frame-splitting processing, to obtain a plurality of video frames of the first video and a classification label of each video frame, where a classification label indicates the type of object displayed by a video frame; for example, a classification label may be a person, plant, animal, building, or natural scene. The electronic device then performs second processing on the first video based on the classification labels corresponding to the plurality of video frames, for example segmentation processing, dividing the first video into a plurality of video segments that include the first video segment and the second video segment. The electronic device then determines one video frame of the first video segment as the first video frame; that is, the first video frame is determined as the representative frame of the first video segment and can be used to represent the first video segment. Likewise, the electronic device determines one video frame of the second video segment as the second video frame, which is the representative frame of the second video segment and can be used to represent it.
In this way, the scheme implemented on the electronic device does not require building an index for every video frame of a video, which avoids the search and matching delay that would entail. Instead, the index of the first video segment can be built based on the first video frame, and the index of the second video segment based on the second video frame, which greatly reduces the number of indexes, speeds up video search, improves the user experience, and also saves the cost of building the index.
In one possible implementation, the electronic device performing the second processing on the first video based on the classification labels corresponding to the plurality of video frames further includes performing the second processing based on the differences in image parameters between adjacent video frames of the first video and the differences in their classification labels, where an image parameter is a display characteristic of a video frame, such as jitter, sharpness, or pixel values. For example, the electronic device may perform the second processing on the first video based on the difference in sharpness between adjacent video frames, such as the first video frame and the second video frame, and the difference between their classification labels, thereby obtaining a plurality of video segments including the first video segment and the second video segment.
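The following is a minimal sketch of how such second processing could be implemented, assuming each frame already carries a classification label and one image parameter (sharpness) from the first processing; the threshold value and field names are illustrative assumptions, not details from the application.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    index: int
    label: str        # classification label, e.g. "person", "building", "natural scene"
    sharpness: float  # one example of an image parameter

def split_into_segments(frames: List[Frame], sharpness_jump: float = 0.3) -> List[List[Frame]]:
    """Cut the frame sequence whenever the classification label changes or the
    image-parameter difference between adjacent frames exceeds a threshold."""
    if not frames:
        return []
    segments: List[List[Frame]] = []
    current: List[Frame] = [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        label_changed = cur.label != prev.label
        quality_jump = abs(cur.sharpness - prev.sharpness) > sharpness_jump
        if label_changed or quality_jump:
            segments.append(current)   # close the current video segment
            current = []
        current.append(cur)
    segments.append(current)
    return segments
```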
In this way, changes in the type of objects displayed across the plurality of video frames can be determined based on the classification labels, changes in displayed image quality can be determined based on the image parameters, and video frames whose displayed picture content is more similar can be grouped into one video segment. A video frame that can represent each video segment can then be determined, so that the representative frames corresponding to the plurality of video segments of a video represent the picture content of the video more completely.
In one possible implementation, the electronic device determining one video frame of the first video segment as the first video frame, i.e., as the representative frame of the first video segment, further includes the electronic device determining, as the first video frame, the start frame (i.e., the first frame of the first video segment), the end frame (i.e., the last frame), a random one of the plurality of video frames of the first video segment, or the video frame corresponding to the point in time of a first position. The point in time of the first position can be obtained as the average of the start point in time and the end point in time of the first video segment, that is, the middle point in time of the first video segment, and the video frame corresponding to that middle point in time is the middle frame of the first video segment. In this way, any one video frame of a video segment may represent the video segment, which facilitates the electronic device building an index for the video segment.
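For illustration only, the sketch below picks a representative frame for one video segment using the strategies listed above; the "middle" strategy averages the start and end points in time as described. Names and defaults are assumptions.

```python
import random
from typing import Sequence

def representative_frame(frame_times: Sequence[float], strategy: str = "middle") -> float:
    """frame_times: points in time (seconds) of the frames in one video segment."""
    if strategy == "start":
        return frame_times[0]            # the first frame of the segment
    if strategy == "end":
        return frame_times[-1]           # the last frame of the segment
    if strategy == "random":
        return random.choice(list(frame_times))
    # "middle": the frame closest to the average of the start and end points in time
    midpoint = (frame_times[0] + frame_times[-1]) / 2
    return min(frame_times, key=lambda t: abs(t - midpoint))
```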
In one possible implementation, the electronic device determining one video frame of the first video segment as the first video frame further includes the electronic device determining the first video frame based on the image parameters and the classification labels respectively corresponding to the plurality of video frames of the first video segment. In other words, a video frame that is more representative of the first video segment is determined based on the image quality of the video frames and the types of objects they display. This improves the accuracy of the index built from the representative frame, which in turn improves the accuracy of video search and the user experience.
In one possible implementation, matching the first text with the first video frame corresponding to the first thumbnail through the CLIP model further includes the electronic device inputting the first text into a text encoder of the CLIP model to obtain a text semantic vector of the first text, where the text semantic vector can represent the semantic features of the whole first text. The electronic device inputs the first video frame into an image encoder of the CLIP model to obtain a visual semantic vector of the first video frame, where the visual semantic vector can represent the semantic features of the first video frame. The electronic device then matches the first text with the first video frame based on the text semantic vector and the visual semantic vector.
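As one possible stand-in for the encoders described above, the hedged sketch below uses the Hugging Face transformers implementation of CLIP to obtain a text semantic vector and a visual semantic vector and compare them; the application does not prescribe a specific library, checkpoint, or file name, so these are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def match(text: str, frame_path: str) -> float:
    """Return the cosine similarity between a search text and one video frame."""
    frame = Image.open(frame_path)
    inputs = processor(text=[text], images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_vec = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        image_vec = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_vec = text_vec / text_vec.norm(dim=-1, keepdim=True)    # text semantic vector
    image_vec = image_vec / image_vec.norm(dim=-1, keepdim=True) # visual semantic vector
    return float((text_vec @ image_vec.T).item())

# e.g. match("a boy standing near the bridge", "frame_0218.jpg") -> similarity score
```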
Because the first video frame is used to represent the first video segment, its visual semantic vector is sufficiently associated with the first video segment. By matching the text semantic vector of the first text against the visual semantic vector of the first video frame, the electronic device realizes fusion interaction between the search text and the picture content displayed by the representative frame, which further improves video search accuracy and the user experience.
In one possible implementation, a first vector similarity between the text semantic vector of the first text and the visual semantic vector of the first video frame is greater than or equal to a first threshold. Setting the first threshold selects the visual semantic vectors of first video frames with high vector similarity, so that more accurate video search results are displayed to the user.
In one possible implementation, an inverted index library includes a plurality of visual semantic vectors, and the electronic device clusters these visual semantic vectors to determine a plurality of cluster center points and the cluster corresponding to each center point. The visual semantic vector of the first video frame belongs to a first cluster corresponding to a first cluster center point. A second vector similarity between the text semantic vector of the first text and the vector of the first cluster center point of the inverted index library is greater than or equal to a second threshold. A third vector similarity between the text semantic vector of the first text and the visual semantic vector is greater than or equal to a third threshold.
In this way, when the electronic device performs search matching based on the text semantic vector of the search text, it can first match against the plurality of cluster center points and then against the visual semantic vectors within the cluster of the selected center point, so that it does not have to match against every index in the index library. This avoids search delay, improves video search efficiency, and further improves the user experience.
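The two-stage lookup can be sketched as follows; clustering with k-means (here via scikit-learn) is one assumed way to obtain the cluster center points, and the threshold values are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_inverted_index(visual_vecs: np.ndarray, n_clusters: int = 16):
    """Cluster all visual semantic vectors; return center points and cluster labels."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(visual_vecs)
    return km.cluster_centers_, km.labels_

def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    return (a @ b.T) / (np.linalg.norm(a) * np.linalg.norm(b, axis=-1))

def search(query_vec, centers, labels, visual_vecs,
           second_threshold: float = 0.2, third_threshold: float = 0.3):
    # Stage 1: match the text semantic vector against the cluster center points.
    center_sims = cosine(query_vec, centers)
    best = int(np.argmax(center_sims))
    if center_sims[best] < second_threshold:
        return []
    # Stage 2: match only against the visual semantic vectors inside that cluster.
    members = np.where(labels == best)[0]
    sims = cosine(query_vec, visual_vecs[members])
    return [(int(i), float(s)) for i, s in zip(members, sims) if s >= third_threshold]
```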
In one possible implementation, an entity of the first text matches an entity of an attribute tag of the first video segment. Therefore, based on the search text input by the user, the electronic device performs entity matching in addition to matching the text semantic vector; that is, the electronic device can perform both vector recall and entity recall, and can display video search results that appear in both the vector recall results and the entity recall results, which further improves video search accuracy and the user experience.
In one possible implementation, the first interface displayed by the electronic device further includes a third thumbnail of a second video, and the third video frame corresponding to the third thumbnail is in a third video segment of the second video. An entity of the first text matches an entity of an attribute tag of the third video segment. Therefore, the electronic device can also perform entity recall on top of vector recall and display both the entity recall results and the vector recall results to the user, which enriches the video search results, further improves video search accuracy, and improves the user experience.
In one possible implementation, the first interface displayed by the electronic device further includes a fourth thumbnail of the first video, the first text matches a fourth video frame corresponding to the fourth thumbnail, the fourth video frame is in a fourth video segment of the first video, and an entity of the first text matches an entity of an attribute tag of the fourth video segment. In addition, an entity of the first text matches an entity of an attribute tag of the first video segment, and the first text matches the first video frame. On the first interface displayed by the electronic device, the first thumbnail is displayed before the fourth thumbnail.
The display order of the first thumbnail and the fourth thumbnail may be determined as follows. The electronic device determines a first comprehensive matching degree between the first thumbnail and the first text based on a first vector similarity between the visual semantic vector of the first video frame and the text semantic vector of the first text and a first matching degree between the attribute tag of the first video segment and the entity of the first text. The electronic device determines a second comprehensive matching degree between the fourth thumbnail and the first text based on a second vector similarity between the visual semantic vector of the fourth video frame and the text semantic vector of the first text and a second matching degree between the attribute tag of the fourth video segment and the entity of the first text. The electronic device then displays the thumbnails in descending order of comprehensive matching degree; because the first comprehensive matching degree is greater than the second comprehensive matching degree, the first thumbnail is displayed before the fourth thumbnail, i.e., the thumbnail with the higher comprehensive matching degree is displayed first.
In this way, the video search results are ordered by comprehensive matching degree, ensuring that the results ranked higher better match the user's search text, which further improves the user experience.
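A simple illustration of such ranking is given below; the linear weighting of the two factors is an assumption, since the application only states that the vector similarity and the entity matching degree are combined into a comprehensive matching degree.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    thumbnail_id: str
    vector_similarity: float  # text semantic vector vs. visual semantic vector
    entity_match: float       # match between query entities and segment attribute tags

def comprehensive_score(c: Candidate, w_vec: float = 0.7, w_entity: float = 0.3) -> float:
    return w_vec * c.vector_similarity + w_entity * c.entity_match

def rank(candidates: List[Candidate]) -> List[Candidate]:
    return sorted(candidates, key=comprehensive_score, reverse=True)

ranked = rank([Candidate("first_thumbnail", 0.82, 1.0),
               Candidate("fourth_thumbnail", 0.79, 1.0)])
# The first thumbnail scores higher, so it is displayed before the fourth thumbnail.
```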
In a third aspect, the application provides an electronic device comprising a memory, a display screen, and one or more processors, the memory storing computer program code comprising computer instructions, the display screen providing a display function, and the one or more processors invoking the computer instructions to cause the electronic device to perform the method of the first or second aspect described above.
In a fourth aspect, the present application provides a computer readable storage medium having a computer program stored therein, which when executed by a processor, implements the method of the first or second aspect described above.
According to the technical scheme, the application has the following beneficial effects:
Based on a first text input by the user, the first video can be found and the first thumbnail displayed; after the user triggers the first thumbnail, the first video can be played from the first point in time included in the first thumbnail, and the picture content played from the first point in time meets the user's need. Based on a second text input by the user, the first video can also be found and the second thumbnail displayed; after the user triggers the second thumbnail, the first video can be played from the second point in time included in the second thumbnail, and the picture content played from the second point in time meets the user's need. That is, the first video matches both the first text and the second text, but the point in time at which playback starts differs depending on which thumbnail the user triggers. Therefore, picture content matching the text input by the user can be displayed, which meets the user's need, reduces manual searching operations, and further improves the user experience.
Detailed Description
The technical advantages of the video searching method provided by the application are explained below by comparison with the related art. For ease of understanding, the description uses an example scenario in which the electronic device is a mobile phone and a plurality of pictures and videos are stored in the gallery application of the mobile phone.
First, terms involved in the embodiments of the present application are described. It should be understood that these descriptions are intended to provide a clearer understanding of the embodiments of the application and are not necessarily to be construed as limiting the embodiments.
Video frame: any frame in a video. One frame is a still picture in the video, and consecutive frames form the video.
Video segmentation refers to segmentation obtained by dividing video. In some embodiments, the video may be first de-framed to obtain video frames, and then a video segmentation algorithm may be used to segment the video including the plurality of video frames to obtain video segments. Reference is made to the description of the embodiments below for a specific implementation.
Representative frame: one of the video frames in a video segment that can be used to represent the video segment. Illustratively, the representative frame may be the start frame, the end frame, a random video frame, or the highest-scoring optimal frame of the video segment.
CLIP model: the CLIP (Contrastive Language-Image Pre-training) model is a pre-trained neural network model for matching images and text. In some embodiments, a text encoder (Text Encoder) and an image encoder (Image Encoder) of the CLIP model are trained by contrastive learning, yielding a text encoder that outputs text semantic vectors of text and an image encoder that outputs visual semantic vectors of images or video frames.
Text semantic vector: a vector obtained by inputting text into the text encoder; it can characterize the semantic features of the entire text. Illustratively, the text encoder may employ a Transformer or another model commonly used in natural language processing (NLP), which is not limited by the application. In the embodiment of the application, the search text input by the user can be input into the text encoder to obtain the text semantic vector of the search text.
Visual semantic vector: a vector obtained by inputting an image or a video frame into the image encoder; it can characterize the semantic features of the image or video frame. Illustratively, the image encoder may employ a CNN model or a ViT model, which is not limited by the application. In the embodiment of the application, the representative frame of a video can be input into the image encoder to obtain the visual semantic vector of the representative frame.
Vector similarity: a term describing the degree of similarity between two vectors (e.g., between a text semantic vector and a visual semantic vector). In embodiments of the present application, a video frame that matches a search text may be determined by comparing the similarity between the text semantic vector of the search text and the visual semantic vector. The vector similarity may be calculated by, for example, a cosine similarity formula, but may also be calculated by other methods.
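For reference, one common way to compute this similarity (the cosine form mentioned above) is sketched below.

```python
import numpy as np

def cosine_similarity(text_vec: np.ndarray, visual_vec: np.ndarray) -> float:
    """Cosine similarity between a text semantic vector and a visual semantic vector."""
    return float(np.dot(text_vec, visual_vec) /
                 (np.linalg.norm(text_vec) * np.linalg.norm(visual_vec)))
```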
Entity: a word with a specific meaning in text. Illustratively, an entity may include, but is not limited to, a time, a place, a person name, an organization name, or a proper noun in the text. In some embodiments, entities with specific meanings in text may be identified by named entity recognition (NER) technology, which is not limited by the application.
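As an illustration only, an off-the-shelf NER pipeline such as spaCy can extract such entities from a search text; the model name and the example labels are assumptions, and the application does not mandate any particular tool.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # a general-purpose English pipeline
doc = nlp("videos of a boy standing near the Golden Gate Bridge last summer")
for ent in doc.ents:
    # e.g. "the Golden Gate Bridge" -> FAC, "last summer" -> DATE
    print(ent.text, ent.label_)
```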
Image parameter: indicates the presentation characteristics of an image or video frame. By way of example, image parameters may include jitter, sharpness, and pixel values of a video frame, which is not limited by the application.
Attribute tag: information indicating attributes of a video or video segment. In the embodiment of the application, the attribute tags of a video may include the video capture place, the video capture time, person names, the file name, or classification labels.
In the related art, when the gallery of an electronic device stores a video, it also stores the shooting location or shooting time of the video, keywords in the file name of the video, and the like as attribute tags, and uses these attribute tags as the index of the video. The user can later input simple search text, such as a time, a place, or a keyword from the file name, on the search interface of the gallery; the electronic device then performs search matching between the search text input by the user and the index (attribute tags) of the video to realize video search.
Assume that the electronic device is a mobile phone 300 whose gallery stores video 1, and that the attribute tags "star" and "this year" have been assigned to video 1 in advance. Illustratively, "star" may be a file name manually configured by the user for video 1, and "this year" is the capture time of video 1. As shown in fig. 1a, when the search text input by the user in the search box of the gallery provided by the mobile phone 300 is the keyword "star", the mobile phone 300 can find the video whose attribute tag is "star" and display a search result including video 1.
However, as shown in fig. 1b, if the user inputs the complex search text "boys standing beside trees" in the search box of the gallery, the mobile phone 300 cannot find video 1, and the user still has to look for it manually.
In practical applications, a user may input the complex search text "boys standing beside trees" to describe a desired video based on its displayed picture content. In the related art, however, the electronic device only supports keyword matching against attribute tags, so it easily fails to display the video the user wants, and the user still has to slide the scroll bar of the gallery manually to look for it, which is cumbersome and affects the user's experience with the mobile phone 300.
In order to solve the above-mentioned problems, the embodiments of the present application provide a video searching method, which can be applied to an electronic device, and for convenience of understanding, the composition of the electronic device and its software structure are described.
The application does not limit the type of the electronic device. For example, the electronic device may be a mobile phone, a tablet computer, a desktop computer, a laptop computer, a notebook computer, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, a personal digital assistant (PDA), a wearable electronic device, a smart watch, or the like; the specific form of the electronic device is not particularly limited by the present application.
In this embodiment, as shown in fig. 2a, the electronic device may include a processor 110, an internal memory 120, a camera 130, a display screen 140, an audio module 150, a speaker 150A, and a headset interface 150B.
The processor 110 may include one or more processing units, for example, the processor 110 may include a video codec, and/or a neural Network Processor (NPU), etc.
Processor 110 may also be provided with a memory for storing instructions and data.
The internal memory 120 may be used to store computer-executable program code, which includes instructions. The processor 110 runs the instructions stored in the internal memory 120 to execute various functional applications and data processing of the electronic device.
The internal memory 120 may include a program storage area and a data storage area. The program storage area may store the operating system and application programs required for at least one function (such as a sound playing function and an image playing function during video playback). The data storage area may store data created during use of the electronic device (such as video data).
In some embodiments, the internal memory 120 stores instructions for performing a video search method. The processor 110 may perform a search for video by executing instructions stored in the internal memory 120.
In some embodiments, the electronic device performs video search among videos stored in the gallery application, which may be videos shot by the user using the electronic device. The electronic device may implement the shooting function through an ISP, the camera 130, a video codec, a GPU, the display screen 140, an application processor, and the like.
The ISP is used to process the data fed back by the camera 130. In some embodiments, the ISP may be provided in the camera 130. The camera 130 is used to capture still images or video. Video codecs are used to compress or decompress digital video. Thus, the electronic device may play or record video in a variety of encoding formats, such as moving picture experts group (moving picture experts group, MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU can implement intelligent cognition and other applications of the electronic device, such as image recognition, face recognition, speech recognition, and text understanding. In some embodiments, the NPU may perform text understanding of the search text entered by the user on a search interface provided by the electronic device.
The electronic device implements display functions through the GPU, the display screen 140, and the application processor, etc. The GPU is a microprocessor for image processing, and is connected to the display screen 140 and the application processor.
A series of graphical user interfaces (GUIs) may be displayed on the display screen 140 of the electronic device; these GUIs are the home screen of the electronic device. In general, the number of controls that the display screen 140 of the electronic device can display is limited, and the user can interact with a control by direct manipulation to read or edit information of the corresponding application.
In some embodiments, the electronic device may include a gallery application; the display screen 140 of the electronic device may display an icon corresponding to the gallery application, and after the user triggers the icon, the display screen 140 may display a search interface of the gallery application, the search interface including a search control. The user can edit the search control to input search text and perform a triggering operation, so that the electronic device searches the gallery application based on the search text input by the user and displays the search results to the user through the display screen 140.
In some embodiments, if the user triggers the video stored by the gallery application, the electronic device may implement audio functions when the video is played through the audio module 150, speaker 150A, headphone interface 150B, and application processor, etc.
The audio module 150 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The electronic device may listen to audio as the video is played through speaker 150A. The earphone interface 150B is used to connect a wired earphone. The electronic device may listen to audio while video is playing through a wired headset connected to headset interface 150B.
It is to be understood that the configuration illustrated in this embodiment does not constitute a specific limitation on the electronic apparatus. In other embodiments, the electronic device may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
In addition, an operating system runs on these components, such as the iOS operating system developed by Apple, the open-source Android operating system developed by Google, or the Windows operating system developed by Microsoft. Applications can be installed and run on the operating system.
The operating system of the electronic device may employ a layered architecture, an event driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiment of the application, an Android system with a layered architecture is taken as an example, and the software structure of the electronic equipment is illustrated.
Fig. 2b is a software architecture block diagram of an electronic device according to an embodiment of the application.
The layered architecture divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers: from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
The application layer may include a series of application packages. As shown in fig. 2b, the application package may include applications such as gallery service modules, cameras, music, video playback applications, and the like.
In some embodiments, the gallery service module stores pictures/videos obtained through user operations such as shooting, downloading, screen capturing, or screen recording; it also stores information such as the representative frames of video segments and the visual semantic vectors of those representative frames, receives the search text input by the user so that the electronic device can perform video search matching based on it, and displays the pictures/videos obtained through the above operations.
In some embodiments, the video playback application may be a native video playback application of the electronic device. In some embodiments, the video playback application may also be a third party video playback application.
The application framework layer provides an application programming interface (application programming interface, API) and programming framework for the application of the application layer. The application framework layer includes a number of predefined functions. As shown in FIG. 2b, the application framework layer may include a window manager, a content provider, a resource manager, a view system, and the like.
The window manager is used for managing window programs. The content provider is used to store and retrieve data and make such data accessible to applications. The data may include video, images, etc. The resource manager provides various resources for the application program, such as localization strings, icons, pictures, layout files, video files, and the like. The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications.
The Android runtime includes a core library and a virtual machine. The Android runtime is responsible for scheduling and management of the Android system. The core library comprises two parts: one part is the function libraries that the Java language needs to call, and the other part is the Android core library.
The system library may include surface manager (surface manager), media library (Media Libraries), three-dimensional graphics processing library (e.g., openGL ES), two-dimensional graphics engine (e.g., SGL), etc. functional modules.
The surface manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications. Media libraries support a variety of commonly used audio, video format playback and recording, still image files, and the like. The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like. The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The inner core layer at least comprises a display driver, a camera driver, an audio driver and a sensor driver.
In addition, the electronic device can also comprise a search module, a multi-mode understanding module, a natural language understanding module and other functional modules. The modules may all be located in the same layer of the electronic device, may be located in different layers of the electronic device, or may be located in multiple layers of the electronic device at the same time, so as to implement functions thereof through software interfaces between the layers.
In some embodiments, the search module is configured to construct the indexes corresponding to videos based on the information stored in the gallery service module; it also performs vector recall among the plurality of indexes based on the text semantic vector corresponding to the search text, performs entity recall among the plurality of indexes based on the entities in the search text, and ranks the vector recall results and the entity recall results to obtain the search results.
In some embodiments, the multi-mode understanding module is configured to perform frame splitting processing on the video to obtain video frames, perform segmentation processing on the video by using a video segmentation algorithm to obtain video segments, and then determine a representative frame of the video segments and a visual semantic vector of the representative frame.
In some embodiments, the natural language understanding module is used to identify entities in search text entered by a user.
The video searching method provided by the embodiment of the application can be realized based on interaction among the four modules, namely the gallery service module, the searching module, the multi-mode understanding module and the natural language understanding module, and the specific implementation can be seen in fig. 5b and the detailed description in the following embodiments.
Although the Android system is taken as an example for explanation, the basic principle of the embodiment of the application is also applicable to electronic devices based on iOS, windows and other operating systems.
In order to make the technical scheme of the present application more clearly understood by those skilled in the art, the application scenario of the technical scheme of the present application is first described below.
The video searching method provided by the embodiment of the application can be implemented in a gallery searching scene of the electronic equipment.
Referring to fig. 3a, the video searching method provided by the embodiment of the present application is described by taking a mobile phone 300 as an example of the electronic device. In this scenario, the gallery of the mobile phone 300 may include multiple pictures and multiple videos. In some embodiments, the pictures/videos may be shot by the user with the mobile phone 300 and stored in the gallery, may be downloaded by the user through other application platforms and stored in the gallery, or may be sent to the mobile phone 300 by other electronic devices and then saved.
In the video searching method, the mobile phone 300 can build an index for the videos stored in the gallery while charging or with the screen off. In the process of building the index, the mobile phone 300 may perform frame-splitting processing on a video to obtain video frames, perform segmentation processing on the video based on the plurality of video frames by using a video segmentation algorithm to obtain video segments, determine the representative frame of each video segment and the visual semantic vector of that representative frame, and then build the index of each video segment based on the visual semantic vector of its representative frame. The specific implementation can be seen in fig. 5b and in the detailed description of the embodiments below. It should be appreciated that the pictures stored in the gallery do not need frame splitting, and the index of a picture may be constructed with reference to the way the index is built from the representative frames of the video segments.
Assume that a plurality of videos are captured by the mobile phone 300 and stored in the mobile phone 300 during a weekend trip by the user. The user wishes to clip the video photographed during the tour at leisure time, and the user can search through the search function of the gallery of the mobile phone 300 at this time so that the mobile phone 300 displays the video photographed during the tour.
As shown in fig. 3a, the handset 300 displays a main interface 310, where the main interface 310 includes application icons of a plurality of application programs, such as application icon 311 of a gallery. The user triggers the application icon 311 of the gallery, the mobile phone 300 starts the gallery in response to the triggering operation of the user, and displays the album display interface 320 of the gallery, where the album display interface 320 includes a search box 321 and a plurality of albums. For example, the plurality of albums may include the "all photos" album, the "cameras" album, the "my collection" album, the "screen shots" album, the "my collection" album, the "self-created" album, the "video editing" album, and so forth shown in FIG. 3 a.
In one example, the user may trigger the search box 321 included in the album display interface 320 and input the search text "men standing outdoors" in the search box 321. The mobile phone 300 converts "men standing outdoors" input by the user into a text semantic vector and performs search matching between the text semantic vector and the indexes of videos and pictures, where the index of a video includes the visual semantic vector of a representative frame and the index of a picture includes the visual semantic vector of the picture. The vector similarity between the text semantic vector and each visual semantic vector can thus be calculated, and the pictures/videos whose visual semantic vectors have a vector similarity exceeding the vector similarity threshold are taken as the first search results. The mobile phone 300 may then display a first search result display interface 330 of the gallery, which displays the first search results corresponding to "men standing outdoors" and may include both pictures and videos.
The index of a video in the first search results includes the visual semantic vector of a representative frame, which, as described above, can be used to represent the corresponding video segment; the visual semantics of the representative frame thus represent the visual semantics of the plurality of video frames in that segment, that is, the visual semantic vector of the representative frame is sufficiently associated with the video segment. When a user inputs a complex search text based on the picture content displayed by a video, the application performs search matching between the text semantic vector corresponding to the search text and the visual semantic vector of the representative frame, which realizes fusion interaction between the search text and the picture content displayed by the video segment, improves the accuracy of video search, and further improves the user experience.
In some embodiments, the first search results displayed by the handset 300 may be ranked in order of high to low relevance to the search text entered by the user.
As shown in fig. 3a, the first search result display interface 330 displayed by the mobile phone 300 may include a first display area 331, a second display area 332, and a third display area 333. The first display area 331 presents all the first search results, including pictures and videos. The second display area 332 shows the best-matching video search result, displays the number "4" of all video search results, and shows an enlarged view of the point in time corresponding to that best-matching video result. The third display area 333 shows the best-matching picture search result and displays the number "6" of all picture search results.
As shown in fig. 3a, the video search results on the first search result display interface 330 display thumbnails and points in time. In some embodiments, the thumbnail corresponds to a video frame of a video segment whose representative frame's visual semantic vector matches the text semantic vector corresponding to the search text input by the user, i.e., the vector similarity between the two exceeds the vector similarity threshold. The point in time of the video may be a point in time of the video segment to which the video frame corresponding to the thumbnail belongs; for example, it may be the point in time of the video frame corresponding to the thumbnail, or the point in time of another video frame in the video segment.
Taking the video segment a of the video search result "video a" in the first display area 331 as an example, the "video a" is divided into a plurality of video segments, and the thumbnail displayed by the "video a" corresponds to the start frame of the video segment a, and the vector similarity between the visual semantic vector of the representative frame of the video segment a and the text semantic vector of the "men standing outdoors" exceeds the vector similarity threshold.
The second display area 332 includes two points in time: the first point in time, "02:18", indicates that the start frame of video segment a is displayed at "02:18", and the second point in time indicates that the total duration of "video a" is "08:32". As shown in fig. 3a, if "video a" is the video the user wants, the user may trigger "video a"; in response to this triggering operation, the mobile phone 300 displays a video playing interface 340 in which it adjusts the playback progress of "video a" to "02:18" and starts playing, that is, playback starts from the start frame of video segment a in "video a".
It should be noted that, the thumbnail and the time point shown in the above "video a" are only examples.
For example, the thumbnail may correspond to a representative frame of video segment a in "video a", which may be a start frame, an end frame, an intermediate frame, an optimal frame, or any one of the video frames of video segment a, etc. The start frame of video segment a may also be used as a thumbnail, as the application is not limited in this regard.
For example, the first point in time may be the same as the point in time of the video frame corresponding to the thumbnail: if the thumbnail corresponds to the representative frame of video segment a, the first point in time may be the point in time of that representative frame. The first point in time may also differ from the point in time of the frame corresponding to the thumbnail: if the thumbnail is the representative frame of video segment a and that representative frame is the middle frame of video segment a, the first point in time may instead be the point in time of the start frame of video segment a.
It should be noted that, the above manner of displaying the search box 321 through the album display interface 320 for the user to input the search text is only an example, and the search box may also be displayed through other interfaces of the gallery.
In some embodiments, as shown in fig. 3a, album display interface 320 has a "photo" control, a "time point" control, and an "authoring" control at the bottom for entering the photo display interface, the time point display interface, and the authoring display interface, respectively. The photo display interface, the time point display interface and the creation display interface all comprise search boxes, the search boxes of the three interfaces and the search box 321 of the album display interface 320 support the same search function, all support searching in pictures and videos included in a gallery based on search text input by a user, and after the searching is completed, the first search result display interface 330 is also displayed, and the included first search results are the same.
It will be appreciated that if the vector similarity between the visual semantic vector representing a frame of a video segment and the text semantic vector of a certain text exceeds a certain threshold, the video segment may be searched when the user enters the text.
In some embodiments, a video includes multiple video segments, and search text entered by a user may match an index of one or more video segments of the video.
For example, the search text "men standing outdoors" entered by the user may match video segment a, video segment b, and video segment c in "video A". Then video segment a, video segment b, and video segment c of "video A" may be displayed simultaneously. The thumbnail shown for video segment b of "video A" may be the representative frame of video segment b, and the thumbnail shown for video segment c of "video A" may be the start frame of video segment c. As shown in the first display area 331 of the first search result display interface 330 in fig. 3a, video segment a, video segment b, and video segment c of "video A" can be found for the search text "men standing outdoors" entered by the user. Video segment a, video segment b, and video segment c are ordered according to their degree of matching with the search text. The vector similarity between the visual semantic vector of the representative frame of each of the three video segments and the text semantic vector of "men standing outdoors" exceeds the vector similarity threshold.
In other embodiments, different search texts entered by the user may all find the same video segment. This is described next in connection with fig. 3b.
As shown in fig. 3b (1), the first search result display interface 350 may include a first display area 351 and a second display area 352. Wherein all video search results are presented in the first display region 351. All of the picture search results are presented in the second display area 352, and the photographing time and photographing place of the image may be displayed to the right of each picture search result.
Assuming that the user inputs "a boy standing on the river side" in the input box, the first display region 351 may display the video search results matching it, including video segment B of "video B". The first display area 351 includes a thumbnail of the representative frame of video segment B and two time points: the first time point is the time "05:48" corresponding to the start frame of video segment B, and the second time point represents the total duration "14:18" of "video B". If the user triggers the thumbnail of the representative frame of video segment B, the mobile phone 300 can display the video playing interface 360, in which the mobile phone 300 adjusts the playback progress of "video B" to "05:48" and starts playing, that is, the mobile phone 300 starts displaying from the start frame of video segment B in "video B".
Alternatively, the first time point may also be a time point of a representative frame of the video segment B of the video B, which is not limited herein.
(1) in fig. 3b is merely an example, and the first display area 351 may display other information of "video B", such as the shooting location, shooting time, person name, person relationship, and the like corresponding to "video B". The second display area 352 may also display other information of the picture search results, such as the person names, person relationships, and the like corresponding to the images.
It should be noted that, after the user triggers the thumbnail of the representative frame of video segment B, the mobile phone 300 adjusts the playback progress of "video B" to the displayed first time point and starts playing. It should be understood that, when the first time point is the time point of the representative frame of video segment B of "video B", if the middle frame of video segment B is the representative frame, the playback progress may be adjusted to the time point corresponding to the middle frame to start playing; the time point corresponding to the middle frame is "07:09", which is later than the time "05:48" corresponding to the start frame.
As shown in (2) of fig. 3b, the first search result display interface 370 may include a first display area 371 and a second display area 372. Wherein all picture search results are presented in the first display area 371. All video search results are presented in the second display area 372.
Assuming the user enters "a boy standing near the bridge" in the input box, the second display area 372 may display the video search results that match it, including video segment c of "video B" and video segment B of "video B". Considering the two different search texts "a boy standing on the river side" and "a boy standing near the bridge" entered by the user, as shown in (1) of fig. 3b, both can find video segment B of "video B". This indicates that the vector similarity between the visual semantic vector of the representative frame of video segment B of "video B" and the text semantic vector of "a boy standing on the river side", as well as the text semantic vector of "a boy standing near the bridge", exceeds the vector similarity threshold. The display form of video segment B of "video B" can be referred to the description of (1) in fig. 3b, and will not be repeated here.
The second display area 372 includes a thumbnail of the representative frame of video segment c and two time points: the first time point is the time "08:32" corresponding to the representative frame of video segment c, and the second time point represents the total duration "14:18" of "video B". If the user triggers the thumbnail of the representative frame of video segment c, the mobile phone 300 can display the video playing interface 380, in which the mobile phone 300 adjusts the playback progress of "video B" to "07:26" and starts playing, that is, the mobile phone 300 starts playing from the start frame of video segment c in "video B". The display forms of the video search results and the picture search results in (2) of fig. 3b can be referred to the descriptions of fig. 3a and (1) of fig. 3b, and will not be repeated here.
It should be appreciated that the recommendation information may be displayed within a search box of a gallery application, which may facilitate a user to obtain relevant information for pictures/videos stored in the gallery.
In some embodiments, the recommendation information displayed within the search box of the gallery is a sentence with natural semantics. The user can input text in the search box according to the content and format of the recommendation information, that is, adopt a sentence with natural semantics as the search text entered in the search box. The electronic device searches the pictures/videos in the gallery according to the search text with natural semantics, so the pictures/videos required by the user can be searched more accurately.
In one implementation, an electronic device may obtain an attribute tag for a picture/video. In one example, the attribute tags include at least one of time, place, category tag, event.
In some embodiments, the "time" may be obtained from the time of the picture/video capture. By way of example, "time" may include "2023 11, 9, tuesday", "2021, 7, 2, friday", and so forth.
The "location" may be acquired from the shooting location of the picture/video. For example, can be obtained from GPS positioning information at the time of video capture. Illustratively, a "place" may include a city, a sight, etc.
The "class labels" and "events" may be obtained by semantic analysis of the pictures/video. In one example, an electronic device may employ a computer vision service to semantically analyze video frames in a picture or video, generating content of "class labels" and "events" from its semantics.
By way of example, a "class label" may include characters, plants or animals, etc., but also buildings, natural scenes, etc., and may include street art, musical instruments, art exhibitions, athletic contests, birthdays, etc. For example, the persona may include a persona name, a professional name, a persona age group name, etc.
By way of example, an "event" may include a game, sport, tour, etc.
It should be appreciated that for a picture/video, its corresponding attribute tags may include one or more of time, place, category tag, and event. For example, for a picture downloaded from a network, the electronic device does not acquire its shooting time, so its attribute tags do not include "time", or the content of the "time" tag is empty.
In one implementation, one or more attribute tags of the picture/video are spliced according to a preset rule, and at least one piece of combined information can be generated. Illustratively, the electronic device concatenates the four attributes "time", "place", "category label" and "event" of the video, and may generate a piece of combined information. Also for example, the electronic device may also generate a piece of combined information from an attribute tag of the picture/video. For example, the attribute tag is "time".
In one implementation, the fixed splice word is spliced with a piece of combination information, so that recommended content corresponding to the picture/video can be generated. Illustratively, the fixed splice word may include "try-search", "on", "shot", "video", and the like.
Table 1 shows some examples of recommended content generated from different numbers of attribute tags. When the recommended content is generated, the content of an attribute tag may be mapped correspondingly. For example, the specific time "Monday, October 10, 2023" may be mapped to "today", "the day before yesterday", "August", "last year", "this year", "National Day holiday", or the like.
TABLE 1
As shown in table 1, the attribute tags of one video may generate a plurality of spliced contents according to the splicing combination manner shown in table 1. For example, if the attribute tags of the video include "category tags", "time", "place", 5 different pieces of spliced contents may be generated in a spliced combination manner of sequence numbers 1 to 5 in table 1, respectively.
In one implementation, the electronic device may determine any one of the plurality of spliced content as the recommended content corresponding to the picture/video.
In another implementation manner, the above splicing combination manners each correspond to a priority, and the priorities of the splicing combination manners decrease in order from front to back in table 1. The electronic device generates the recommended content corresponding to the picture/video according to the highest-priority splicing combination manner among those supported by the picture/video. For example, if the attribute tags of the video include "category tag", "time" and "place", the fixed splice word is spliced with "category tag", "time" and "place" according to the splicing combination manner of sequence number 1 in table 1, so as to generate the recommended content corresponding to the picture/video.
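To make the priority-based splicing concrete, the following sketch shows one possible way to pick the highest-priority splicing combination supported by a picture/video's attribute tags and fill in the fixed splice words. It is a minimal illustration only: the template strings, their ordering, and the tag names are hypothetical stand-ins, since the exact combinations of table 1 are not reproduced here.

```python
# Hypothetical splicing templates, listed from highest to lowest priority.
# Each entry names the attribute tags it requires (stand-ins for table 1).
from typing import Optional

TEMPLATES = [
    (("category", "time", "place"), "Try searching for the video of {category} shot at {place} {time}"),
    (("category", "time"),          "Try searching for the video of {category} shot {time}"),
    (("category", "place"),         "Try searching for the video of {category} shot at {place}"),
    (("time", "place"),             "Try searching for the video shot at {place} {time}"),
    (("time",),                     "Try searching for the video shot {time}"),
]

def build_recommendation(tags: dict) -> Optional[str]:
    """Pick the highest-priority template whose required attribute tags are all present."""
    for required, template in TEMPLATES:
        if all(tags.get(key) for key in required):
            return template.format(**{key: tags[key] for key in required})
    return None  # no supported splicing combination

# Example: a video whose attribute tags include a category tag, a mapped time, and a place.
print(build_recommendation({"category": "Zhang San", "time": "last year", "place": "the riverside"}))
# -> "Try searching for the video of Zhang San shot at the riverside last year"
```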
In the video searching method provided by the embodiment of the application, the recommendation information in the search box can be generated according to the recommended content corresponding to one picture/video in the gallery.
In one implementation, the electronic device may select the recommended content corresponding to a first picture/video in the gallery to generate the recommendation information within the search box. In one example, the first picture/video may be one acquired within a preset time period, for example, a period more than one month before today. In one example, the electronic device updates the first picture/video once a day, and the selected first picture/video is not repeated within a preset time period (e.g., a week).
For example, on the first day, the recommendation information displayed in the search box of the gallery of the electronic device is "Try searching for the video shot at the riverside three days ago"; on the second day, the recommendation information displayed in the search box is "Try searching for pictures from last August". The first picture/video selected by the electronic device every day is not repeated within the preset time period (such as a week), and correspondingly, the recommendation information displayed in the search box of the gallery is different every day.
Based on the above example of the scenario shown in fig. 3a, taking the electronic device as the mobile phone 300 as an example, assume that the user shoots a plurality of videos through the mobile phone 300 during a weekend trip and stores them in the gallery of the mobile phone 300. The mobile phone can generate the recommendation information in the search box according to the recommended content corresponding to one of these videos.
As shown in fig. 3c, the mobile phone 300 displays an album display interface 390, the album display interface 390 including a search box 321. In one example, the recommendation information "Try searching for Zhang San's video at the riverside from last year" is displayed in the search box 321. The recommendation information serves as an example of the search text to be entered by the user. The user can input the search text in the search box according to the content and format of the recommendation information to perform a search.
As shown in fig. 3c, after the user clicks the search box 321, the mobile phone 300 displays a search interface 301, the search interface 301 including the search box 321 in which the user can input text. When the mobile phone 300 detects the user's operation of inputting text in the search box 321, it can search the gallery according to the search text in the search box 321.
For example, the recommendation information "Try searching for Zhang San's video at the riverside from last year" is displayed in the search box 321. "Zhang San's video at the riverside from last year" is a sentence with natural semantics; compared with single labels such as "last year", "riverside" or "Zhang San", it makes it easier to find the video that the user actually needs. For example, 7230 pictures/videos in the gallery correspond to the label "last year", 1985 pictures/videos correspond to the label "riverside", and 1985 pictures/videos correspond to the label "Zhang San"; searching by a single label therefore returns a large number of pictures/videos. Searching by "Zhang San's video at the riverside from last year" can significantly reduce the number of pictures/videos returned. Therefore, displaying a sentence with natural semantics as the recommendation information provides the user with a convenient and fast search experience.
Also, for example, the video searching method provided by the embodiment of the application can be implemented in a global searching scene of the electronic device.
Referring to fig. 4a, the video searching method provided by the embodiment of the present application is described below, taking the mobile phone 300 as an example of the electronic device. In this scenario, the gallery of the mobile phone 300 may include a plurality of pictures and a plurality of videos. For the sources of the pictures/videos, reference may be made to the above examples, which are not described in detail here.
Referring to fig. 4a, based on the example of the scenario shown in fig. 3a, it is still assumed that the user has shot a plurality of videos through the mobile phone 300 during a weekend trip and stored them in the gallery of the mobile phone 300. As shown in fig. 4a, the mobile phone 300 displays a main interface 410; the user can slide the main interface 410 to the right, and the mobile phone 300 then displays a negative one-screen interface 420. The negative one-screen interface 420 provides a global search function, which can provide the user with rich online search services as well as search services for the local resources of the mobile phone (i.e., local files stored in the mobile phone 300).
The negative one-screen interface 420 includes a search box 421. The user may input a search text "men standing outdoors" in the search box 421, and the mobile phone 300 may perform an online search based on the search text input by the user, and perform a local resource search in a local file included in the mobile phone 300. When the search is completed, the mobile phone 300 displays a second search result display interface 430, and the second search result display interface 430 may display a second search result corresponding to the inputted "men standing outdoors".
As shown in fig. 4a, the second search result display interface 430 may include a first display area 431, a second display area 432, and a third display area 433. In some embodiments, the first display area 431 includes second search results of an online search, the second display area 432 includes second search results of an in-application search, and the third display area 433 includes second search results of a local file.
In some embodiments, the second search results searched in the application include the second search results searched in a gallery application. Illustratively, the second display area 432 displays content that is displayed in a partial area of the first search result display interface 330 of FIG. 3 a. The second display area 432 includes second search results that are pictures and videos stored in a gallery. Herein, the thumbnail and the time point shown in the video may be referred to the above examples, and are not described herein.
Note that, the content displayed in the second display area 432 is only an example, and the content displayed in other areas of the first search result display interface 330 in fig. 3a may also be displayed.
It should be noted that the above manner of displaying the search box 421 through the negative one-screen interface 420 for the user to input the search text for performing the global search is merely an example. In some embodiments, the user may also pull down the main interface of the mobile phone 300, and the mobile phone 300 displays a main menu interface, where the main menu interface includes a search box, and may also provide a global search function, and the displayed search result is the same as the second search result.
Also exemplary, the video searching method provided by the embodiment of the application can be implemented in a searching scene of a video playing application of the electronic device.
In this scenario, the video searching method is implemented through interaction between the electronic device and a cloud server.
Still taking the electronic device as the mobile phone 300 as an example, in this scenario, the mobile phone 300 includes a video playing application.
As shown in fig. 4b, the mobile phone 300 displays a main interface 440, and the main interface 440 includes application icons of a plurality of application programs, such as an application icon 441 of a video playing application. The user triggers the application icon 441 of the video playing application, and the mobile phone 300 displays a first interface 450 of the video playing application, the first interface 450 comprising a search box 451 and a search control 452. The user may enter the search text "documentary about city B" in the search box 451 and trigger the search control 452. In response to the user's triggering operation on the search control 452, the mobile phone 300 can search among a plurality of videos on the cloud server based on the search text input by the user. After the search is completed, the mobile phone 300 displays a third search result display interface 460 of the video playing application. The third search result display interface 460 includes a plurality of third search results, which are videos; for the thumbnails and the time points displayed by the third search results, reference may be made to the above examples, which are not repeated here.
In some embodiments, the third search result display interface 460 may also display the release time of a third search result, that is, the time when the third search result was stored in the cloud server. As shown in fig. 4b, taking the "video C" included in the third search result display interface 460 as an example, the third search result display interface 460 also displays the release time "2018-05-03" of "video C", which indicates that "video C" was stored in the cloud server on May 3, 2018. Illustratively, the user may have uploaded "video C" to the video playing application on May 3, 2018, causing the cloud server to store "video C".
In some embodiments, as shown in fig. 4b, the third search result display interface 460 includes "video C" with a first time point of "05:48". If "video C" is the video required by the user, the user can trigger "video C", and the mobile phone 300 can interact with the cloud server to play "video C" from "05:48".
The video searching method provided by the application is described in detail below with reference to fig. 5 a. As shown in fig. 5a, the video search method includes an index construction stage and a search stage.
In the index building stage, in some embodiments, an index may be built for each video frame of the video.
In some embodiments, if the solution is implemented on a terminal device and an index is built for each video frame of the video, matching needs to be performed frame by frame in the search stage, which increases the time delay required for searching and degrades the user experience. Alternatively, in some embodiments, the video may be segmented and the video index built in units of segments.
Illustratively, the video is first de-framed to obtain a plurality of video frames. The video is then segmented based on these video frames using a video segmentation algorithm, yielding a plurality of video segments such as video segment 1 and video segment 2. Next, a representative frame is selected from the video frames of each video segment: the video frames in the segment are each scored, and the video frame with the highest score is determined as the representative frame of the segment. The representative frame is input into the image encoder of the CLIP model, which outputs the visual semantic vector of the representative frame. An index is constructed based on the visual semantic vector corresponding to the representative frame, and an inverted index library is obtained from the constructed indexes.
It should be noted that the above selection manner of the representative frame is only an example; the start frame, an intermediate frame, the end frame, or a random video frame of the video segment may also be selected as the representative frame, as described in the following embodiments. In the search stage, the user inputs a search text on the search interface. The search text is input into the text encoder of the CLIP model, which outputs the text semantic vector corresponding to the search text, and a vector recall is performed from the inverted index library based on this text semantic vector. The search text is also sent to the natural language understanding module, which performs entity recognition on the search text to obtain the entities it contains, and an entity recall is performed from the inverted index library based on these entities. The inverted index library returns the vector recall results and the entity recall results, the recall results are sorted to obtain the search results, and the search results are displayed on the search interface.
In some embodiments, the image encoder of the CLIP model and the text encoder of the CLIP model may be trained based on a contrast learning training approach.
Taking the mobile phone 300 as an example of the electronic device, the video searching method provided in the embodiment of the present application is described in detail below with reference to fig. 5b, in combination with the gallery service module, the search module, the multi-modal understanding module and the natural language understanding module of the system gallery shown in fig. 2b.
As shown in fig. 5b, the video searching method provided by the embodiment of the present application may be divided into two stages, namely, an index construction stage and a searching stage.
First, the steps involved in the index build phase are described in detail in connection with FIG. 5 b.
S501, a gallery service module 510 receives a new operation or a modification operation of a user on a video.
The new operation on a video refers to an operation by which the user stores the video in the gallery application. For example, the user's new operation on the video may be an operation in which the user shoots the video with the mobile phone 300, downloads the video, or records the screen of the mobile phone 300. The modification operation on a video refers to an operation by which the user modifies a video already stored in the gallery application. For example, the user's modification operation on the video may be an operation of cropping or splicing the video, or adding special effects or subtitles to it.
In some embodiments, the gallery service module 510 receiving the user's new or modification operation on the video includes the gallery service module 510 receiving the user's new or modification operation on the attribute tags of the video.
By way of example, the attribute tags of the video may include a video acquisition location (e.g., a location where the video was captured, a download source of the video, etc.), a video acquisition time (e.g., a capture time, a download time, or a recording screen time), a name or category tag for a person newly added or modified by the user for the video, an event, and so forth. The classification label may be used to indicate the kind of object presented by the video, and may be, for example, a person, a plant, an animal, a building, or a natural scene. Events may be used to indicate what the video exposed object does. By way of example, the event may be a game, a sport, or the like.
In some embodiments, the classification labels may be manually configured by a user or may be automatically classified by the electronic device.
S502, a gallery service module 510 stores video.
The gallery service module 510 stores the video to the handset 300 in response to a user's add-on or modify operation to the video. Illustratively, the video may be stored in a local file of the handset 300, which the user may view through a variety of approaches, such as a local folder of the handset 300, a gallery application, and so forth. Also, for example, the video may be stored to the cloud for backup under the authorized operation of the user, so as to reduce the memory pressure of the mobile phone 300.
In some embodiments, gallery service module 510 may store the attribute tags of the video in response to a user's new or modified operation on the attribute tags of the video.
S503, the gallery service module 510 invokes the multimodal understanding module 530 to determine a representative frame of the video.
It will be appreciated that a video typically comprises a plurality of video frames. If a corresponding visual semantic vector were determined and stored as an index for each video frame, one video would correspond to a large number of indexes, wasting a large amount of computing resources and storage space. Meanwhile, noise information may exist among a large number of indexes, making search matching time-consuming and easily affecting the video search results.
Therefore, in the embodiment of the present application, the multi-modal understanding module 530 is used to segment the video and determine a corresponding representative frame from each video segment, so that in subsequent steps a visual semantic vector is determined only for each representative frame and stored as an index, which greatly reduces the number of indexes corresponding to the video. In this way, the representative frames corresponding to the video segments can still fully represent the video semantics of the video, while saving cost.
In some embodiments, gallery service module 510 may invoke the computer vision service provided by multimodal understanding module 530 to determine representative frames of the video. The computer vision service refers to the multi-modal understanding module 530 performing the video semantic understanding on the video, and the multi-modal understanding module 530 determines the representative frame of the video.
Video semantic understanding refers to enabling a mobile phone to understand what is expressed in the content shown in a video, such as understanding information about the type, number, location, relationship between objects in the video, and the like.
In some embodiments, computer vision services provide services for de-framing video, segmenting video using video segmentation algorithms, and determining representative frames of video segments. The determination of the representative frames of the video can be subdivided mainly into the following steps 1-3:
step 1, the multimodal understanding module 530 uses the computer vision service to frame the video.
It will be appreciated that the video is made up of a plurality of video frames, each video frame being a still picture in the video, i.e. each video frame may be an image. The de-framing process refers to the multi-modal understanding module 530 utilizing computer vision services to break up video into individual video frames.
In some embodiments, during the de-framing process of the video by the multimodal understanding module 530, each resulting video frame may be marked with a frame identification and a point in time. One frame identification is used to uniquely tag one video frame, and different video frames can be distinguished based on the frame identification. The time point refers to the time when the video frame appears in the video.
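As a concrete illustration of the de-framing step, the sketch below extracts every video frame together with a frame identification and a time point. It assumes OpenCV is available; the patent does not name a specific library, and the function name and path here are illustrative.

```python
import cv2  # pip install opencv-python

def deframe(video_path: str):
    """Yield (frame_id, time_point_ms, frame_image) for every frame in the video."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if the frame rate is unavailable
    frame_id = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        time_point_ms = frame_id * 1000.0 / fps  # time at which this frame appears in the video
        yield frame_id, time_point_ms, frame
        frame_id += 1
    capture.release()

# Example usage (hypothetical path):
# for fid, ts, img in deframe("/sdcard/DCIM/video_D.mp4"):
#     print(fid, ts, img.shape)
```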
In some embodiments, the category labels for each video frame may be identified during the de-framing process of the video with the computer vision service by the multimodal understanding module 530. The class label may be used to indicate the category to which the video frame presentation object belongs. Illustratively, the category labels may be characters, plants, animals, or the like presented in the video frames. Also by way of example, the category labels may be buildings, natural scenes, etc. presented in the video frames.
The number of classification tags for the video frame is not limited. For example, as shown in fig. 3a, the category labels of the thumbnail images of "video a" included in the first search result display interface 330 may include "boy", "star", "sky", and "tree", etc.
Step2, the multi-modal understanding module 530 segments the video using the computer vision service.
Segmentation processing refers to dividing a video into a plurality of video segments using a video segmentation algorithm provided by a computer vision service.
It will be appreciated that when a video is played, the content presented varies continuously as the video frames are played in sequence, but to different extents. Thus, similar video frames (i.e., those with a small degree of variation) can be grouped into one video segment.
In some embodiments, the degree of change of one video frame and its adjacent video frames can be measured based on the classification label of the video frame and the image parameters of the video frame, so as to segment the video. Illustratively, the image parameters of the video frame may include jitter, sharpness, pixel values, etc. of the video frame, which the present application is not limited to.
The jitter of the video frame refers to the phenomenon that the content displayed by the video frame is jittered or rocked in the video playing process. For example, when a user holds the mobile phone 300 to take a photograph, a situation in which shake is noticeable may occur when another scene is desired to be photographed by the mobile phone 300. The definition of a video frame refers to the definition of each detail shadow and its boundary in the video frame. The pixels of the video frame may represent the brightness of the video frame.
It can be understood that, compared with its adjacent video frames, the jitter degree of a video frame, the degree of change of its other image parameters, and the degree of change of its classification labels are positively correlated with the degree of change of the content displayed by the video; that is, the larger the jitter value and the more obvious the changes in the image parameters and classification labels, the more obvious the change in the content displayed by the video.
In some embodiments, the video segmentation algorithm may be expressed by the following formula 1: the segmentation score of a video frame is calculated based on the classification labels and the image parameters of the video frame in combination with formula 1, and whether the video frame is determined as the start frame or the end frame of a video segment is decided according to its segmentation score. Formula 1 is shown below:
y = α × frameA + β × frameB + γ × frameY + δ × frameT
Where y represents the segmentation score of the video frame, frameA represents the jitter score of the video frame, frameB represents the sharpness change score of the video frame, frameY represents the label change score of the video frame, frameT represents the pixel change score of the video frame, and α, β, γ, and δ represent the coefficients of frameA、frameB、frameY and frameT, respectively. In some embodiments, α, β, γ, and δ may be values that are manually preset according to the degree of influence of the jitter, sharpness, label, and pixel variations of the video frame on the segmentation score of the video frame.
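A direct transcription of formula 1 is shown below. The coefficient values of α, β, γ and δ are illustrative placeholders, since the patent states only that they are preset manually according to the influence of each factor.

```python
def segmentation_score(frame_a: float, frame_b: float, frame_y: float, frame_t: float,
                       alpha: float = 0.3, beta: float = 0.3,
                       gamma: float = 0.2, delta: float = 0.2) -> float:
    """Formula 1: y = alpha*frameA + beta*frameB + gamma*frameY + delta*frameT."""
    return alpha * frame_a + beta * frame_b + gamma * frame_y + delta * frame_t

# Example: a frame with noticeable jitter and an obvious label change.
y = segmentation_score(frame_a=0.8, frame_b=0.4, frame_y=0.9, frame_t=0.5)
print(round(y, 2))  # 0.64
```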
It should be noted that the above process of determining the segmentation score of a video frame based on the classification tag and the plurality of image parameters is merely an example. The segmentation score for a video frame may also be determined based on the classification tag and one or more of a plurality of image parameters, as the application is not limited in this respect.
In some embodiments, the jitter degree of a plurality of video frames in the video may be detected based on an optical flow method using image displacement, a feature point matching method, a video jitter detection method based on image gray-level distribution features, or the like. The jitter score of a video frame is then determined based on a preset correspondence between jitter degree ranges and jitter scores.
In some embodiments, a sharpness detection tool may be used to determine the sharpness of a plurality of video frames in the video, and the sharpness of a video frame may then be compared with the sharpness of its adjacent video frames to determine a sharpness change value for the video frame. The sharpness change score of the video frame is then determined based on a preset correspondence between sharpness change value ranges and sharpness change scores. Illustratively, such a detection tool may be open-source software such as FFmpeg or Video Quality Measurement Tool.
In some embodiments, the classification labels of a video frame may be compared with the classification labels of its adjacent video frames, and the label change of the video frame is determined based on the change in the number of classification labels and the change in the content of the classification labels. For example, the union and the intersection of the classification labels of a video frame and the classification labels of its adjacent video frame may be calculated; the number of labels in the union reflects the total number of classification labels of the two video frames, and the number of labels in the intersection reflects how much label content they share. When the number of classification labels in the union or in the intersection changes, the label change score of the video frame can be determined according to a preset correspondence between label count change ranges and label change scores.
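The union/intersection comparison above can be sketched as follows. The mapping from the changed-label ratio to a label change score is an assumed preset correspondence; the ranges used here are illustrative.

```python
def label_change_score(labels: set, prev_labels: set) -> float:
    union = labels | prev_labels          # all distinct labels across the two frames
    intersection = labels & prev_labels   # labels shared by the two frames
    if not union:
        return 0.0
    # Fraction of labels that changed between the two frames.
    changed_ratio = 1.0 - len(intersection) / len(union)
    # Preset correspondence between the change range and the score (illustrative values).
    if changed_ratio >= 0.6:
        return 1.0
    if changed_ratio >= 0.3:
        return 0.5
    return 0.0

print(label_change_score({"boy", "sky", "tree"}, {"boy", "tree", "bridge"}))  # 0.5
```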
In some embodiments, a pixel detection tool may be used to determine the pixels of a plurality of video frames in the video, and the pixels of a video frame may then be compared with the pixels of its adjacent video frames to determine a pixel change value for the video frame. The pixel change score of the video frame is then determined based on a preset correspondence between pixel change value ranges and pixel change scores. Illustratively, the pixel detection tool may be a plug-in such as PixelStick, MeasureIt or Guides.
In some embodiments, the degree of change in the content exhibited by two adjacent video frames in the video may be positively correlated with the magnitude of the segmentation score of the video frame, i.e., a greater segmentation score of a video frame indicates a more pronounced degree of change in the content exhibited by the video frame as compared to the adjacent video frame.
For example, a video frame may be compared with its previous video frame, and a segmentation score threshold may be preset; a video frame whose segmentation score exceeds the threshold is determined to be the start frame of a video segment. As shown in formula 1 above, the stronger the jitter of the video frame, the higher the value of frameA; the more obvious the sharpness change of the video frame compared with the previous video frame, the higher the value of frameB; the more obvious the label change compared with the previous video frame, the higher the value of frameY; and the larger the pixel change compared with the previous video frame, the higher the value of frameT.
Also for example, a video frame may be compared to its next video frame and a segment score threshold may be preset, with video frames with segment scores exceeding the segment score threshold being determined to be end frames of a video segment.
In some embodiments, the degree of variation of the content presented by two adjacent video frames in the video may also be inversely related to the size of the segmentation score of the video frame, i.e. the smaller the segmentation score of a video frame is, the more pronounced the degree of variation of the content presented by the video frame compared to the adjacent video frame is, i.e. the lower the segmentation score of a video frame is, the greater the likelihood that it is determined as the start frame of a video segment or as the end frame of a video segment.
For example, a video frame may be compared with its previous video frame, and a segmentation score threshold may be preset; a video frame whose segmentation score is below the threshold is determined to be the start frame of a video segment. In this case, the stronger the jitter of the video frame, the lower the value of frameA; the more obvious the sharpness change compared with the previous video frame, the lower the value of frameB; the more obvious the label change compared with the previous video frame, the lower the value of frameY; and the larger the pixel change compared with the previous video frame, the lower the value of frameT.
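Putting the thresholding together, the sketch below splits a sequence of per-frame segmentation scores into video segments using the positive-correlation variant described above (a score above the threshold marks the start frame of a new segment). The threshold value is illustrative.

```python
def split_into_segments(scores: list[float], threshold: float = 0.7) -> list[tuple[int, int]]:
    """Return (start_index, end_index) pairs of video segments, inclusive."""
    starts = [0]  # the first frame always starts the first segment
    for i, score in enumerate(scores[1:], start=1):
        if score > threshold:
            starts.append(i)  # obvious content change: this frame starts a new segment
    ends = [s - 1 for s in starts[1:]] + [len(scores) - 1]
    return list(zip(starts, ends))

# Example: frames 3 and 7 show an obvious content change compared with their previous frames.
print(split_into_segments([0.1, 0.2, 0.1, 0.9, 0.2, 0.3, 0.1, 0.8, 0.2]))
# [(0, 2), (3, 6), (7, 8)]
```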
Step 3, the multi-modal understanding module 530 determines a representative frame corresponding to each video segment using the computer vision service.
After the plurality of video segments are obtained, the multimodal understanding module 530 determines from each video segment a representative frame that can represent the video semantic of the video segment.
In some embodiments, the representative frame may be a start frame, an end frame, an intermediate frame, or a random frame in one video segment.
Illustratively, assuming that a video segment contains 99 video frames, in chronological order the 1st video frame (start frame), the 99th video frame (end frame), or the 50th video frame (intermediate frame) of the 99 video frames may be selected as the representative frame of the video segment, or a video frame may be selected at random.
In some embodiments, the optimal frame may be determined from a video segment as the representative frame based on preset rules associated with the image parameters and class labels of the video frame.
For example, a score for a video frame may be calculated based on its image parameters and classification labels: the more classification labels the video frame has, the higher its score; the lower its jitter degree, the higher its score; the higher its sharpness, the higher its score; and the smaller the pixel change compared with the previous video frame, the higher its score. Finally, the highest-scoring video frame (i.e., the optimal frame) in a video segment may be determined as the representative frame.
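The optimal-frame selection can be sketched as follows. The per-frame fields and the equal weighting of the criteria are assumptions; the patent states only the direction of each criterion (more labels, lower jitter, higher sharpness, and smaller pixel change give a higher score).

```python
from dataclasses import dataclass

@dataclass
class FrameInfo:
    frame_id: int
    num_labels: int       # number of classification labels on the frame
    jitter: float         # 0 (steady) .. 1 (strong jitter)
    sharpness: float      # 0 (blurry) .. 1 (sharp)
    pixel_change: float   # 0 (no change vs. previous frame) .. 1 (large change)

def frame_score(f: FrameInfo) -> float:
    # Higher score for more labels, less jitter, more sharpness, less pixel change.
    return f.num_labels + f.sharpness + (1.0 - f.jitter) + (1.0 - f.pixel_change)

def pick_representative(segment_frames: list[FrameInfo]) -> FrameInfo:
    """The highest-scoring frame in the segment is the representative (optimal) frame."""
    return max(segment_frames, key=frame_score)

segment = [
    FrameInfo(0, num_labels=2, jitter=0.7, sharpness=0.4, pixel_change=0.6),
    FrameInfo(1, num_labels=4, jitter=0.1, sharpness=0.9, pixel_change=0.2),
]
print(pick_representative(segment).frame_id)  # 1
```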
The determination of representative frames in the video is described in detail below in conjunction with fig. 6.
As shown in fig. 6, the video includes 120 video frames, one video frame is compared with the previous video frame, the segmentation scores corresponding to the 120 video frames are calculated according to the above formula 1, and the video frame with the segmentation score exceeding the segmentation score threshold is used as the start frame of one video segment.
In some embodiments, the 120 video frames are divided into 3 video segments, video segment 1, video segment 2, and video segment 3, and the intermediate frames of each video segment are determined to be their corresponding representative frames, video segment 1 comprising 40 video frames, video segment 2 comprising 51 video frames, and video segment 3 comprising 29 video frames. Namely, the 20 th video frame in the video segment 1 is taken as a representative frame corresponding to the video segment 1, the 26 th video frame in the video segment 2 is taken as a representative frame corresponding to the video segment 2, and the 15 th video frame in the video segment 3 is taken as a representative frame corresponding to the video segment 3.
It should be noted that, the number of video frames in the video segment 1 is 40, and is an even number, and the 20 th video frame and the 21 st video frame are intermediate frames, and any one of the intermediate frames may be determined as a representative frame corresponding to the video segment 1, which is not limited in the present application.
S504, the multi-modal understanding module 530 returns the representative frames of the video and their related information to the gallery service module 510.
The related information of the representative frame includes, but is not limited to, a time point corresponding to a start frame, a time point corresponding to an end frame, a time point corresponding to the representative frame, a classification label of the video segment, and the like in the video segment corresponding to the representative frame. The time points corresponding to the start frame, the end frame and the representative frame in the video segment are beneficial to jumping to the corresponding video frame when the video search result is displayed for the user later, so that the user experience is improved. The classification labels of the video segments are beneficial to displaying more accurate video search results when video search is carried out subsequently.
In some embodiments, a classification tag of a video segment may be used to indicate the kind of object presented by the video segment. For example, the classification labels corresponding to the video frames included in the video segment may be combined to obtain the classification labels of the video segment.
In addition, the related information representing the frame may further include a person name and a person relationship corresponding to the video segment. In some embodiments, after the representative frame is determined, the person name and person relationship corresponding to the person in the representative frame may be identified based on the preset person name and person relationship.
S505, the gallery service module 510 stores the representative frames of the video and their related information.
The gallery service module 510 may store the representative frames and their associated information returned by the multimodal understanding module 530 after it receives them. In some embodiments, gallery service module 510 is configured with a database, and gallery service module 510 may store representative frames of video and information related thereto in the database.
S506, the gallery service module 510 invokes the multimodal understanding module 530 to determine visual semantic vectors corresponding to the representative frames.
In some embodiments, the gallery service module 510 invokes the computer vision service provided by the multimodal understanding module 530 to determine the visual semantic vector corresponding to each representative frame. One video includes a plurality of video segments, each corresponding to a representative frame, so one video corresponds to a plurality of visual semantic vectors. The gallery service module 510 invokes the computer vision service provided by the multimodal understanding module 530 to perform video semantic understanding on the video, and thereby determines the visual semantic vectors corresponding to the plurality of representative frames in the video.
In some embodiments, the computer vision service may provide a CLIP model. The representative frame may be input to an image encoder of the CLIP model to obtain a visual semantic vector corresponding to the representative frame.
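For illustration, the sketch below obtains a visual semantic vector from a representative frame with a CLIP image encoder. The patent does not specify a particular implementation; this assumes the open-source CLIP checkpoint "openai/clip-vit-base-patch32" available through the Hugging Face transformers library.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def visual_semantic_vector(representative_frame: Image.Image) -> torch.Tensor:
    inputs = processor(images=representative_frame, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)       # shape: (1, 512) for this checkpoint
    return features / features.norm(dim=-1, keepdim=True)   # L2-normalize for cosine similarity

# Example: vec = visual_semantic_vector(Image.open("representative_frame.jpg"))
```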
In some embodiments, the image encoder of the CLIP model may be trained by image training samples, which may include picture training samples and video frame training samples. The training process of the image encoder of the CLIP model can be seen in the following embodiments.
S507, the multi-modal understanding module 530 returns visual semantic vectors corresponding to the representative frames to the gallery service module 510.
S508, gallery service module 510 stores visual semantic vectors corresponding to the representative frames.
The gallery service module 510 may store the visual semantic vector corresponding to the representative frame returned by the multimodal understanding module 530. Illustratively, if the gallery service module 510 is configured with a database, the gallery service module 510 may store visual semantic vectors corresponding to representative frames into the database.
S509, the gallery service module 510 sends the representative frame of the video and its related information and the visual semantic vector of the representative frame to the search module 520.
S510, the search module 520 constructs an index corresponding to the video.
One index is the index corresponding to one video segment in the video. One video includes multiple video segments, then one video corresponds to multiple indices, and different indices correspond to different video segments in the video.
The search module 520 may construct a visual semantic vector representing a frame, related information representing the frame, and an attribute tag combination of the video segments as an index representing the video segment to which the frame corresponds. As exemplified above, the index of a video segment may include a visual semantic vector of a representative frame of the video segment, a point in time corresponding to a start frame, a point in time corresponding to an end frame, and a point in time corresponding to a representative frame in the video segment, a classification tag of the video segment, and an attribute tag of the video segment (e.g., video acquisition time, video acquisition location, and video storage path), among others.
It should be noted that the content included in the above-mentioned index of the video segment is only an example, and the index of the video segment may include any information of the visual semantic vector representing the frame, the related information representing the frame, and the attribute tag of the video segment. The application is not limited in this regard.
Illustratively, the search module 520 may include an index library, and the search module 520 may store an index corresponding to the video into the index library so that a search match may be subsequently made based on the index library.
It will be appreciated that in practical applications, the number of videos stored in the mobile phone 300 may be very large, and one video may correspond to a plurality of indexes, which indicates that the index library may contain a large number of indexes. In some embodiments, to increase search efficiency, searches may be performed using inverted indexes.
For example, after the search module 520 constructs an index corresponding to the video, vector clustering is performed on visual semantic vectors included in the index library to obtain a plurality of clusters, that is, a vector space corresponding to all indexes is divided into a plurality of vector areas, one vector area includes one cluster, each vector area includes a plurality of indexes with higher vector similarity, and each vector area can be replaced by one cluster center point. For example, K-means or hierarchical clustering methods may be used for clustering, and the present application is not limited thereto.
Therefore, when the mobile phone 300 performs search matching between the text semantic vector of the search text and the index library, it can first match against the cluster center points and then against the visual semantic vectors in the vector region of the determined cluster center point, without matching against all indexes in the index library. This saves computing resources, greatly reduces the time consumed by search matching, avoids search delay, improves the search efficiency for videos, and further improves the user experience.
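The clustering step can be sketched as follows, assuming the visual semantic vectors of all indexes are stacked into a NumPy array. K-means is one of the clustering methods the patent mentions; scikit-learn is used here purely for illustration, and the number of regions is a placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vector_regions(index_vectors: np.ndarray, num_regions: int = 16):
    """Cluster index vectors into vector regions; each region is represented by its center point."""
    kmeans = KMeans(n_clusters=num_regions, n_init=10, random_state=0)
    region_of_index = kmeans.fit_predict(index_vectors)   # region id of every index
    centers = kmeans.cluster_centers_                     # one cluster center point per region
    # Group index ids by region so that only one region needs to be matched at search time.
    regions = {r: np.where(region_of_index == r)[0] for r in range(num_regions)}
    return centers, regions

# Example with random vectors standing in for the visual semantic vectors of representative frames:
vectors = np.random.rand(1000, 512).astype(np.float32)
centers, regions = build_vector_regions(vectors)
```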
In some embodiments, the search module 520 may store the received representative frame and related information thereof, and the visual semantic vector of the representative frame, so as to facilitate searching when constructing the index, thereby improving the construction speed of the index.
For example, the visual semantic vector of the representative frame corresponding to one video segment and the related information of the representative frame may be stored into the document. Taking video segment 1 as an example, the visual semantic vector of the representative frame and the related information of the representative frame corresponding to video segment 1 may be stored into sub-document 1. The information stored in sub-document 1 can be seen as shown in table 2 below:
TABLE 2
| Information name | Information content |
| segments.media-vector | [bd de d1 b4 3c 8c 9c] |
| segments.startTime | 0 |
| segments.endTime | 91666 |
| segments.startFrame | 0 |
| segments.endFrame | 2750 |
| segments.tag-name | Character/landscape/building |
As shown in table 2, segments.media-vector represents the visual semantic vector of the representative frame in video segment 1. In practice, vectors are typically composed of arrays of floating point numbers. In the embodiment of the application, hexadecimal serialization is performed on the visual semantic vector to obtain the visual semantic vector in the form of [bd de d1 b4 3c 8c 9c]; hexadecimal floating point numbers allow the mobile phone 300 to store the vector in less storage space, thereby saving storage space. segments.startTime represents the start time of video segment 1, 0 ms (i.e., the time point corresponding to the start frame in video segment 1). segments.endTime represents the end time of video segment 1, 91666 ms (i.e., the time point corresponding to the end frame in video segment 1). segments.startFrame and segments.endFrame indicate that video segment 1 contains a total of 2750 frames from start to end. segments.tag-name represents the classification labels of video segment 1, including "character", "landscape" and "building".
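One possible form of the hexadecimal serialization is sketched below. The exact byte layout used by the gallery is not specified; this simply packs float32 values little-endian and renders the bytes as space-separated hexadecimal pairs, which yields a compact string of the kind shown in table 2.

```python
import struct

def to_hex(vector: list[float]) -> str:
    raw = struct.pack(f"<{len(vector)}f", *vector)  # little-endian float32
    return raw.hex(" ")                             # space-separated hexadecimal byte pairs

def from_hex(text: str) -> list[float]:
    raw = bytes.fromhex(text.replace(" ", ""))
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

hex_form = to_hex([-0.1023, 0.0171])
print(hex_form)            # a compact hex string: eight byte pairs for two float32 values
print(from_hex(hex_form))  # recovers the original values (within float32 precision)
```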
In some embodiments, the attribute tags of video segment 1 may also be stored in sub-document 1, where the attribute tags of video segment 1 are the attribute tags of the video stored in step S502. Taking video segment 1 as one video segment of a video shot by the user as an example, the attribute tags of video segment 1 may include the shooting time, the shooting place, the storage path in the mobile phone 300, and the like.
In some embodiments, the attribute tags may instead be stored in document 1. It will be appreciated that the attribute tags of a video are fixed, i.e., the attribute tags corresponding to each of the plurality of video segments in a video are identical. The attribute tags of the video can therefore be stored in document 1, and the attribute tags of each video segment of the video can be obtained from document 1, which reduces the storage pressure of the mobile phone 300 and lowers the storage cost.
Illustratively, video D includes video segment 1 as well as video segment 2 and video segment 3. The visual semantic vector of the representative frame and the related information of the representative frame corresponding to video segment 1 are stored in sub-document 1, those corresponding to video segment 2 may be stored in sub-document 2, and those corresponding to video segment 3 may be stored in sub-document 3. The information stored in document 1 can be seen in table 3 below:
TABLE 3
As shown in table 3, file_path represents the storage path of video D in the cell phone 300. The imaging-time indicates the shooting time of the video D. The location indicates the shooting location of the video D. The segments are used for indicating sub-documents respectively corresponding to the plurality of video segments included in the video D. Based on the above tables 2 and 3, the attribute tag of the video segment 1 may be obtained from the document 1, and the related information of the representative frame of the video segment 1 and the like may be obtained from the sub-document 1. Similarly, the attribute tags for video segment 2 and video segment 3 may also be obtained from document 1.
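For illustration, the document / sub-document layout described by tables 2 and 3 could look as follows. The field names follow the tables; the concrete values and the storage path are hypothetical.

```python
video_document = {
    "file_path": "/storage/emulated/0/DCIM/Camera/video_D.mp4",  # storage path of video D (hypothetical)
    "imaging-time": "2023-10-10 09:30:00",                       # shooting time of video D (hypothetical)
    "location": "city A",                                        # shooting location (hypothetical)
    "segments": [
        {   # sub-document 1: index information of video segment 1 (values from table 2)
            "segments.media-vector": "bd de d1 b4 3c 8c 9c",     # hex-serialized visual semantic vector (truncated)
            "segments.startTime": 0,                             # ms, time point of the start frame
            "segments.endTime": 91666,                           # ms, time point of the end frame
            "segments.startFrame": 0,
            "segments.endFrame": 2750,
            "segments.tag-name": ["character", "landscape", "building"],
        },
        # sub-document 2 and sub-document 3 hold video segment 2 and video segment 3 ...
    ],
}
```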
It should be noted that, since video semantic understanding may occupy a large amount of computing resources, steps S503-S510 may be executed when the mobile phone 300 is charging with the screen off, so as to avoid affecting the use of the device (for example, causing lag).
In the video searching method provided by the application, search matching is performed between the search text input by the user and the indexes corresponding to the video to obtain the search result. An index corresponding to the video is the index of one video segment in the video, and the index of a video segment at least comprises the visual semantic vector of the representative frame in that video segment. The visual semantic vector of the representative frame indicates the meaning expressed by the picture content presented by the representative frame, which can represent the video semantics of the video segment, so the visual semantic vector of the representative frame is sufficiently associated with the video semantics of the video segment. Performing search matching based on the visual semantic vector of the representative frame realizes a fused interaction between the search text and the picture content displayed by the representative frame, which further improves the accuracy of the video search result and improves the user experience.
Next, the steps included in the searching stage of the video searching method provided by the present application will be described in further detail with reference to fig. 5 b.
S511, the gallery service module 510 receives a user input operation of the search text.
The user may enter the search text in a search interface provided by the mobile phone 300. Illustratively, the user may enter the search text in the search box 321 included in the album display interface 320 as shown in FIG. 3a. Also illustratively, the user may enter the search text in the search box 421 included in the negative-one-screen interface 420 as shown in FIG. 4a.
The search text is text in which the user describes the characteristics of the video he or she is looking for. Illustratively, the search text may include the video acquisition time, the video acquisition location, the picture content presented by the video, and the like. For example, the search text may be "scenery shot last week", which is not limited by the present application.
S512, the gallery service module 510 sends the search text to the search module 520.
S513, the search module 520 invokes the multimodal understanding module 530 to determine the text semantic vector corresponding to the search text.
The search module 520 invokes the multimodal understanding module 530 to perform text semantic understanding on the search text to obtain text semantic vectors corresponding to the search text.
Text semantic understanding refers to enabling the mobile phone to understand the meaning expressed by text, and is a key technology in natural language processing (NLP).
In some embodiments, the multimodal understanding module 530 provides a CLIP model, and the search text can be input to the text encoder of the CLIP model to obtain the text semantic vector corresponding to the search text.
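By way of illustration only, the following sketch shows how a text semantic vector may be obtained with a CLIP-style text encoder. It uses the open-source Hugging Face CLIP implementation and the "openai/clip-vit-base-patch32" checkpoint as stand-ins; the actual text encoder provided by the multimodal understanding module 530 may differ.

```python
# Minimal sketch: encode a search text into a text semantic vector with a CLIP text encoder.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

search_text = "scenery shot last week"
inputs = tokenizer([search_text], padding=True, return_tensors="pt")
with torch.no_grad():
    text_vector = model.get_text_features(**inputs)              # shape: (1, 512)
text_vector = text_vector / text_vector.norm(dim=-1, keepdim=True)  # normalize for cosine similarity
```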
In some embodiments, the text encoder of the CLIP model may be trained by text training samples. The training process of the text encoder of the CLIP model can be seen in the following embodiments.
S514, the multimodal understanding module 530 returns the text semantic vector corresponding to the search text to the search module 520.
S515, the search module 520 performs vector recall in the index library based on the text semantic vector corresponding to the search text.
Vector recall refers to recalling, from the index library, the indexes that match the text semantic vector corresponding to the search text.
In some embodiments, the vector similarity between the text semantic vector corresponding to the search text and the visual semantic vector included in each of the plurality of indexes in the index library can be calculated to obtain a vector similarity result for each index, and the N indexes with the highest vector similarity are used as the vector recall result. N is an integer greater than 0. Illustratively, N may be a preset number of vector recall results, such as 5, 8, or 10.
The vector similarity refers to the degree of similarity between two vectors and can be calculated in various ways. For example, the degree of similarity may be determined by calculating the cosine similarity of the two vectors; other methods may also be used, which is not limited in this application.
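As an illustrative example only, the following is a minimal sketch of the top-N vector recall described above, assuming the visual semantic vectors of all indexes have been gathered into a NumPy matrix; the variable names are illustrative.

```python
import numpy as np

def top_n_vector_recall(text_vector: np.ndarray, index_vectors: np.ndarray, n: int = 10):
    """Return the positions of the N indexes whose visual semantic vectors are most similar."""
    # Normalize so that the dot product equals cosine similarity.
    t = text_vector / np.linalg.norm(text_vector)
    v = index_vectors / np.linalg.norm(index_vectors, axis=1, keepdims=True)
    similarities = v @ t                      # one similarity score per index
    order = np.argsort(-similarities)[:n]     # N indexes with the highest similarity
    return order, similarities[order]
```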
In some embodiments, as described above, the vector space in the index library includes a plurality of vector regions, and each vector region includes a plurality of indexes with high mutual vector similarity. In the embodiment of the application, the distance between the text semantic vector corresponding to the search text and the cluster center points of the plurality of vector regions in the index library can be calculated, and the closest cluster center point can be determined. The vector similarity between the text semantic vector and the visual semantic vector included in each index in the vector region to which that cluster center point belongs is then calculated, and the indexes are sorted in descending order of vector similarity to obtain the inverted zipper corresponding to the cluster center point. Illustratively, the top N indexes in the inverted zipper may be used as the vector recall result. Also illustratively, the indexes whose vector similarity exceeds a vector similarity threshold may be used as the vector recall result.
The process of vector recall is described in detail below in conjunction with fig. 7.
As shown in fig. 7, the inverted index library includes a plurality of cluster center points, such as cluster center point 1. First, the distances between the text semantic vector corresponding to the search text and the cluster center points of the plurality of vector regions in the index library are calculated, and cluster center point 1 is determined as the closest cluster center point. Then the vector similarity between the text semantic vector and the visual semantic vectors of the multiple indexes in the vector region of cluster center point 1 is calculated, the indexes are sorted in descending order of vector similarity to obtain inverted zipper 1 corresponding to cluster center point 1, and the TopN indexes are selected from inverted zipper 1 as the vector recall result.
In inverted zipper 1, the visual semantic vector corresponding to index 1 is the closest to cluster center point 1, and index 2 and index 3 are progressively farther from cluster center point 1; therefore, the TopN indexes are selected starting from index 1 and moving backwards. N may be any integer greater than 0, which is not limited in the present application.
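For illustration only, the following sketch shows the cluster-based recall of fig. 7, assuming the cluster center points and the visual semantic vectors of each vector region are already available in memory; all names are illustrative and the actual inverted index structure may differ.

```python
import numpy as np

def cluster_vector_recall(text_vector, cluster_centers, region_vectors, n=10):
    """Recall the TopN indexes from the vector region of the closest cluster center point.

    cluster_centers: (C, D) array of cluster center points.
    region_vectors:  list of (M_c, D) arrays, one per cluster, holding the visual
                     semantic vectors of the indexes in that vector region.
    """
    t = text_vector / np.linalg.norm(text_vector)
    # Step 1: find the closest cluster center point.
    distances = np.linalg.norm(cluster_centers - t, axis=1)
    closest = int(np.argmin(distances))
    # Step 2: rank the indexes in that vector region by similarity to the text vector.
    v = region_vectors[closest]
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    similarities = v @ t
    inverted_zipper = np.argsort(-similarities)      # descending similarity
    return closest, inverted_zipper[:n], similarities[inverted_zipper[:n]]
```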
S516, the search module 520 invokes the natural language understanding module 540 to identify entities in the search text.
The search module 520 invokes the natural language understanding module 540 to identify the entities contained in the search text.
For example, entities having a specific meaning in the search text may be identified by named entity recognition (NER). Entities may include, but are not limited to, time, place, person name, organization name, and proper nouns. Taking the search text "scenery shot last week" as an example, the entities in the search text include "last week" and "scenery".
S517, the natural language understanding module 540 returns the entities in the search text to the search module 520.
S518, the search module 520 performs entity recall in the index library based on the entities in the search text.
Entity recall refers to recalling, from the index library, the indexes that match the entities in the search text.
In some embodiments, the index may include the related information of the representative frame of the video segment and the attribute tag of the video segment, and these may contain entities, such as the video acquisition time, the video acquisition location, and the category label of the video segment. Taking the search text "scenery shot in city B last week" as an example, the search text contains the place entity "city B", the time entity "last week", and the entity "scenery" related to the picture content displayed by the video; these can be matched against the entities corresponding to the respective indexes, and the indexes that match the entities in the search text are obtained as the entity recall result.
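For illustration only, the following sketch shows entity recall under the assumption that each index carries a set of entity strings derived from its attribute tags; in practice, a time entity such as "last week" would typically be resolved to a date range before matching, which this sketch omits.

```python
def entity_recall(search_entities, indexes):
    """Return the indexes whose attribute-tag entities match the entities in the search text.

    search_entities: entities extracted from the search text, e.g. {"last week", "city B", "scenery"}.
    indexes: list of dicts; each dict is assumed to hold an "entities" set built from the
             attribute tags (acquisition time, acquisition location, category label, ...).
    """
    results = []
    for idx in indexes:
        matched = search_entities & idx["entities"]   # entities present in both
        if matched:
            results.append((idx, len(matched)))       # keep the number of matched entities
    # Indexes that match more entities in the search text are ranked first.
    results.sort(key=lambda item: item[1], reverse=True)
    return results
```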
S519, the search module 520 ranks the vector recall results and the entity recall results.
In some embodiments, the intersection results or union results of the vector recall results and the entity recall results may be ordered.
Illustratively, the ranking may be based on the vector similarity between each recall result and the search text and on the degree of entity matching between each recall result and the search text. For example, for each recall result (a vector recall result or an entity recall result), the vector similarity between the text semantic vector of the search text and the visual semantic vector of the recall result and the degree of matching between the entities in the search text and the entities in the recall result can be weighted and summed to obtain a comprehensive matching degree, and the recall results (including the vector recall results and the entity recall results) are sorted in descending order of comprehensive matching degree.
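As an illustrative example only, the following sketch shows the weighted-sum ranking described above; the weight values and the score fields are illustrative assumptions.

```python
def rank_recall_results(results, w_vector=0.7, w_entity=0.3):
    """Sort recall results by a weighted sum of vector similarity and entity matching degree.

    results: list of dicts with keys "vector_similarity" (cosine similarity mapped to [0, 1])
             and "entity_match" (fraction of search-text entities matched, in [0, 1]).
    """
    for r in results:
        # Comprehensive matching degree of the recall result.
        r["score"] = w_vector * r["vector_similarity"] + w_entity * r["entity_match"]
    return sorted(results, key=lambda r: r["score"], reverse=True)
```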
Therefore, based on the search text input by the user, entity matching is performed in addition to matching on the text semantic vector, and the final display order is obtained from the comprehensive matching degree of the search results. This ensures that the videos presented to the user better match the user's search text, further improving the use experience.
S520 the search module 520 returns the search results to the gallery service module 510.
The search module 520 returns the ranked search results to the gallery service module 510.
S521, gallery service module 510 presents the search results to the user.
It will be appreciated that, in practice, the mobile phone 300 will typically store both pictures and videos in the gallery application. Therefore, when a search is performed through the search interface provided in the gallery application, picture search results and video search results are displayed at the same time. That is, the index library includes indexes corresponding to pictures in addition to the indexes corresponding to video segments.
In some embodiments, the index corresponding to a picture may include, but is not limited to, a picture semantic vector, the attribute tag of the picture, and the like. Illustratively, as described above, a video frame is an image and a picture is also an image, so the picture semantic vector corresponding to the picture may be generated by the image encoder of the CLIP model provided by the multimodal understanding module 530 and returned to the gallery service module 510. For example, the attribute tag of the picture may be obtained by the gallery service module 510 by receiving and storing a user operation of adding or modifying the attribute tag of the picture. The search module 520 receives the picture semantic vector and the attribute tag of the picture sent by the gallery service module 510 and constructs the index corresponding to the picture based on them.
Illustratively, as in the first search result display interface 330 of FIG. 3a, the displayed first search results include video search results and picture search results.
In some embodiments, the search results may be presented based on a plurality of ordered indexes. As described above, the index to which the video corresponds includes visual semantic vectors representing frames, related information representing frames, and attribute tags for the video segments. The index corresponding to the picture comprises a picture semantic vector, an attribute label of the picture and the like.
Illustratively, as in the first search result display interface 330 of FIG. 3a, the picture search results in the first display area 331 show images, and the video search results show thumbnails and time points. The thumbnail and the time points correspond to information included in the index corresponding to the video search result. Taking "video A" in FIG. 3a as an example, its thumbnail is the start frame of video segment A included in the index, its first time point is the time point "02:18" corresponding to the start frame of video segment A included in the index, and its second time point is the total duration "08:32" of the video included in the index.
Also by way of example, the video search results and the picture search results may also display their respective attribute tags; for example, a video search result may display the video capture time, the video capture location, and the like. As shown in (2) of FIG. 3b, the picture search result includes the shooting time of the picture "October 1, 2023" and the shooting place of the picture "city B".
Still further exemplary, the video search results may also display other content in the corresponding index, such as persona relationships, persona names, and the like. The application is not limited in this regard.
It should be noted that the gallery service module, the search module, the multimodal understanding module, and the natural language understanding module may also be located in a cloud server; that is, the cloud server implements the steps included in the index construction stage through the interaction of the four modules. In the search stage, the steps may be implemented based on the interaction between an electronic device such as the mobile phone 300 and the cloud server.
In some embodiments, the mobile phone 300 may send the search text input by the user to the gallery service module of the cloud server, so that the gallery service module of the cloud server interacts with other modules to implement the steps included in the search phase, and then the gallery service module of the cloud server sends the search result to the mobile phone 300, so that the mobile phone 300 displays the search result to the user.
In some embodiments, assume that the total duration of video E is 2 minutes and it is shot at a frame rate of 30 fps, that is, 30 images are captured per second. When video E is stored in the cloud server, the multimodal understanding module in the cloud server may perform frame-splitting processing on video E to decompose it into 3600 video frames. The multimodal understanding module then performs segmentation processing on video E, dividing it into 120 video segments with 1 s as the time unit, so that each video segment includes 30 video frames (that is, the 30 images captured in one second). The multimodal understanding module scores the 30 video frames in each video segment and takes the video frame with the highest score as the representative frame of that video segment. Illustratively, the scoring may be based on the jitter, sharpness, pixels, and the like of the video frame, which is not limited by the application.
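For illustration only, the following sketch shows the per-second segmentation and representative-frame selection described above, assuming the frames of video E have already been decoded into memory and that score_frame is a placeholder for the jitter/sharpness/pixel scoring, which is assumed to be implemented elsewhere.

```python
def select_representative_frames(frames, score_frame, fps=30):
    """Split decoded frames into 1 s segments and pick one representative frame per segment.

    frames: decoded video frames in order (for a 2-minute video at 30 fps, 3600 frames).
    score_frame: callable returning a quality score for a frame (e.g. based on jitter,
                 sharpness, pixels).
    """
    representatives = []
    for start in range(0, len(frames), fps):             # one segment per second
        segment = frames[start:start + fps]
        best = max(range(len(segment)), key=lambda i: score_frame(segment[i]))
        representatives.append({
            "segment_index": start // fps,                # 0-based segment number
            "time_point_s": start // fps,                 # start time of the segment in seconds
            "frame": segment[best],                       # representative frame of the segment
        })
    return representatives
```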
Then the search module of the cloud server takes the visual semantic vector of the representative frame as the index of the corresponding video segment, so that the representative frame of each video segment of video E can be matched based on the search text. If the search text is successfully matched with the 50th video segment of video E, the thumbnail of the video returned to the user is the representative frame of the 50th video segment, and the returned time point is 50 s. For the specific implementation, reference may be made to fig. 5b and the description of the above embodiments, which will not be repeated here.
It should be noted that the multimodal understanding module 530 of the mobile phone 300 may also segment the videos stored in the gallery with 1 s as the time unit, which is not limited by the present application.
In the above embodiments, the video searching method provided by the present application uses the image encoder and the text encoder of the CLIP model, so the image encoder and the text encoder of the CLIP model need to be trained first. In some embodiments, the image encoder and the text encoder of the CLIP model may be trained separately. In some embodiments, the image encoder and the text encoder of the CLIP model may be trained together based on a contrastive learning training pattern.
The training process of the image encoder and the text encoder of the CLIP model is described below in conjunction with fig. 8 and fig. 9. The following embodiments divide the contrastive-learning-based training of the image encoder and the text encoder of the CLIP model into steps 1-7.
Step 1, acquiring an image training sample and a text training sample corresponding to the image training sample.
Both the image encoder of the CLIP model and the text encoder of the CLIP model need to be trained in advance with a large number of training samples. Therefore, before model training, a training sample of an image encoder of the CLIP model, that is, an image training sample, is acquired, and a training sample of a text encoder of the CLIP model, that is, a text training sample corresponding to the image training sample, is acquired.
As described above, the image training samples may include picture training samples and video frame training samples. The picture training sample can be any picture, and the video frame training sample can be a video frame in any video. The text training sample corresponding to an image training sample refers to text corresponding to the content displayed by the image training sample, that is, the text training sample can express the content displayed by the image training sample. Illustratively, if the image training sample is the thumbnail displayed for "video A" in the first display area 331 in FIG. 3a, then its corresponding text training sample may be "Male student standing at night next to the big tree".
The application does not limit the manner of acquiring the text training samples corresponding to the image training samples.
For example, the text training samples corresponding to the image training samples may be manually labeled, that is, labeled according to a person's semantic understanding of the image training samples. Also for example, the text training samples may be automatically generated by identifying relevant content such as objects, scenes, and actions in the image training samples. Still further, the text training samples may be automatically generated by a text generation model for generating descriptive text of images.
It should be noted that the number of the image training samples is not limited in the present application. It will be appreciated that the text training samples correspond to the image training samples and are therefore the same number.
As shown in fig. 9, N image training samples are acquired, and N text training samples corresponding to the N image training samples one by one are acquired. Illustratively, image training sample 1 corresponds to text training sample 1.
Step 2, inputting the image training samples into the image encoder, and outputting, by the image encoder, the image vectors corresponding to the image training samples.
As shown in fig. 8, for an image training sample, an image encoder may encode it to obtain an image vector for the image training sample.
As shown in fig. 9, N image training samples are input to the image encoder, and image vectors I1, I2, I3, …, IN corresponding to the N image training samples are obtained.
Step 3, inputting the text training samples into the text encoder, and outputting, by the text encoder, the text vectors corresponding to the text training samples.
As shown in fig. 8, for a text training sample, a text encoder may encode it to obtain a text vector for the text training sample.
As shown in fig. 9, N text training samples are input to the text encoder, and text vectors T1, T2, T3, …, TN corresponding to the N text training samples are obtained.
Step 4, respectively combining each image vector with the plurality of text vectors to obtain a plurality of vector pairs, determining, from the plurality of vector pairs, the vector pairs having a corresponding relationship as positive sample vector pairs, and determining the remaining vector pairs as negative sample vector pairs.
Contrastive learning is an unsupervised training approach, so positive samples and negative samples need to be defined from the training samples. In the embodiment of the application, the positive sample vector pairs and the negative sample vector pairs are determined from the plurality of vector pairs.
In some embodiments, assuming that there are N image vectors and N text vectors, combining each image vector with the N text vectors respectively results in N×N vector pairs. It can be understood that the image training samples and the text training samples have a corresponding relationship, so the vectors corresponding to them also have a corresponding relationship; among the N×N vector pairs, the N vector pairs formed by an image vector and a text vector having a corresponding relationship are determined as positive sample vector pairs, and the remaining vector pairs are determined as negative sample vector pairs.
As shown in fig. 9, taking I1 as an example, it is combined with T1, T2, T3, …, TN to obtain N vector pairs I1·T1, I1·T2, I1·T3, …, I1·TN; similarly for I2, I3, …, IN, so that N×N vector pairs are obtained in total. The vector pairs having a corresponding relationship, namely I1·T1, I2·T2, I3·T3, …, IN·TN, are determined as positive sample vector pairs, and the rest are determined as negative sample vector pairs.
Step 5, calculating the vector similarity between the image vector and the text vector in each vector pair.
For example, a vector cosine similarity between the image vector and the text vector in each vector pair may be calculated.
Step 6, adjusting the parameters of the image encoder and the text encoder based on the loss function, the vector similarity corresponding to the positive sample vector pairs, and the vector similarity corresponding to the negative sample vector pairs.
As shown in fig. 8, contrastive learning is performed on the image encoder together with the image vectors it outputs and the text encoder together with the text vectors it outputs, and the parameters of the image encoder and the text encoder are adjusted accordingly.
It will be appreciated that in embodiments of the present application, the model training goal is to maximize the vector similarity of the positive sample vector pair and minimize the vector similarity of the negative sample vector pair.
Illustratively, the loss function may be a cross-entropy loss function, which is specifically as follows:

$$L = -\frac{1}{K}\sum_{i=1}^{K}\left[y_i \log \hat{y}_i + \left(1 - y_i\right)\log\left(1 - \hat{y}_i\right)\right]$$

where $L$ represents the loss value of the loss function, $K$ represents the number of vector pairs, $y_i$ represents the true vector similarity corresponding to the i-th vector pair, and $\hat{y}_i$ represents the predicted vector similarity corresponding to the i-th vector pair.
Furthermore, in some embodiments, the true vector similarity of a positive sample vector pair may be set to 1 and the true vector similarity of a negative sample vector pair may be set to 0, and the parameters of the image encoder and the text encoder are adjusted until the predicted vector similarity of the positive sample vector pairs is as close to 1 as possible and that of the negative sample vector pairs is as close to 0 as possible, that is, until the value of the loss function is minimized.
Step 7, when the training cut-off condition is met, ending the training to obtain the trained image encoder and the trained text encoder. Illustratively, the training cut-off condition may be reaching a preset number of training iterations during model training, or the loss value of the loss function being smaller than a loss value threshold, or the like.
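For illustration only, the following PyTorch-style sketch shows steps 2-6 for one training batch, using the pairwise binary cross-entropy formulation described above; the encoder modules, the optimizer, and the data loading are assumed to be defined elsewhere, and practical CLIP implementations may instead use a symmetric softmax cross-entropy over the similarity matrix.

```python
import torch
import torch.nn.functional as F

def train_step(image_encoder, text_encoder, optimizer, images, texts):
    """One contrastive training step over a batch of N corresponding image/text samples."""
    image_vectors = F.normalize(image_encoder(images), dim=-1)   # Step 2: (N, D) image vectors
    text_vectors = F.normalize(text_encoder(texts), dim=-1)      # Step 3: (N, D) text vectors

    # Steps 4 and 5: all N x N vector pairs and their cosine similarities.
    similarities = image_vectors @ text_vectors.t()              # (N, N)
    # Positive pairs lie on the diagonal (true similarity 1), the rest are negative (0).
    targets = torch.eye(similarities.size(0), device=similarities.device)

    # Map cosine similarity from [-1, 1] to [0, 1] so it can be compared with the 0/1 targets.
    predicted = (similarities + 1) / 2

    # Step 6: cross-entropy between predicted and true similarities, then parameter update.
    loss = F.binary_cross_entropy(predicted, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```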
Embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a computer, is capable of carrying out one or more steps of any one of the video search methods described above.
The computer readable storage medium may be a non-transitory computer readable storage medium, for example, a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Another embodiment of the application also provides a computer program product containing instructions. The computer program product is capable of implementing one or more steps of any of the video search methods described above when executed by a computer.
The electronic device, the computer readable storage medium, and the computer program product provided in this embodiment are used to execute the corresponding video searching method provided above; therefore, for their beneficial effects, reference may be made to the beneficial effects of the corresponding video searching method provided above, which will not be repeated here.
The terms first, second, third and the like in the description and in the claims and in the drawings are used for distinguishing between different objects and not for limiting the specified order.
In embodiments of the application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g." in an embodiment should not be taken as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.