Disclosure of Invention
The video searching method, electronic device, and storage medium provided by this application aim to improve the accuracy of video search results, enable the electronic device to accurately search for and display videos corresponding to user needs, reduce user operations, and improve the user experience.
In order to achieve the above purpose, the application adopts the following technical solutions:
In a first aspect, the application provides a video searching method applied to an electronic device, where the electronic device may be a mobile phone, a tablet computer, a notebook computer, or another device that includes a gallery application. A first interface of the gallery application includes an input box that supports the function of searching for pictures/videos. The input box of the first interface includes a first text, which is a search text input by the user, for example, "girl doing yoga" or "boy standing near the bridge". The first interface includes a first thumbnail of a first video, the first thumbnail corresponding to one of the video frames in the first video. The first thumbnail includes a first point in time, the first point in time being a timestamp within the first video. When the user performs a triggering operation on the first thumbnail, for example a tap, the electronic device plays the first video from the first point in time.
The electronic device displays a second interface of the gallery application, the second interface including an input box, the input box including a second text. The second interface includes a second thumbnail of the first video, the second thumbnail including a second point in time, the second point in time being later than the first point in time. After the user performs a triggering operation on the second thumbnail, the electronic device starts to play the first video from the second point in time. That is, after the user triggers either the first thumbnail or the second thumbnail, the electronic device plays the same first video, but the starting points of playback differ, and therefore the video frames displayed to the user differ.
Based on the second text input by the user, the electronic device can find the first video. After the user triggers the second thumbnail, the electronic device plays the first video from the second point in time, which is later than the first point in time, and the played picture content meets the user's need. That is, after the user inputs a text, the video search result displayed by the electronic device can meet the user's need. Therefore, the electronic device can accurately display the video corresponding to the user's need, reduce user operations, and improve the user experience.
In one possible implementation, the first text entered by the user within the search box is different from the second text; for example, the first text may be "a boy standing on the river side" and the second text may be "a boy standing near the bridge". The first video matches the first text, for example, the first video includes picture content about "a boy standing on the river side". The first video also matches the second text, for example, the first video includes picture content about "a boy standing near the bridge". Therefore, based on the search text input by the user, the electronic device can display a video matching that text, which meets the user's need, spares the user from manually looking for the video, and improves the use experience of the electronic device.
In one possible implementation, the first video includes a first video frame and a second video frame, and the second video frame is played after the first video frame; that is, when the first video is played, the first video frame is played before the second video frame. The first thumbnail corresponds to the first video frame, i.e., the first thumbnail displays the first video frame of the first video. The second thumbnail corresponds to the second video frame, i.e., the second thumbnail displays the second video frame of the first video. The matching of the first text with the first video further includes matching the first text with the first video frame; for example, the first text is "a boy standing on the river side", and the picture content of the first video frame shows a boy standing on the river side. The matching of the second text with the first video further includes matching the second text with the second video frame; for example, the second text is "a boy standing near the bridge", and the picture content of the second video frame shows a boy standing near a bridge.
In this way, the search text input by the user matches the picture content displayed by the video frame corresponding to the thumbnail, and the search text reflects the user's need, so the electronic device can find the video that accurately corresponds to that need, which improves video search accuracy and further improves the user experience.
In one possible implementation, when the user triggers the first thumbnail of the first video and the electronic device starts playing the first video from the first point in time, the electronic device may start playing from the first video frame, or from a first start frame that precedes the first video frame. That is, the electronic device may start playing directly from the first video frame matched with the first text, or may start playing from the first start frame and then play on to the first video frame matched with the first text.
Similarly, when the user triggers the second thumbnail of the first video and the electronic device starts playing the first video from the second point in time, the electronic device may start playing from the second video frame matched with the second text, or may start playing from a second start frame that precedes the second video frame and then play on to the second video frame, where the second start frame is after the first video frame.
After the user triggers a thumbnail, the electronic device may start playing from the video frame matching the search text, or from an earlier video frame. At the same time, the second start frame precedes the second video frame and follows the first video frame, and the first video frame and the second video frame match the different first text and second text, respectively. This means that if the user triggers the second thumbnail, the user is not shown the first video frame matching the first text; instead, playback may begin from a second start frame that is closer to the second video frame. Compared with the first video frame, the picture content displayed by the second start frame is more strongly correlated with the picture content displayed by the second video frame. Therefore, the picture content of the video played for the user is more coherent and corresponds more accurately to the user's need, which further improves the user experience.
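As a purely illustrative aid (not part of the claimed method), the following minimal Python sketch shows one way the playback start point described above could be chosen: playback begins either at the video frame matched with the search text or at a slightly earlier start frame that still lies after the previously matched frame. The function name, the lead-in duration, and the time values are assumptions.

```python
from typing import Optional

def playback_start(matched_frame_s: float,
                   previous_matched_frame_s: Optional[float] = None,
                   lead_in_s: float = 5.0,
                   from_matched_frame: bool = False) -> float:
    """Return the point in time (in seconds) from which the video is played."""
    if from_matched_frame:
        # Play directly from the frame matched with the search text.
        return matched_frame_s
    # Otherwise rewind a little so the displayed content stays coherent...
    start = max(0.0, matched_frame_s - lead_in_s)
    if previous_matched_frame_s is not None:
        # ...but never rewind past the frame matched by the other search text.
        start = max(start, previous_matched_frame_s)
    return start

# Example: the second video frame sits at 138 s and the first at 110 s;
# playback begins a few seconds before 138 s and never before 110 s.
print(playback_start(138.0, previous_matched_frame_s=110.0))  # 133.0
```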
In one possible implementation, the video searching method further includes the electronic device displaying a third interface of the gallery application, the third interface including an input box, the input box including a third text. The third text is different from the first text. For example, the first text may be "a boy standing on the river side", and the third text may be "a boy standing outdoors". The third interface includes the first thumbnail of the first video. After the user performs a triggering operation on the first thumbnail, the electronic device plays the first video from the first point in time.
In practical applications, different users may describe the same video with different text, and the same user may also change how they describe a video; that is, users may input different search texts yet expect the same video search result. In this application, both the first text and the third text can find the first video, and the electronic device displays the first thumbnail in each case; that is, the electronic device can return the same video search result based on different search texts. Therefore, the electronic device can find the same video that accurately corresponds to the user's need based on different search texts, which improves video search accuracy and the user experience.
In one possible implementation, the first interface further includes a third thumbnail of the first video, the third thumbnail including a third point in time. The video searching method further includes: after the user triggers the third thumbnail, the electronic device starts to play the first video from the third point in time. The first text matches a third video frame corresponding to the third thumbnail, and the third point in time is later than the second point in time. The first text may also match the first video frame corresponding to the first thumbnail; that is, the same search text may match different video frames of the same video. Therefore, different video frames of a video can be found based on the same search text, so the application can display search results that accurately correspond to the search text, improve the accuracy of video search, and further improve the user experience.
In one possible implementation, when the user triggers the third thumbnail of the first video and the electronic device starts playing the first video from the third point in time, the electronic device may start playing from the third video frame matched with the first text, or from a third start frame that precedes the third video frame and then play on to the third video frame, where the third start frame is after the second video frame.
Therefore, the first video may be played from the third start frame, whose picture content is strongly correlated with the third video frame, or directly from the third video frame, so that the picture content matched with the first text is displayed to the user, meeting the user's need and improving the user experience.
In one possible implementation, the first interface further includes a fourth thumbnail of a second video, the fourth thumbnail including a fourth point in time, and the video searching method further includes the electronic device playing the second video from the fourth point in time when the user triggers the fourth thumbnail. The first text matches a fourth video frame corresponding to the fourth thumbnail. Therefore, based on the same search text input by the user, different video frames of different videos matched with the search text can be found, which improves the accuracy of video search and further improves the user experience.
In one possible implementation, the video search method further includes the electronic device displaying a negative one-screen interface, the negative one-screen interface including a search box that supports online searches and searches of local files of the electronic device. The search box includes a first text. The negative one-screen interface includes a first thumbnail of the first video, the first thumbnail including a first point in time. When the user triggers the first thumbnail, the electronic device may begin playing the first video from the first point in time.
In this way, the application supports the user in searching for videos through the search box of the gallery application and also through the search box of the negative one-screen interface, which further improves the user experience.
In a second aspect, the application provides a video searching method applied to an electronic device, where the electronic device may be a mobile phone, a tablet computer, a notebook computer, or another device that includes a gallery application. A first interface of the gallery application includes an input box that supports the function of searching for pictures/videos. The input box of the first interface includes a first text, which is a search text input by the user, such as "boy standing outdoors" or "girl dancing indoors". The first interface includes a first thumbnail of a first video. The first thumbnail includes a first point in time, the first point in time being a timestamp within the first video. The first text is matched with the first video frame corresponding to the first thumbnail through a CLIP model; that is, the text semantics of the first text are matched with the visual semantics of the first video frame. The CLIP model is a pre-trained neural network model for matching images and text, here the first text input by the user and the first video frame. When the user performs a triggering operation on the first thumbnail, for example a tap, the electronic device plays the first video from the first point in time.
The electronic device displays a second interface of the gallery application including an input box, the input box including a second text. The second interface includes a second thumbnail of the first video, the second thumbnail including a second point in time, the second point in time being later than the first point in time. The second text is matched with the second video frame corresponding to the second thumbnail through the CLIP model; that is, the text semantics of the second text are matched with the visual semantics of the second video frame. After the user performs a triggering operation on the second thumbnail, the electronic device starts to play the first video from the second point in time. That is, after the user triggers either the first thumbnail or the second thumbnail, the electronic device plays the same first video, but the starting points of playback differ, and therefore the video frames displayed to the user differ.
In this way, based on the search text input by the user, a video frame matching the search text can be searched. The text semantics of the search text are fully associated with the visual semantics of the video frame, so that the fusion interaction of the search text and the picture content displayed by the video frame can be realized, the video search accuracy is further improved, and the use experience of a user is improved.
In one possible implementation, dividing a video yields a number of video segments. The first video frame is in a first video segment of the first video, and the second video frame is in a second video segment of the first video. The first video frame and the second video frame may be determined as follows. The electronic device performs first processing on the first video, for example frame-splitting processing, to obtain a plurality of video frames of the first video and a classification label of each video frame, where a classification label indicates the type of object displayed by a video frame; for example, a classification label may be a person, plant, animal, building, or natural scene. The electronic device then performs second processing on the first video based on the classification labels corresponding to the plurality of video frames, for example segmentation processing, dividing the first video into a plurality of video segments that include the first video segment and the second video segment. The electronic device then determines one video frame of the first video segment as the first video frame; that is, the first video frame is determined as the representative frame of the first video segment and can be used to represent the first video segment. Likewise, the electronic device determines one video frame of the second video segment as the second video frame, which is the representative frame of the second video segment and can be used to represent it.
In this way, the scheme implemented on the electronic device does not require building an index for every video frame of a video, which avoids the search and matching delay that would entail. Instead, the index of the first video segment can be built based on the first video frame, and the index of the second video segment based on the second video frame, which greatly reduces the number of indexes, speeds up video search, improves the user experience, and also saves the cost of building the index.
In one possible implementation, the electronic device performing the second processing on the first video based on the classification labels corresponding to the plurality of video frames further includes performing the second processing based on the differences in image parameters between adjacent video frames of the first video and the differences in their classification labels, where an image parameter is a display characteristic of a video frame, such as jitter, sharpness, or pixel values. For example, the electronic device may perform the second processing on the first video based on the difference in sharpness between adjacent video frames, such as the first video frame and the second video frame, and the difference between their classification labels, thereby obtaining a plurality of video segments including the first video segment and the second video segment.
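The following is a minimal sketch of how such second processing could be implemented, assuming each frame already carries a classification label and one image parameter (sharpness) from the first processing; the threshold value and field names are illustrative assumptions, not details from the application.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    index: int
    label: str        # classification label, e.g. "person", "building", "natural scene"
    sharpness: float  # one example of an image parameter

def split_into_segments(frames: List[Frame], sharpness_jump: float = 0.3) -> List[List[Frame]]:
    """Cut the frame sequence whenever the classification label changes or the
    image-parameter difference between adjacent frames exceeds a threshold."""
    if not frames:
        return []
    segments: List[List[Frame]] = []
    current: List[Frame] = [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        label_changed = cur.label != prev.label
        quality_jump = abs(cur.sharpness - prev.sharpness) > sharpness_jump
        if label_changed or quality_jump:
            segments.append(current)   # close the current video segment
            current = []
        current.append(cur)
    segments.append(current)
    return segments
```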
In this way, changes in the type of objects displayed across the plurality of video frames can be determined based on the classification labels, changes in displayed image quality can be determined based on the image parameters, and video frames whose displayed picture content is more similar can be grouped into one video segment. A video frame that can represent each video segment can then be determined, so that the representative frames corresponding to the plurality of video segments of a video represent the picture content of the video more completely.
In one possible implementation, the electronic device determining one video frame of the first video segment as the first video frame, i.e., as the representative frame of the first video segment, further includes the electronic device determining, as the first video frame, the start frame (i.e., the first frame of the first video segment), the end frame (i.e., the last frame), a random one of the plurality of video frames of the first video segment, or the video frame corresponding to the point in time of a first position. The point in time of the first position can be obtained as the average of the start point in time and the end point in time of the first video segment, that is, the middle point in time of the first video segment, and the video frame corresponding to that middle point in time is the middle frame of the first video segment. In this way, any one video frame of a video segment may represent the video segment, which facilitates the electronic device building an index for the video segment.
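For illustration only, the sketch below picks a representative frame for one video segment using the strategies listed above; the "middle" strategy averages the start and end points in time as described. Names and defaults are assumptions.

```python
import random
from typing import Sequence

def representative_frame(frame_times: Sequence[float], strategy: str = "middle") -> float:
    """frame_times: points in time (seconds) of the frames in one video segment."""
    if strategy == "start":
        return frame_times[0]            # the first frame of the segment
    if strategy == "end":
        return frame_times[-1]           # the last frame of the segment
    if strategy == "random":
        return random.choice(list(frame_times))
    # "middle": the frame closest to the average of the start and end points in time
    midpoint = (frame_times[0] + frame_times[-1]) / 2
    return min(frame_times, key=lambda t: abs(t - midpoint))
```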
In one possible implementation, the electronic device determining one video frame of the first video segment as the first video frame further includes the electronic device determining the first video frame based on the image parameters and the classification labels respectively corresponding to the plurality of video frames of the first video segment. In other words, a video frame that is more representative of the first video segment is determined based on the image quality of the video frames and the types of objects they display. This improves the accuracy of the index built from the representative frame, which in turn improves the accuracy of video search and the user experience.
In one possible implementation, matching the first text with the first video frame corresponding to the first thumbnail through the CLIP model further includes the electronic device inputting the first text into a text encoder of the CLIP model to obtain a text semantic vector of the first text, where the text semantic vector can represent the semantic features of the whole first text. The electronic device inputs the first video frame into an image encoder of the CLIP model to obtain a visual semantic vector of the first video frame, where the visual semantic vector can represent the semantic features of the first video frame. The electronic device then matches the first text with the first video frame based on the text semantic vector and the visual semantic vector.
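As one possible stand-in for the encoders described above, the hedged sketch below uses the Hugging Face transformers implementation of CLIP to obtain a text semantic vector and a visual semantic vector and compare them; the application does not prescribe a specific library, checkpoint, or file name, so these are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def match(text: str, frame_path: str) -> float:
    """Return the cosine similarity between a search text and one video frame."""
    frame = Image.open(frame_path)
    inputs = processor(text=[text], images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_vec = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        image_vec = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_vec = text_vec / text_vec.norm(dim=-1, keepdim=True)    # text semantic vector
    image_vec = image_vec / image_vec.norm(dim=-1, keepdim=True) # visual semantic vector
    return float((text_vec @ image_vec.T).item())

# e.g. match("a boy standing near the bridge", "frame_0218.jpg") -> similarity score
```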
Because the first video frame is used to represent the first video segment, its visual semantic vector is sufficiently associated with the first video segment. By matching the text semantic vector of the first text against the visual semantic vector of the first video frame, the electronic device realizes fusion interaction between the search text and the picture content displayed by the representative frame, which further improves video search accuracy and the user experience.
In one possible implementation, a first vector similarity between the text semantic vector of the first text and the visual semantic vector of the first video frame is greater than or equal to a first threshold. Setting the first threshold selects the visual semantic vectors of first video frames with high vector similarity, so that more accurate video search results are displayed to the user.
In one possible implementation, an inverted index library includes a plurality of visual semantic vectors, and the electronic device clusters these visual semantic vectors to determine a plurality of cluster center points and the cluster corresponding to each center point. The visual semantic vector of the first video frame belongs to a first cluster corresponding to a first cluster center point. A second vector similarity between the text semantic vector of the first text and the vector of the first cluster center point of the inverted index library is greater than or equal to a second threshold. A third vector similarity between the text semantic vector of the first text and the visual semantic vector is greater than or equal to a third threshold.
In this way, when the electronic device performs search matching based on the text semantic vector of the search text, it can first match against the plurality of cluster center points and then against the visual semantic vectors within the cluster of the selected center point, so that it does not have to match against every index in the index library. This avoids search delay, improves video search efficiency, and further improves the user experience.
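The two-stage lookup can be sketched as follows; clustering with k-means (here via scikit-learn) is one assumed way to obtain the cluster center points, and the threshold values are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_inverted_index(visual_vecs: np.ndarray, n_clusters: int = 16):
    """Cluster all visual semantic vectors; return center points and cluster labels."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(visual_vecs)
    return km.cluster_centers_, km.labels_

def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    return (a @ b.T) / (np.linalg.norm(a) * np.linalg.norm(b, axis=-1))

def search(query_vec, centers, labels, visual_vecs,
           second_threshold: float = 0.2, third_threshold: float = 0.3):
    # Stage 1: match the text semantic vector against the cluster center points.
    center_sims = cosine(query_vec, centers)
    best = int(np.argmax(center_sims))
    if center_sims[best] < second_threshold:
        return []
    # Stage 2: match only against the visual semantic vectors inside that cluster.
    members = np.where(labels == best)[0]
    sims = cosine(query_vec, visual_vecs[members])
    return [(int(i), float(s)) for i, s in zip(members, sims) if s >= third_threshold]
```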
In one possible implementation, an entity of the first text matches an entity of an attribute tag of the first video segment. Therefore, based on the search text input by the user, the electronic device performs entity matching in addition to matching the text semantic vector; that is, the electronic device can perform both vector recall and entity recall, and can display video search results that appear in both the vector recall results and the entity recall results, which further improves video search accuracy and the user experience.
In one possible implementation, the first interface displayed by the electronic device further includes a third thumbnail of a second video, and the third video frame corresponding to the third thumbnail is in a third video segment of the second video. An entity of the first text matches an entity of an attribute tag of the third video segment. Therefore, the electronic device can also perform entity recall on top of vector recall and display both the entity recall results and the vector recall results to the user, which enriches the video search results, further improves video search accuracy, and improves the user experience.
In one possible implementation, the first interface displayed by the electronic device further includes a fourth thumbnail of the first video, the first text matches a fourth video frame corresponding to the fourth thumbnail, the fourth video frame is in a fourth video segment of the first video, and an entity of the first text matches an entity of an attribute tag of the fourth video segment. In addition, an entity of the first text matches an entity of an attribute tag of the first video segment, and the first text matches the first video frame. On the first interface displayed by the electronic device, the first thumbnail is displayed before the fourth thumbnail.
The display order of the first thumbnail and the fourth thumbnail may be determined as follows. The electronic device determines a first comprehensive matching degree between the first thumbnail and the first text based on a first vector similarity between the visual semantic vector of the first video frame and the text semantic vector of the first text and a first matching degree between the attribute tag of the first video segment and the entity of the first text. The electronic device determines a second comprehensive matching degree between the fourth thumbnail and the first text based on a second vector similarity between the visual semantic vector of the fourth video frame and the text semantic vector of the first text and a second matching degree between the attribute tag of the fourth video segment and the entity of the first text. The electronic device then displays the thumbnails in descending order of comprehensive matching degree; because the first comprehensive matching degree is greater than the second comprehensive matching degree, the first thumbnail is displayed before the fourth thumbnail, i.e., the thumbnail with the higher comprehensive matching degree is displayed first.
In this way, the video search results are ordered by comprehensive matching degree, ensuring that the results ranked higher better match the user's search text, which further improves the user experience.
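A simple illustration of such ranking is given below; the linear weighting of the two factors is an assumption, since the application only states that the vector similarity and the entity matching degree are combined into a comprehensive matching degree.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    thumbnail_id: str
    vector_similarity: float  # text semantic vector vs. visual semantic vector
    entity_match: float       # match between query entities and segment attribute tags

def comprehensive_score(c: Candidate, w_vec: float = 0.7, w_entity: float = 0.3) -> float:
    return w_vec * c.vector_similarity + w_entity * c.entity_match

def rank(candidates: List[Candidate]) -> List[Candidate]:
    return sorted(candidates, key=comprehensive_score, reverse=True)

ranked = rank([Candidate("first_thumbnail", 0.82, 1.0),
               Candidate("fourth_thumbnail", 0.79, 1.0)])
# The first thumbnail scores higher, so it is displayed before the fourth thumbnail.
```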
In a third aspect, the application provides an electronic device comprising a memory, a display screen, and one or more processors, the memory storing computer program code comprising computer instructions, the display screen providing a display function, and the one or more processors invoking the computer instructions to cause the electronic device to perform the method of the first or second aspect described above.
In a fourth aspect, the present application provides a computer readable storage medium having a computer program stored therein, which when executed by a processor, implements the method of the first or second aspect described above.
According to the technical scheme, the application has the following beneficial effects:
Based on a first text input by the user, the first video can be found and the first thumbnail displayed; after the user triggers the first thumbnail, the first video can be played from the first point in time included in the first thumbnail, and the picture content played from the first point in time meets the user's need. Based on a second text input by the user, the first video can also be found and the second thumbnail displayed; after the user triggers the second thumbnail, the first video can be played from the second point in time included in the second thumbnail, and the picture content played from the second point in time meets the user's need. That is, the first video matches both the first text and the second text, but the point in time at which playback starts differs depending on which thumbnail the user triggers. Therefore, picture content matching the text input by the user can be displayed, which meets the user's need, reduces manual searching operations, and further improves the user experience.
Detailed Description
The technical advantages of the video searching method provided by the application are explained below by comparison with the related art. For ease of understanding, the description uses an example scenario in which the electronic device is a mobile phone and a plurality of pictures and videos are stored in the gallery application of the mobile phone.
First, terms involved in the embodiments of the present application are described. It should be understood that these descriptions are intended to provide a clearer understanding of the embodiments of the application and are not necessarily to be construed as limiting the embodiments.
Video frame: any frame in a video. One frame is a still picture in the video, and consecutive frames form the video.
Video segmentation refers to segmentation obtained by dividing video. In some embodiments, the video may be first de-framed to obtain video frames, and then a video segmentation algorithm may be used to segment the video including the plurality of video frames to obtain video segments. Reference is made to the description of the embodiments below for a specific implementation.
Representative frame: one of the video frames in a video segment that can be used to represent the video segment. Illustratively, the representative frame may be the start frame, the end frame, a random video frame, or the highest-scoring optimal frame of the video segment.
CLIP model: the CLIP (Contrastive Language-Image Pre-training) model is a pre-trained neural network model for matching images and text. In some embodiments, a text encoder (Text Encoder) and an image encoder (Image Encoder) of the CLIP model are trained by contrastive learning, yielding a text encoder that outputs text semantic vectors of text and an image encoder that outputs visual semantic vectors of images or video frames.
Text semantic vector: a vector obtained by inputting text into the text encoder; it can characterize the semantic features of the entire text. Illustratively, the text encoder may employ a Transformer or another model commonly used in natural language processing (NLP), which is not limited by the application. In the embodiment of the application, the search text input by the user can be input into the text encoder to obtain the text semantic vector of the search text.
Visual semantic vector: a vector obtained by inputting an image or a video frame into the image encoder; it can characterize the semantic features of the image or video frame. Illustratively, the image encoder may employ a CNN model or a ViT model, which is not limited by the application. In the embodiment of the application, the representative frame of a video can be input into the image encoder to obtain the visual semantic vector of the representative frame.
Vector similarity: a term describing the degree of similarity between two vectors (e.g., between a text semantic vector and a visual semantic vector). In embodiments of the present application, a video frame that matches a search text may be determined by comparing the similarity between the text semantic vector of the search text and the visual semantic vector. The vector similarity may be calculated by, for example, a cosine similarity formula, but may also be calculated by other methods.
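For reference, one common way to compute this similarity (the cosine form mentioned above) is sketched below.

```python
import numpy as np

def cosine_similarity(text_vec: np.ndarray, visual_vec: np.ndarray) -> float:
    """Cosine similarity between a text semantic vector and a visual semantic vector."""
    return float(np.dot(text_vec, visual_vec) /
                 (np.linalg.norm(text_vec) * np.linalg.norm(visual_vec)))
```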
Entity: a word with a specific meaning in text. Illustratively, an entity may include, but is not limited to, a time, a place, a person name, an organization name, or a proper noun in the text. In some embodiments, entities with specific meanings in text may be identified by named entity recognition (NER) technology, which is not limited by the application.
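As an illustration only, an off-the-shelf NER pipeline such as spaCy can extract such entities from a search text; the model name and the example labels are assumptions, and the application does not mandate any particular tool.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # a general-purpose English pipeline
doc = nlp("videos of a boy standing near the Golden Gate Bridge last summer")
for ent in doc.ents:
    # e.g. "the Golden Gate Bridge" -> FAC, "last summer" -> DATE
    print(ent.text, ent.label_)
```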
Image parameter: indicates the presentation characteristics of an image or video frame. By way of example, image parameters may include jitter, sharpness, and pixel values of a video frame, which is not limited by the application.
Attribute tag: information indicating attributes of a video or video segment. In the embodiment of the application, the attribute tags of a video may include the video capture place, the video capture time, person names, the file name, or classification labels.
In the related art, when the gallery of an electronic device stores a video, it also stores the shooting location or shooting time of the video, keywords in the file name of the video, and the like as attribute tags, and uses these attribute tags as the index of the video. The user can later input simple search text, such as a time, a place, or a keyword from the file name, on the search interface of the gallery; the electronic device then performs search matching between the search text input by the user and the index (attribute tags) of the video to realize video search.
Assume that the electronic device is a mobile phone 300 whose gallery stores video 1, and that the attribute tags "star" and "this year" have been assigned to video 1 in advance. Illustratively, "star" may be a file name manually configured by the user for video 1, and "this year" is the capture time of video 1. As shown in fig. 1a, when the search text input by the user in the search box of the gallery provided by the mobile phone 300 is the keyword "star", the mobile phone 300 can find the video whose attribute tag is "star" and display a search result including video 1.
However, as shown in fig. 1b, if the user inputs the complex search text "boys standing beside trees" in the search box of the gallery, the mobile phone 300 cannot find video 1, and the user still has to look for it manually.
In practical applications, a user may input the complex search text "boys standing beside trees" to describe a desired video based on its displayed picture content. In the related art, however, the electronic device only supports keyword matching against attribute tags, so it easily fails to display the video the user wants, and the user still has to slide the scroll bar of the gallery manually to look for it, which is cumbersome and affects the user's experience with the mobile phone 300.
In order to solve the above-mentioned problems, the embodiments of the present application provide a video searching method, which can be applied to an electronic device, and for convenience of understanding, the composition of the electronic device and its software structure are described.
The application does not limit the type of the electronic device. For example, the electronic device may be a mobile phone, a tablet computer, a desktop computer, a laptop computer, a notebook computer, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, a personal digital assistant (PDA), a wearable electronic device, a smart watch, or the like; the specific form of the electronic device is not particularly limited by the present application.
In this embodiment, as shown in fig. 2a, the electronic device may include a processor 110, an internal memory 120, a camera 130, a display screen 140, an audio module 150, a speaker 150A, and a headset interface 150B.
The processor 110 may include one or more processing units, for example, the processor 110 may include a video codec, and/or a neural Network Processor (NPU), etc.
Processor 110 may also be provided with a memory for storing instructions and data.
The internal memory 120 may be used to store computer-executable program code, which includes instructions. The processor 110 runs the instructions stored in the internal memory 120 to execute various functional applications and data processing of the electronic device.
The internal memory 120 may include a program storage area and a data storage area. The program storage area may store the operating system and application programs required for at least one function (such as a sound playing function and an image playing function during video playback). The data storage area may store data created during use of the electronic device (such as video data).
In some embodiments, the internal memory 120 stores instructions for performing a video search method. The processor 110 may perform a search for video by executing instructions stored in the internal memory 120.
In some embodiments, the electronic device performs video search among videos stored in the gallery application, which may be videos shot by the user using the electronic device. The electronic device may implement the shooting function through an ISP, the camera 130, a video codec, a GPU, the display screen 140, an application processor, and the like.
The ISP is used to process the data fed back by the camera 130. In some embodiments, the ISP may be provided in the camera 130. The camera 130 is used to capture still images or video. Video codecs are used to compress or decompress digital video. Thus, the electronic device may play or record video in a variety of encoding formats, such as moving picture experts group (moving picture experts group, MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU can implement intelligent cognition and other applications of the electronic device, such as image recognition, face recognition, speech recognition, and text understanding. In some embodiments, the NPU may perform text understanding of the search text entered by the user on a search interface provided by the electronic device.
The electronic device implements display functions through the GPU, the display screen 140, and the application processor, etc. The GPU is a microprocessor for image processing, and is connected to the display screen 140 and the application processor.
A series of graphical user interfaces (GUIs) may be displayed on the display screen 140 of the electronic device; these GUIs are the home screen of the electronic device. In general, the number of controls that the display screen 140 of the electronic device can display is limited, and the user can interact with a control by direct manipulation to read or edit information of the corresponding application.
In some embodiments, the electronic device may include a gallery application; the display screen 140 of the electronic device may display an icon corresponding to the gallery application, and after the user triggers the icon, the display screen 140 may display a search interface of the gallery application, the search interface including a search control. The user can edit the search control to input search text and perform a triggering operation, so that the electronic device searches the gallery application based on the search text input by the user and displays the search results to the user through the display screen 140.
In some embodiments, if the user triggers the video stored by the gallery application, the electronic device may implement audio functions when the video is played through the audio module 150, speaker 150A, headphone interface 150B, and application processor, etc.
The audio module 150 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The electronic device may listen to audio as the video is played through speaker 150A. The earphone interface 150B is used to connect a wired earphone. The electronic device may listen to audio while video is playing through a wired headset connected to headset interface 150B.
It is to be understood that the configuration illustrated in this embodiment does not constitute a specific limitation on the electronic apparatus. In other embodiments, the electronic device may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
In addition, an operating system runs on these components, such as the iOS operating system developed by Apple, the open-source Android operating system developed by Google, or the Windows operating system developed by Microsoft. Applications can be installed and run on the operating system.
The operating system of the electronic device may employ a layered architecture, an event driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiment of the application, an Android system with a layered architecture is taken as an example, and the software structure of the electronic equipment is illustrated.
Fig. 2b is a software architecture block diagram of an electronic device according to an embodiment of the application.
The layered architecture divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers: from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
The application layer may include a series of application packages. As shown in fig. 2b, the application package may include applications such as gallery service modules, cameras, music, video playback applications, and the like.
In some embodiments, the gallery service module stores pictures/videos obtained through user operations such as shooting, downloading, screen capturing, or screen recording; it also stores information such as the representative frames of video segments and the visual semantic vectors of those representative frames, receives the search text input by the user so that the electronic device can perform video search matching based on it, and displays the pictures/videos obtained through the above operations.
In some embodiments, the video playback application may be a native video playback application of the electronic device. In some embodiments, the video playback application may also be a third party video playback application.
The application framework layer provides an application programming interface (application programming interface, API) and programming framework for the application of the application layer. The application framework layer includes a number of predefined functions. As shown in FIG. 2b, the application framework layer may include a window manager, a content provider, a resource manager, a view system, and the like.
The window manager is used for managing window programs. The content provider is used to store and retrieve data and make such data accessible to applications. The data may include video, images, etc. The resource manager provides various resources for the application program, such as localization strings, icons, pictures, layout files, video files, and the like. The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications.
The Android runtime includes a core library and a virtual machine. The Android runtime is responsible for scheduling and management of the Android system. The core library comprises two parts: one part is the function libraries that the Java language needs to call, and the other part is the Android core library.
The system library may include surface manager (surface manager), media library (Media Libraries), three-dimensional graphics processing library (e.g., openGL ES), two-dimensional graphics engine (e.g., SGL), etc. functional modules.
The surface manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications. Media libraries support a variety of commonly used audio, video format playback and recording, still image files, and the like. The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like. The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The inner core layer at least comprises a display driver, a camera driver, an audio driver and a sensor driver.
In addition, the electronic device can also comprise a search module, a multi-mode understanding module, a natural language understanding module and other functional modules. The modules may all be located in the same layer of the electronic device, may be located in different layers of the electronic device, or may be located in multiple layers of the electronic device at the same time, so as to implement functions thereof through software interfaces between the layers.
In some embodiments, the search module is configured to construct the indexes corresponding to videos based on the information stored in the gallery service module; it also performs vector recall among the plurality of indexes based on the text semantic vector corresponding to the search text, performs entity recall among the plurality of indexes based on the entities in the search text, and ranks the vector recall results and the entity recall results to obtain the search results.
In some embodiments, the multi-mode understanding module is configured to perform frame splitting processing on the video to obtain video frames, perform segmentation processing on the video by using a video segmentation algorithm to obtain video segments, and then determine a representative frame of the video segments and a visual semantic vector of the representative frame.
In some embodiments, the natural language understanding module is used to identify entities in search text entered by a user.
The video searching method provided by the embodiment of the application can be realized based on interaction among the four modules, namely the gallery service module, the searching module, the multi-mode understanding module and the natural language understanding module, and the specific implementation can be seen in fig. 5b and the detailed description in the following embodiments.
Although the Android system is taken as an example for explanation, the basic principle of the embodiment of the application is also applicable to electronic devices based on iOS, windows and other operating systems.
In order to make the technical scheme of the present application more clearly understood by those skilled in the art, the application scenario of the technical scheme of the present application is first described below.
The video searching method provided by the embodiment of the application can be implemented in a gallery searching scene of the electronic equipment.
Referring to fig. 3a, the video searching method provided by the embodiment of the present application is described by taking a mobile phone 300 as an example of the electronic device. In this scenario, the gallery of the mobile phone 300 may include multiple pictures and multiple videos. In some embodiments, the pictures/videos may be shot by the user with the mobile phone 300 and stored in the gallery, may be downloaded by the user through other application platforms and stored in the gallery, or may be sent to the mobile phone 300 by other electronic devices and then saved.
In the video searching method, the mobile phone 300 can build an index for the videos stored in the gallery while charging or with the screen off. In the process of building the index, the mobile phone 300 may perform frame-splitting processing on a video to obtain video frames, perform segmentation processing on the video based on the plurality of video frames by using a video segmentation algorithm to obtain video segments, determine the representative frame of each video segment and the visual semantic vector of that representative frame, and then build the index of each video segment based on the visual semantic vector of its representative frame. The specific implementation can be seen in fig. 5b and in the detailed description of the embodiments below. It should be appreciated that the pictures stored in the gallery do not need frame splitting, and the index of a picture may be constructed with reference to the way the index is built from the representative frames of the video segments.
Assume that a plurality of videos are captured by the mobile phone 300 and stored in the mobile phone 300 during a weekend trip by the user. The user wishes to clip the video photographed during the tour at leisure time, and the user can search through the search function of the gallery of the mobile phone 300 at this time so that the mobile phone 300 displays the video photographed during the tour.
As shown in fig. 3a, the handset 300 displays a main interface 310, where the main interface 310 includes application icons of a plurality of application programs, such as application icon 311 of a gallery. The user triggers the application icon 311 of the gallery, the mobile phone 300 starts the gallery in response to the triggering operation of the user, and displays the album display interface 320 of the gallery, where the album display interface 320 includes a search box 321 and a plurality of albums. For example, the plurality of albums may include the "all photos" album, the "cameras" album, the "my collection" album, the "screen shots" album, the "my collection" album, the "self-created" album, the "video editing" album, and so forth shown in FIG. 3 a.
In one example, the user may trigger the search box 321 included in the album display interface 320 and input the search text "men standing outdoors" in the search box 321. The mobile phone 300 converts "men standing outdoors" input by the user into a text semantic vector and performs search matching between the text semantic vector and the indexes of videos and pictures, where the index of a video includes the visual semantic vector of a representative frame and the index of a picture includes the visual semantic vector of the picture. The vector similarity between the text semantic vector and each visual semantic vector can thus be calculated, and the pictures/videos whose visual semantic vectors have a vector similarity exceeding the vector similarity threshold are taken as the first search results. The mobile phone 300 may then display a first search result display interface 330 of the gallery, which displays the first search results corresponding to "men standing outdoors" and may include both pictures and videos.
The index of a video in the first search results includes the visual semantic vector of a representative frame, which, as described above, can be used to represent the corresponding video segment; the visual semantics of the representative frame thus represent the visual semantics of the plurality of video frames in that segment, that is, the visual semantic vector of the representative frame is sufficiently associated with the video segment. When a user inputs a complex search text based on the picture content displayed by a video, the application performs search matching between the text semantic vector corresponding to the search text and the visual semantic vector of the representative frame, which realizes fusion interaction between the search text and the picture content displayed by the video segment, improves the accuracy of video search, and further improves the user experience.
In some embodiments, the first search results displayed by the handset 300 may be ranked in order of high to low relevance to the search text entered by the user.
As shown in fig. 3a, the first search result display interface 330 displayed by the mobile phone 300 may include a first display area 331, a second display area 332, and a third display area 333. The first display area 331 presents all the first search results, including pictures and videos. The second display area 332 shows the best-matching video search result, displays the number "4" of all video search results, and shows an enlarged view of the point in time corresponding to that best-matching video result. The third display area 333 shows the best-matching picture search result and displays the number "6" of all picture search results.
As shown in fig. 3a, the video search results on the first search result display interface 330 display thumbnails and points in time. In some embodiments, the thumbnail corresponds to a video frame of a video segment whose representative frame's visual semantic vector matches the text semantic vector corresponding to the search text input by the user, i.e., the vector similarity between the two exceeds the vector similarity threshold. The point in time of the video may be a point in time of the video segment to which the video frame corresponding to the thumbnail belongs; for example, it may be the point in time of the video frame corresponding to the thumbnail, or the point in time of another video frame in the video segment.
Taking the video segment a of the video search result "video a" in the first display area 331 as an example, the "video a" is divided into a plurality of video segments, and the thumbnail displayed by the "video a" corresponds to the start frame of the video segment a, and the vector similarity between the visual semantic vector of the representative frame of the video segment a and the text semantic vector of the "men standing outdoors" exceeds the vector similarity threshold.
The second display area 332 includes two points in time: the first point in time, "02:18", indicates that the start frame of video segment a is displayed at "02:18", and the second point in time indicates that the total duration of "video a" is "08:32". As shown in fig. 3a, if "video a" is the video the user wants, the user may trigger "video a"; in response to this triggering operation, the mobile phone 300 displays a video playing interface 340 in which it adjusts the playback progress of "video a" to "02:18" and starts playing, that is, playback starts from the start frame of video segment a in "video a".
It should be noted that, the thumbnail and the time point shown in the above "video a" are only examples.
For example, the thumbnail may correspond to a representative frame of video segment a in "video a", which may be a start frame, an end frame, an intermediate frame, an optimal frame, or any one of the video frames of video segment a, etc. The start frame of video segment a may also be used as a thumbnail, as the application is not limited in this regard.
For example, the first point in time may be the same as the point in time of the video frame corresponding to the thumbnail: if the thumbnail corresponds to the representative frame of video segment a, the first point in time may be the point in time of that representative frame. The first point in time may also differ from the point in time of the frame corresponding to the thumbnail: if the thumbnail is the representative frame of video segment a and that representative frame is the middle frame of video segment a, the first point in time may instead be the point in time of the start frame of video segment a.
It should be noted that, the above manner of displaying the search box 321 through the album display interface 320 for the user to input the search text is only an example, and the search box may also be displayed through other interfaces of the gallery.
In some embodiments, as shown in fig. 3a, album display interface 320 has a "photo" control, a "time point" control, and an "authoring" control at the bottom for entering the photo display interface, the time point display interface, and the authoring display interface, respectively. The photo display interface, the time point display interface and the creation display interface all comprise search boxes, the search boxes of the three interfaces and the search box 321 of the album display interface 320 support the same search function, all support searching in pictures and videos included in a gallery based on search text input by a user, and after the searching is completed, the first search result display interface 330 is also displayed, and the included first search results are the same.
It will be appreciated that if the vector similarity between the visual semantic vector representing a frame of a video segment and the text semantic vector of a certain text exceeds a certain threshold, the video segment may be searched when the user enters the text.
In some embodiments, a video includes multiple video segments, and search text entered by a user may match an index of one or more video segments of the video.
For example, the search text "men standing outdoors" entered by the user may match video segment a, video segment b, and video segment c in "video A". Then video segment a, video segment b, and video segment c of "video A" may be displayed simultaneously. The thumbnail shown for video segment b of "video A" may be the representative frame of video segment b, and the thumbnail shown for video segment c of "video A" may be the start frame of video segment c. As shown in the first display area 331 of the first search result display interface 330 in fig. 3a, video segment a, video segment b, and video segment c of "video A" can be found for the search text "men standing outdoors" entered by the user. Video segment a, video segment b, and video segment c are ordered according to their degree of matching with the search text. The vector similarity between the visual semantic vector of the representative frame of each of the three video segments and the text semantic vector of "men standing outdoors" exceeds the vector similarity threshold.
In other embodiments, different search texts entered by the user may all find the same video segment. This is described next in connection with fig. 3b.
As shown in fig. 3b (1), the first search result display interface 350 may include a first display area 351 and a second display area 352. Wherein all video search results are presented in the first display region 351. All of the picture search results are presented in the second display area 352, and the photographing time and photographing place of the image may be displayed to the right of each picture search result.
Assuming that the user inputs "a boy standing on the river side" in the input box, the first display region 351 may display the video search results matching it, including video segment B of "video B". The first display area 351 includes a thumbnail of the representative frame of video segment B and two time points: the first time point is the time "05:48" corresponding to the start frame of video segment B, and the second time point represents the total duration "14:18" of "video B". If the user triggers the thumbnail of the representative frame of video segment B, the mobile phone 300 can display the video playing interface 360, in which the mobile phone 300 adjusts the playback progress of "video B" to "05:48" and starts playing, that is, the mobile phone 300 starts displaying from the start frame of video segment B in "video B".
Alternatively, the first time point may also be a time point of a representative frame of the video segment B of the video B, which is not limited herein.
(1) in fig. 3b is merely an example, and the first display area 351 may display other information of "video B", such as the shooting location, shooting time, person name, person relationship, and the like corresponding to "video B". The second display area 352 may also display other information of the picture search results, such as the person names, person relationships, and the like corresponding to the images.
It should be noted that, after the user triggers the thumbnail of the representative frame of video segment B, the mobile phone 300 adjusts the playback progress of "video B" to the displayed first time point and starts playing. It should be understood that, when the first time point is the time point of the representative frame of video segment B of "video B", if the middle frame of video segment B is the representative frame, the playback progress may be adjusted to the time point corresponding to the middle frame to start playing; the time point corresponding to the middle frame is "07:09", which is later than the time "05:48" corresponding to the start frame.
As shown in (2) of fig. 3b, the first search result display interface 370 may include a first display area 371 and a second display area 372. Wherein all picture search results are presented in the first display area 371. All video search results are presented in the second display area 372.
Assuming the user enters "a boy standing near the bridge" in the input box, the second display area 372 may display the video search results that match it, including video segment c of "video B" and video segment B of "video B". Considering the two different search texts "a boy standing on the river side" and "a boy standing near the bridge" entered by the user, as shown in (1) of fig. 3b, both can find video segment B of "video B". This indicates that the vector similarity between the visual semantic vector of the representative frame of video segment B of "video B" and the text semantic vector of "a boy standing on the river side", as well as the text semantic vector of "a boy standing near the bridge", exceeds the vector similarity threshold. The display form of video segment B of "video B" can be referred to the description of (1) in fig. 3b, and will not be repeated here.
The second display area 372 includes a thumbnail of the representative frame of video segment c and two time points: the first time point is the time "08:32" corresponding to the representative frame of video segment c, and the second time point represents the total duration "14:18" of "video B". If the user triggers the thumbnail of the representative frame of video segment c, the mobile phone 300 can display the video playing interface 380, in which the mobile phone 300 adjusts the playback progress of "video B" to "07:26" and starts playing, that is, the mobile phone 300 starts playing from the start frame of video segment c in "video B". The display forms of the video search results and the picture search results in (2) of fig. 3b can be referred to the descriptions of fig. 3a and (1) of fig. 3b, and will not be repeated here.
It should be appreciated that the recommendation information may be displayed within a search box of a gallery application, which may facilitate a user to obtain relevant information for pictures/videos stored in the gallery.
In some embodiments, the recommendation information displayed within the search box of the gallery is a sentence with natural semantics. The user can input text in the search box according to the content and format of the recommendation information, that is, adopt a sentence with natural semantics as the search text entered in the search box. The electronic device searches the pictures/videos in the gallery according to the search text with natural semantics, so the pictures/videos required by the user can be searched more accurately.
In one implementation, an electronic device may obtain an attribute tag for a picture/video. In one example, the attribute tags include at least one of time, place, category tag, event.
In some embodiments, the "time" may be obtained from the time of the picture/video capture. By way of example, "time" may include "2023 11, 9, tuesday", "2021, 7, 2, friday", and so forth.
The "location" may be acquired from the shooting location of the picture/video. For example, can be obtained from GPS positioning information at the time of video capture. Illustratively, a "place" may include a city, a sight, etc.
The "class labels" and "events" may be obtained by semantic analysis of the pictures/video. In one example, an electronic device may employ a computer vision service to semantically analyze video frames in a picture or video, generating content of "class labels" and "events" from its semantics.
By way of example, a "class label" may include characters, plants or animals, etc., but also buildings, natural scenes, etc., and may include street art, musical instruments, art exhibitions, athletic contests, birthdays, etc. For example, the persona may include a persona name, a professional name, a persona age group name, etc.
By way of example, an "event" may include a game, sport, tour, etc.
It should be appreciated that for a picture/video, its corresponding attribute tags may include one or more of time, place, category tag, and event. For example, for a picture downloaded from a network, the electronic device does not acquire its shooting time, so its attribute tags do not include "time", or the content of the "time" tag is empty.
In one implementation, one or more attribute tags of the picture/video are spliced according to a preset rule, and at least one piece of combined information can be generated. Illustratively, the electronic device concatenates the four attributes "time", "place", "category label" and "event" of the video, and may generate a piece of combined information. Also for example, the electronic device may also generate a piece of combined information from an attribute tag of the picture/video. For example, the attribute tag is "time".
In one implementation, the fixed splice word is spliced with a piece of combination information, so that recommended content corresponding to the picture/video can be generated. Illustratively, the fixed splice word may include "try-search", "on", "shot", "video", and the like.
Table 1 shows some examples of recommended content generated from different numbers of attribute tags. When the recommended content is generated, the content of an attribute tag may be mapped correspondingly. For example, the specific time "Monday, October 10, 2023" may be mapped to "today", "the day before yesterday", "August", "last year", "this year", "National Day holiday", or the like.
TABLE 1
As shown in table 1, the attribute tags of one video may generate a plurality of spliced contents according to the splicing combination manner shown in table 1. For example, if the attribute tags of the video include "category tags", "time", "place", 5 different pieces of spliced contents may be generated in a spliced combination manner of sequence numbers 1 to 5 in table 1, respectively.
In one implementation, the electronic device may determine any one of the plurality of spliced content as the recommended content corresponding to the picture/video.
In another implementation manner, the above splicing combination manners each correspond to a priority, and the priorities of the splicing combination manners decrease in order from front to back in table 1. The electronic device generates the recommended content corresponding to the picture/video according to the highest-priority splicing combination manner among those supported by the picture/video. For example, if the attribute tags of the video include "category tag", "time" and "place", the fixed splice word is spliced with "category tag", "time" and "place" according to the splicing combination manner of sequence number 1 in table 1, so as to generate the recommended content corresponding to the picture/video.
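To make the priority-based splicing concrete, the following sketch shows one possible way to pick the highest-priority splicing combination supported by a picture/video's attribute tags and fill in the fixed splice words. It is a minimal illustration only: the template strings, their ordering, and the tag names are hypothetical stand-ins, since the exact combinations of table 1 are not reproduced here.

```python
# Hypothetical splicing templates, listed from highest to lowest priority.
# Each entry names the attribute tags it requires (stand-ins for table 1).
from typing import Optional

TEMPLATES = [
    (("category", "time", "place"), "Try searching for the video of {category} shot at {place} {time}"),
    (("category", "time"),          "Try searching for the video of {category} shot {time}"),
    (("category", "place"),         "Try searching for the video of {category} shot at {place}"),
    (("time", "place"),             "Try searching for the video shot at {place} {time}"),
    (("time",),                     "Try searching for the video shot {time}"),
]

def build_recommendation(tags: dict) -> Optional[str]:
    """Pick the highest-priority template whose required attribute tags are all present."""
    for required, template in TEMPLATES:
        if all(tags.get(key) for key in required):
            return template.format(**{key: tags[key] for key in required})
    return None  # no supported splicing combination

# Example: a video whose attribute tags include a category tag, a mapped time, and a place.
print(build_recommendation({"category": "Zhang San", "time": "last year", "place": "the riverside"}))
# -> "Try searching for the video of Zhang San shot at the riverside last year"
```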
In the video searching method provided by the embodiment of the application, the recommendation information in the search box can be generated according to the recommended content corresponding to one picture/video in the gallery.
In one implementation, the electronic device may select the recommended content corresponding to a first picture/video in the gallery to generate the recommendation information within the search box. In one example, the first picture/video may be one acquired within a preset time period, for example, a period more than one month before today. In one example, the electronic device updates the first picture/video once a day, and the selected first picture/video is not repeated within a preset time period (e.g., a week).
For example, on the first day, the recommendation information displayed in the search box of the gallery of the electronic device is "Try searching for the video shot at the riverside three days ago"; on the second day, the recommendation information displayed in the search box is "Try searching for pictures from last August". The first picture/video selected by the electronic device every day is not repeated within the preset time period (such as a week), and correspondingly, the recommendation information displayed in the search box of the gallery is different every day.
Based on the above example of the scenario shown in fig. 3a, taking the electronic device as the mobile phone 300 as an example, assume that the user shoots a plurality of videos through the mobile phone 300 during a weekend trip and stores them in the gallery of the mobile phone 300. The mobile phone can generate the recommendation information in the search box according to the recommended content corresponding to one of these videos.
As shown in fig. 3c, the mobile phone 300 displays an album display interface 390, the album display interface 390 including a search box 321. In one example, the recommendation information "Try searching for Zhang San's video at the riverside from last year" is displayed in the search box 321. The recommendation information serves as an example of the search text to be entered by the user. The user can input the search text in the search box according to the content and format of the recommendation information to perform a search.
As shown in fig. 3c, after the user clicks the search box 321, the mobile phone 300 displays a search interface 301, the search interface 301 including the search box 321 in which the user can input text. When the mobile phone 300 detects the user's operation of inputting text in the search box 321, it can search the gallery according to the search text in the search box 321.
For example, the recommendation information "Try searching for Zhang San's video at the riverside from last year" is displayed in the search box 321. "Zhang San's video at the riverside from last year" is a sentence with natural semantics; compared with single labels such as "last year", "riverside" or "Zhang San", it makes it easier to find the video that the user actually needs. For example, 7230 pictures/videos in the gallery correspond to the label "last year", 1985 pictures/videos correspond to the label "riverside", and 1985 pictures/videos correspond to the label "Zhang San"; searching by a single label therefore returns a large number of pictures/videos. Searching by "Zhang San's video at the riverside from last year" can significantly reduce the number of pictures/videos returned. Therefore, displaying a sentence with natural semantics as the recommendation information provides the user with a convenient and fast search experience.
Also, for example, the video searching method provided by the embodiment of the application can be implemented in a global searching scene of the electronic device.
Referring to fig. 4a, the video searching method provided by the embodiment of the present application is described below, taking the mobile phone 300 as an example of the electronic device. In this scenario, the gallery of the mobile phone 300 may include a plurality of pictures and a plurality of videos. For the sources of the pictures/videos, reference may be made to the above examples, which are not described in detail here.
Referring to fig. 4a, based on the example of the scenario shown in fig. 3a, it is still assumed that the user has shot a plurality of videos through the mobile phone 300 during a weekend trip and stored them in the gallery of the mobile phone 300. As shown in fig. 4a, the mobile phone 300 displays a main interface 410; the user can slide the main interface 410 to the right, and the mobile phone 300 then displays a negative one-screen interface 420. The negative one-screen interface 420 provides a global search function, which can provide the user with rich online search services as well as search services for the local resources of the mobile phone (i.e., local files stored in the mobile phone 300).
The negative one-screen interface 420 includes a search box 421. The user may input a search text "men standing outdoors" in the search box 421, and the mobile phone 300 may perform an online search based on the search text input by the user, and perform a local resource search in a local file included in the mobile phone 300. When the search is completed, the mobile phone 300 displays a second search result display interface 430, and the second search result display interface 430 may display a second search result corresponding to the inputted "men standing outdoors".
As shown in fig. 4a, the second search result display interface 430 may include a first display area 431, a second display area 432, and a third display area 433. In some embodiments, the first display area 431 includes second search results of an online search, the second display area 432 includes second search results of an in-application search, and the third display area 433 includes second search results of a local file.
In some embodiments, the second search results searched in the application include the second search results searched in a gallery application. Illustratively, the second display area 432 displays content that is displayed in a partial area of the first search result display interface 330 of FIG. 3 a. The second display area 432 includes second search results that are pictures and videos stored in a gallery. Herein, the thumbnail and the time point shown in the video may be referred to the above examples, and are not described herein.
Note that, the content displayed in the second display area 432 is only an example, and the content displayed in other areas of the first search result display interface 330 in fig. 3a may also be displayed.
It should be noted that the above manner of displaying the search box 421 through the negative one-screen interface 420 for the user to input the search text for performing the global search is merely an example. In some embodiments, the user may also pull down the main interface of the mobile phone 300, and the mobile phone 300 displays a main menu interface, where the main menu interface includes a search box, and may also provide a global search function, and the displayed search result is the same as the second search result.
Also exemplary, the video searching method provided by the embodiment of the application can be implemented in a searching scene of a video playing application of the electronic device.
In this scenario, the video searching method is implemented through interaction between the electronic device and a cloud server.
Still taking the electronic device as the mobile phone 300 as an example, in this scenario, the mobile phone 300 includes a video playing application.
As shown in fig. 4b, the mobile phone 300 displays a main interface 440, and the main interface 440 includes application icons of a plurality of application programs, such as an application icon 441 of a video playing application. The user triggers the application icon 441 of the video playing application, and the mobile phone 300 displays a first interface 450 of the video playing application, the first interface 450 comprising a search box 451 and a search control 452. The user may enter the search text "documentary about city B" in the search box 451 and trigger the search control 452. In response to the user's triggering operation on the search control 452, the mobile phone 300 can search among a plurality of videos on the cloud server based on the search text input by the user. After the search is completed, the mobile phone 300 displays a third search result display interface 460 of the video playing application. The third search result display interface 460 includes a plurality of third search results, which are videos; for the thumbnails and the time points displayed by the third search results, reference may be made to the above examples, which are not repeated here.
In some embodiments, the third search result display interface 460 may also display the release time of a third search result, that is, the time when the third search result was stored in the cloud server. As shown in fig. 4b, taking the "video C" included in the third search result display interface 460 as an example, the third search result display interface 460 also displays the release time "2018-05-03" of "video C", which indicates that "video C" was stored in the cloud server on May 3, 2018. Illustratively, the user may have uploaded "video C" to the video playing application on May 3, 2018, causing the cloud server to store "video C".
In some embodiments, as shown in fig. 4b, the third search result display interface 460 includes "video C" with a first time point of "05:48". If "video C" is the video required by the user, the user can trigger "video C", and the mobile phone 300 can interact with the cloud server to play "video C" from "05:48".
The video searching method provided by the application is described in detail below with reference to fig. 5 a. As shown in fig. 5a, the video search method includes an index construction stage and a search stage.
In the index building stage, in some embodiments, an index may be built for each video frame of the video.
In some embodiments, if the solution is implemented on a terminal device and an index is built for each video frame of the video, matching needs to be performed frame by frame in the search stage, which increases the time delay required for searching and degrades the user experience. Alternatively, in some embodiments, the video may be segmented and the video index built in units of segments.
Illustratively, the video is first de-framed to obtain a plurality of video frames. The video is then segmented based on these video frames using a video segmentation algorithm, yielding a plurality of video segments such as video segment 1 and video segment 2. Next, a representative frame is selected from the video frames of each video segment: the video frames in the segment are each scored, and the video frame with the highest score is determined as the representative frame of the segment. The representative frame is input into the image encoder of the CLIP model, which outputs the visual semantic vector of the representative frame. An index is constructed based on the visual semantic vector corresponding to the representative frame, and an inverted index library is obtained from the constructed indexes.
It should be noted that the above selection manner of the representative frame is only an example; the start frame, an intermediate frame, the end frame, or a random video frame of the video segment may also be selected as the representative frame, as described in the following embodiments. In the search stage, the user inputs a search text on the search interface. The search text is input into the text encoder of the CLIP model, which outputs the text semantic vector corresponding to the search text, and a vector recall is performed from the inverted index library based on this text semantic vector. The search text is also sent to the natural language understanding module, which performs entity recognition on the search text to obtain the entities it contains, and an entity recall is performed from the inverted index library based on these entities. The inverted index library returns the vector recall results and the entity recall results, the recall results are sorted to obtain the search results, and the search results are displayed on the search interface.
In some embodiments, the image encoder of the CLIP model and the text encoder of the CLIP model may be trained based on a contrast learning training approach.
Taking the mobile phone 300 as an example of the electronic device, the video searching method provided in the embodiment of the present application is described in detail below with reference to fig. 5b, in combination with the gallery service module, the search module, the multi-modal understanding module and the natural language understanding module of the system gallery shown in fig. 2b.
As shown in fig. 5b, the video searching method provided by the embodiment of the present application may be divided into two stages, namely, an index construction stage and a searching stage.
First, the steps involved in the index build phase are described in detail in connection with FIG. 5 b.
S501, a gallery service module 510 receives a new operation or a modification operation of a user on a video.
The new operation on a video refers to an operation by which the user stores the video in the gallery application. For example, the user's new operation on the video may be an operation in which the user shoots the video with the mobile phone 300, downloads the video, or records the screen of the mobile phone 300. The modification operation on a video refers to an operation by which the user modifies a video already stored in the gallery application. For example, the user's modification operation on the video may be an operation of cropping or splicing the video, or adding special effects or subtitles to it.
In some embodiments, the gallery service module 510 receiving the user's new or modification operation on the video includes the gallery service module 510 receiving the user's new or modification operation on the attribute tags of the video.
By way of example, the attribute tags of the video may include a video acquisition location (e.g., a location where the video was captured, a download source of the video, etc.), a video acquisition time (e.g., a capture time, a download time, or a recording screen time), a name or category tag for a person newly added or modified by the user for the video, an event, and so forth. The classification label may be used to indicate the kind of object presented by the video, and may be, for example, a person, a plant, an animal, a building, or a natural scene. Events may be used to indicate what the video exposed object does. By way of example, the event may be a game, a sport, or the like.
In some embodiments, the classification labels may be manually configured by a user or may be automatically classified by the electronic device.
S502, a gallery service module 510 stores video.
The gallery service module 510 stores the video to the handset 300 in response to a user's add-on or modify operation to the video. Illustratively, the video may be stored in a local file of the handset 300, which the user may view through a variety of approaches, such as a local folder of the handset 300, a gallery application, and so forth. Also, for example, the video may be stored to the cloud for backup under the authorized operation of the user, so as to reduce the memory pressure of the mobile phone 300.
In some embodiments, gallery service module 510 may store the attribute tags of the video in response to a user's new or modified operation on the attribute tags of the video.
S503, the gallery service module 510 invokes the multimodal understanding module 530 to determine a representative frame of the video.
It will be appreciated that a video typically comprises a plurality of video frames. If a corresponding visual semantic vector were determined and stored as an index for each video frame, one video would correspond to a large number of indexes, wasting a large amount of computing resources and storage space. Meanwhile, noise information may exist among a large number of indexes, making search matching time-consuming and easily affecting the video search results.
Therefore, in the embodiment of the present application, the multi-modal understanding module 530 is used to segment the video and determine a corresponding representative frame from each video segment, so that in subsequent steps a visual semantic vector is determined only for each representative frame and stored as an index, which greatly reduces the number of indexes corresponding to the video. In this way, the representative frames corresponding to the video segments can still fully represent the video semantics of the video, while saving cost.
In some embodiments, gallery service module 510 may invoke the computer vision service provided by multimodal understanding module 530 to determine representative frames of the video. The computer vision service refers to the multi-modal understanding module 530 performing the video semantic understanding on the video, and the multi-modal understanding module 530 determines the representative frame of the video.
Video semantic understanding refers to enabling a mobile phone to understand what is expressed in the content shown in a video, such as understanding information about the type, number, location, relationship between objects in the video, and the like.
In some embodiments, computer vision services provide services for de-framing video, segmenting video using video segmentation algorithms, and determining representative frames of video segments. The determination of the representative frames of the video can be subdivided mainly into the following steps 1-3:
step 1, the multimodal understanding module 530 uses the computer vision service to frame the video.
It will be appreciated that the video is made up of a plurality of video frames, each video frame being a still picture in the video, i.e. each video frame may be an image. The de-framing process refers to the multi-modal understanding module 530 utilizing computer vision services to break up video into individual video frames.
In some embodiments, during the de-framing process of the video by the multimodal understanding module 530, each resulting video frame may be marked with a frame identification and a point in time. One frame identification is used to uniquely tag one video frame, and different video frames can be distinguished based on the frame identification. The time point refers to the time when the video frame appears in the video.
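As a concrete illustration of the de-framing step, the sketch below extracts every video frame together with a frame identification and a time point. It assumes OpenCV is available; the patent does not name a specific library, and the function name and path here are illustrative.

```python
import cv2  # pip install opencv-python

def deframe(video_path: str):
    """Yield (frame_id, time_point_ms, frame_image) for every frame in the video."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if the frame rate is unavailable
    frame_id = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        time_point_ms = frame_id * 1000.0 / fps  # time at which this frame appears in the video
        yield frame_id, time_point_ms, frame
        frame_id += 1
    capture.release()

# Example usage (hypothetical path):
# for fid, ts, img in deframe("/sdcard/DCIM/video_D.mp4"):
#     print(fid, ts, img.shape)
```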
In some embodiments, the category labels for each video frame may be identified during the de-framing process of the video with the computer vision service by the multimodal understanding module 530. The class label may be used to indicate the category to which the video frame presentation object belongs. Illustratively, the category labels may be characters, plants, animals, or the like presented in the video frames. Also by way of example, the category labels may be buildings, natural scenes, etc. presented in the video frames.
The number of classification tags for the video frame is not limited. For example, as shown in fig. 3a, the category labels of the thumbnail images of "video a" included in the first search result display interface 330 may include "boy", "star", "sky", and "tree", etc.
Step2, the multi-modal understanding module 530 segments the video using the computer vision service.
Segmentation processing refers to dividing a video into a plurality of video segments using a video segmentation algorithm provided by a computer vision service.
It will be appreciated that when a video is played, the content presented varies continuously as the video frames are played in sequence, but to different extents. Thus, similar video frames (i.e., those with a small degree of variation) can be grouped into one video segment.
In some embodiments, the degree of change of one video frame and its adjacent video frames can be measured based on the classification label of the video frame and the image parameters of the video frame, so as to segment the video. Illustratively, the image parameters of the video frame may include jitter, sharpness, pixel values, etc. of the video frame, which the present application is not limited to.
The jitter of the video frame refers to the phenomenon that the content displayed by the video frame is jittered or rocked in the video playing process. For example, when a user holds the mobile phone 300 to take a photograph, a situation in which shake is noticeable may occur when another scene is desired to be photographed by the mobile phone 300. The definition of a video frame refers to the definition of each detail shadow and its boundary in the video frame. The pixels of the video frame may represent the brightness of the video frame.
It can be understood that, compared with its adjacent video frames, the jitter degree of a video frame, the degree of change of its other image parameters, and the degree of change of its classification labels are positively correlated with the degree of change of the content displayed by the video; that is, the larger the jitter value and the more obvious the changes in the image parameters and classification labels, the more obvious the change in the content displayed by the video.
In some embodiments, the video segmentation algorithm may be expressed by the following formula 1: the segmentation score of a video frame is calculated based on the classification labels and the image parameters of the video frame in combination with formula 1, and whether the video frame is determined as the start frame or the end frame of a video segment is decided according to its segmentation score. Formula 1 is shown below:
y = α × frameA + β × frameB + γ × frameY + δ × frameT
Where y represents the segmentation score of the video frame, frameA represents the jitter score of the video frame, frameB represents the sharpness change score of the video frame, frameY represents the label change score of the video frame, frameT represents the pixel change score of the video frame, and α, β, γ, and δ represent the coefficients of frameA、frameB、frameY and frameT, respectively. In some embodiments, α, β, γ, and δ may be values that are manually preset according to the degree of influence of the jitter, sharpness, label, and pixel variations of the video frame on the segmentation score of the video frame.
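A direct transcription of formula 1 is shown below. The coefficient values of α, β, γ and δ are illustrative placeholders, since the patent states only that they are preset manually according to the influence of each factor.

```python
def segmentation_score(frame_a: float, frame_b: float, frame_y: float, frame_t: float,
                       alpha: float = 0.3, beta: float = 0.3,
                       gamma: float = 0.2, delta: float = 0.2) -> float:
    """Formula 1: y = alpha*frameA + beta*frameB + gamma*frameY + delta*frameT."""
    return alpha * frame_a + beta * frame_b + gamma * frame_y + delta * frame_t

# Example: a frame with noticeable jitter and an obvious label change.
y = segmentation_score(frame_a=0.8, frame_b=0.4, frame_y=0.9, frame_t=0.5)
print(round(y, 2))  # 0.64
```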
It should be noted that the above process of determining the segmentation score of a video frame based on the classification tag and the plurality of image parameters is merely an example. The segmentation score for a video frame may also be determined based on the classification tag and one or more of a plurality of image parameters, as the application is not limited in this respect.
In some embodiments, the jitter degree of a plurality of video frames in the video may be detected based on an optical flow method using image displacement, a feature point matching method, a video jitter detection method based on image gray-level distribution features, or the like. The jitter score of a video frame is then determined based on a preset correspondence between jitter degree ranges and jitter scores.
In some embodiments, a sharpness detection tool may be used to determine the sharpness of a plurality of video frames in the video, and the sharpness of a video frame may then be compared with the sharpness of its adjacent video frames to determine a sharpness change value for the video frame. The sharpness change score of the video frame is then determined based on a preset correspondence between sharpness change value ranges and sharpness change scores. Illustratively, such a detection tool may be open-source software such as FFmpeg or Video Quality Measurement Tool.
In some embodiments, the classification labels of a video frame may be compared with the classification labels of its adjacent video frames, and the label change of the video frame is determined based on the change in the number of classification labels and the change in the content of the classification labels. For example, the union and the intersection of the classification labels of a video frame and the classification labels of its adjacent video frame may be calculated; the number of labels in the union reflects the total number of classification labels of the two video frames, and the number of labels in the intersection reflects how much label content they share. When the number of classification labels in the union or in the intersection changes, the label change score of the video frame can be determined according to a preset correspondence between label count change ranges and label change scores.
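The union/intersection comparison above can be sketched as follows. The mapping from the changed-label ratio to a label change score is an assumed preset correspondence; the ranges used here are illustrative.

```python
def label_change_score(labels: set, prev_labels: set) -> float:
    union = labels | prev_labels          # all distinct labels across the two frames
    intersection = labels & prev_labels   # labels shared by the two frames
    if not union:
        return 0.0
    # Fraction of labels that changed between the two frames.
    changed_ratio = 1.0 - len(intersection) / len(union)
    # Preset correspondence between the change range and the score (illustrative values).
    if changed_ratio >= 0.6:
        return 1.0
    if changed_ratio >= 0.3:
        return 0.5
    return 0.0

print(label_change_score({"boy", "sky", "tree"}, {"boy", "tree", "bridge"}))  # 0.5
```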
In some embodiments, a pixel detection tool may be used to determine the pixels of a plurality of video frames in the video, and the pixels of a video frame may then be compared with the pixels of its adjacent video frames to determine a pixel change value for the video frame. The pixel change score of the video frame is then determined based on a preset correspondence between pixel change value ranges and pixel change scores. Illustratively, the pixel detection tool may be a plug-in such as PixelStick, MeasureIt or Guides.
In some embodiments, the degree of change in the content exhibited by two adjacent video frames in the video may be positively correlated with the magnitude of the segmentation score of the video frame, i.e., a greater segmentation score of a video frame indicates a more pronounced degree of change in the content exhibited by the video frame as compared to the adjacent video frame.
For example, a video frame may be compared with its previous video frame, and a segmentation score threshold may be preset; a video frame whose segmentation score exceeds the threshold is determined to be the start frame of a video segment. As shown in formula 1 above, the stronger the jitter of the video frame, the higher the value of frameA; the more obvious the sharpness change of the video frame compared with the previous video frame, the higher the value of frameB; the more obvious the label change compared with the previous video frame, the higher the value of frameY; and the larger the pixel change compared with the previous video frame, the higher the value of frameT.
Also for example, a video frame may be compared to its next video frame and a segment score threshold may be preset, with video frames with segment scores exceeding the segment score threshold being determined to be end frames of a video segment.
In some embodiments, the degree of variation of the content presented by two adjacent video frames in the video may also be inversely related to the size of the segmentation score of the video frame, i.e. the smaller the segmentation score of a video frame is, the more pronounced the degree of variation of the content presented by the video frame compared to the adjacent video frame is, i.e. the lower the segmentation score of a video frame is, the greater the likelihood that it is determined as the start frame of a video segment or as the end frame of a video segment.
For example, a video frame may be compared with its previous video frame, and a segmentation score threshold may be preset; a video frame whose segmentation score is below the threshold is determined to be the start frame of a video segment. In this case, the stronger the jitter of the video frame, the lower the value of frameA; the more obvious the sharpness change compared with the previous video frame, the lower the value of frameB; the more obvious the label change compared with the previous video frame, the lower the value of frameY; and the larger the pixel change compared with the previous video frame, the lower the value of frameT.
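Putting the thresholding together, the sketch below splits a sequence of per-frame segmentation scores into video segments using the positive-correlation variant described above (a score above the threshold marks the start frame of a new segment). The threshold value is illustrative.

```python
def split_into_segments(scores: list[float], threshold: float = 0.7) -> list[tuple[int, int]]:
    """Return (start_index, end_index) pairs of video segments, inclusive."""
    starts = [0]  # the first frame always starts the first segment
    for i, score in enumerate(scores[1:], start=1):
        if score > threshold:
            starts.append(i)  # obvious content change: this frame starts a new segment
    ends = [s - 1 for s in starts[1:]] + [len(scores) - 1]
    return list(zip(starts, ends))

# Example: frames 3 and 7 show an obvious content change compared with their previous frames.
print(split_into_segments([0.1, 0.2, 0.1, 0.9, 0.2, 0.3, 0.1, 0.8, 0.2]))
# [(0, 2), (3, 6), (7, 8)]
```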
Step 3, the multi-modal understanding module 530 determines a representative frame corresponding to each video segment using the computer vision service.
After the plurality of video segments are obtained, the multimodal understanding module 530 determines from each video segment a representative frame that can represent the video semantic of the video segment.
In some embodiments, the representative frame may be a start frame, an end frame, an intermediate frame, or a random frame in one video segment.
Illustratively, assuming that a video segment contains 99 video frames, in chronological order the 1st video frame (start frame), the 99th video frame (end frame), or the 50th video frame (intermediate frame) of the 99 video frames may be selected as the representative frame of the video segment, or a video frame may be selected at random.
In some embodiments, the optimal frame may be determined from a video segment as the representative frame based on preset rules associated with the image parameters and class labels of the video frame.
For example, a score for a video frame may be calculated based on its image parameters and classification labels: the more classification labels the video frame has, the higher its score; the lower its jitter degree, the higher its score; the higher its sharpness, the higher its score; and the smaller the pixel change compared with the previous video frame, the higher its score. Finally, the highest-scoring video frame (i.e., the optimal frame) in a video segment may be determined as the representative frame.
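The optimal-frame selection can be sketched as follows. The per-frame fields and the equal weighting of the criteria are assumptions; the patent states only the direction of each criterion (more labels, lower jitter, higher sharpness, and smaller pixel change give a higher score).

```python
from dataclasses import dataclass

@dataclass
class FrameInfo:
    frame_id: int
    num_labels: int       # number of classification labels on the frame
    jitter: float         # 0 (steady) .. 1 (strong jitter)
    sharpness: float      # 0 (blurry) .. 1 (sharp)
    pixel_change: float   # 0 (no change vs. previous frame) .. 1 (large change)

def frame_score(f: FrameInfo) -> float:
    # Higher score for more labels, less jitter, more sharpness, less pixel change.
    return f.num_labels + f.sharpness + (1.0 - f.jitter) + (1.0 - f.pixel_change)

def pick_representative(segment_frames: list[FrameInfo]) -> FrameInfo:
    """The highest-scoring frame in the segment is the representative (optimal) frame."""
    return max(segment_frames, key=frame_score)

segment = [
    FrameInfo(0, num_labels=2, jitter=0.7, sharpness=0.4, pixel_change=0.6),
    FrameInfo(1, num_labels=4, jitter=0.1, sharpness=0.9, pixel_change=0.2),
]
print(pick_representative(segment).frame_id)  # 1
```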
The determination of representative frames in the video is described in detail below in conjunction with fig. 6.
As shown in fig. 6, the video includes 120 video frames, one video frame is compared with the previous video frame, the segmentation scores corresponding to the 120 video frames are calculated according to the above formula 1, and the video frame with the segmentation score exceeding the segmentation score threshold is used as the start frame of one video segment.
In some embodiments, the 120 video frames are divided into 3 video segments, video segment 1, video segment 2, and video segment 3, and the intermediate frames of each video segment are determined to be their corresponding representative frames, video segment 1 comprising 40 video frames, video segment 2 comprising 51 video frames, and video segment 3 comprising 29 video frames. Namely, the 20 th video frame in the video segment 1 is taken as a representative frame corresponding to the video segment 1, the 26 th video frame in the video segment 2 is taken as a representative frame corresponding to the video segment 2, and the 15 th video frame in the video segment 3 is taken as a representative frame corresponding to the video segment 3.
It should be noted that, the number of video frames in the video segment 1 is 40, and is an even number, and the 20 th video frame and the 21 st video frame are intermediate frames, and any one of the intermediate frames may be determined as a representative frame corresponding to the video segment 1, which is not limited in the present application.
S504, the multi-modal understanding module 530 returns the representative frames of the video and their related information to the gallery service module 510.
The related information of the representative frame includes, but is not limited to, a time point corresponding to a start frame, a time point corresponding to an end frame, a time point corresponding to the representative frame, a classification label of the video segment, and the like in the video segment corresponding to the representative frame. The time points corresponding to the start frame, the end frame and the representative frame in the video segment are beneficial to jumping to the corresponding video frame when the video search result is displayed for the user later, so that the user experience is improved. The classification labels of the video segments are beneficial to displaying more accurate video search results when video search is carried out subsequently.
In some embodiments, a classification tag of a video segment may be used to indicate the kind of object presented by the video segment. For example, the classification labels corresponding to the video frames included in the video segment may be combined to obtain the classification labels of the video segment.
In addition, the related information representing the frame may further include a person name and a person relationship corresponding to the video segment. In some embodiments, after the representative frame is determined, the person name and person relationship corresponding to the person in the representative frame may be identified based on the preset person name and person relationship.
S505, the gallery service module 510 stores the representative frames of the video and their related information.
The gallery service module 510 may store the representative frames and their associated information returned by the multimodal understanding module 530 after it receives them. In some embodiments, gallery service module 510 is configured with a database, and gallery service module 510 may store representative frames of video and information related thereto in the database.
S506, the gallery service module 510 invokes the multimodal understanding module 530 to determine visual semantic vectors corresponding to the representative frames.
In some embodiments, the gallery service module 510 invokes the computer vision service provided by the multimodal understanding module 530 to determine the visual semantic vector corresponding to each representative frame. One video includes a plurality of video segments, each corresponding to a representative frame, so one video corresponds to a plurality of visual semantic vectors. The gallery service module 510 invokes the computer vision service provided by the multimodal understanding module 530 to perform video semantic understanding on the video, and thereby determines the visual semantic vectors corresponding to the plurality of representative frames in the video.
In some embodiments, the computer vision service may provide a CLIP model. The representative frame may be input to an image encoder of the CLIP model to obtain a visual semantic vector corresponding to the representative frame.
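For illustration, the sketch below obtains a visual semantic vector from a representative frame with a CLIP image encoder. The patent does not specify a particular implementation; this assumes the open-source CLIP checkpoint "openai/clip-vit-base-patch32" available through the Hugging Face transformers library.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def visual_semantic_vector(representative_frame: Image.Image) -> torch.Tensor:
    inputs = processor(images=representative_frame, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)       # shape: (1, 512) for this checkpoint
    return features / features.norm(dim=-1, keepdim=True)   # L2-normalize for cosine similarity

# Example: vec = visual_semantic_vector(Image.open("representative_frame.jpg"))
```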
In some embodiments, the image encoder of the CLIP model may be trained by image training samples, which may include picture training samples and video frame training samples. The training process of the image encoder of the CLIP model can be seen in the following embodiments.
S507, the multi-modal understanding module 530 returns visual semantic vectors corresponding to the representative frames to the gallery service module 510.
S508, gallery service module 510 stores visual semantic vectors corresponding to the representative frames.
The gallery service module 510 may store the visual semantic vector corresponding to the representative frame returned by the multimodal understanding module 530. Illustratively, if the gallery service module 510 is configured with a database, the gallery service module 510 may store visual semantic vectors corresponding to representative frames into the database.
S509, the gallery service module 510 sends the representative frame of the video and its related information and the visual semantic vector of the representative frame to the search module 520.
S510, the search module 520 constructs an index corresponding to the video.
One index is the index corresponding to one video segment in the video. One video includes multiple video segments, then one video corresponds to multiple indices, and different indices correspond to different video segments in the video.
The search module 520 may construct a visual semantic vector representing a frame, related information representing the frame, and an attribute tag combination of the video segments as an index representing the video segment to which the frame corresponds. As exemplified above, the index of a video segment may include a visual semantic vector of a representative frame of the video segment, a point in time corresponding to a start frame, a point in time corresponding to an end frame, and a point in time corresponding to a representative frame in the video segment, a classification tag of the video segment, and an attribute tag of the video segment (e.g., video acquisition time, video acquisition location, and video storage path), among others.
It should be noted that the content included in the above-mentioned index of the video segment is only an example, and the index of the video segment may include any information of the visual semantic vector representing the frame, the related information representing the frame, and the attribute tag of the video segment. The application is not limited in this regard.
Illustratively, the search module 520 may include an index library, and the search module 520 may store an index corresponding to the video into the index library so that a search match may be subsequently made based on the index library.
It will be appreciated that in practical applications, the number of videos stored in the mobile phone 300 may be very large, and one video may correspond to a plurality of indexes, which indicates that the index library may contain a large number of indexes. In some embodiments, to increase search efficiency, searches may be performed using inverted indexes.
For example, after the search module 520 constructs an index corresponding to the video, vector clustering is performed on visual semantic vectors included in the index library to obtain a plurality of clusters, that is, a vector space corresponding to all indexes is divided into a plurality of vector areas, one vector area includes one cluster, each vector area includes a plurality of indexes with higher vector similarity, and each vector area can be replaced by one cluster center point. For example, K-means or hierarchical clustering methods may be used for clustering, and the present application is not limited thereto.
Therefore, when the mobile phone 300 performs search matching between the text semantic vector of the search text and the index library, it can first match against the cluster center points and then against the visual semantic vectors in the vector region of the determined cluster center point, without matching against all indexes in the index library. This saves computing resources, greatly reduces the time consumed by search matching, avoids search delay, improves the search efficiency for videos, and further improves the user experience.
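The clustering step can be sketched as follows, assuming the visual semantic vectors of all indexes are stacked into a NumPy array. K-means is one of the clustering methods the patent mentions; scikit-learn is used here purely for illustration, and the number of regions is a placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vector_regions(index_vectors: np.ndarray, num_regions: int = 16):
    """Cluster index vectors into vector regions; each region is represented by its center point."""
    kmeans = KMeans(n_clusters=num_regions, n_init=10, random_state=0)
    region_of_index = kmeans.fit_predict(index_vectors)   # region id of every index
    centers = kmeans.cluster_centers_                     # one cluster center point per region
    # Group index ids by region so that only one region needs to be matched at search time.
    regions = {r: np.where(region_of_index == r)[0] for r in range(num_regions)}
    return centers, regions

# Example with random vectors standing in for the visual semantic vectors of representative frames:
vectors = np.random.rand(1000, 512).astype(np.float32)
centers, regions = build_vector_regions(vectors)
```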
In some embodiments, the search module 520 may store the received representative frame and related information thereof, and the visual semantic vector of the representative frame, so as to facilitate searching when constructing the index, thereby improving the construction speed of the index.
For example, the visual semantic vector of the representative frame corresponding to one video segment and the related information of the representative frame may be stored into the document. Taking video segment 1 as an example, the visual semantic vector of the representative frame and the related information of the representative frame corresponding to video segment 1 may be stored into sub-document 1. The information stored in sub-document 1 can be seen as shown in table 2 below:
TABLE 2
| Information name | Information content |
| segments.media-vector | [bd de d1 b4 3c 8c 9c] |
| segments.startTime | 0 |
| segments.endTime | 91666 |
| segments.startFrame | 0 |
| segments.endFrame | 2750 |
| segments.tag-name | Character/landscape/building |
As shown in table 2, segments.media-vector represents the visual semantic vector of the representative frame in video segment 1. In practice, vectors are typically composed of arrays of floating point numbers. In the embodiment of the application, hexadecimal serialization is performed on the visual semantic vector to obtain the visual semantic vector in the form of [bd de d1 b4 3c 8c 9c]; hexadecimal floating point numbers allow the mobile phone 300 to store the vector in less storage space, thereby saving storage space. segments.startTime represents the start time of video segment 1, 0 ms (i.e., the time point corresponding to the start frame in video segment 1). segments.endTime represents the end time of video segment 1, 91666 ms (i.e., the time point corresponding to the end frame in video segment 1). segments.startFrame and segments.endFrame indicate that video segment 1 contains a total of 2750 frames from start to end. segments.tag-name represents the classification labels of video segment 1, including "character", "landscape" and "building".
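One possible form of the hexadecimal serialization is sketched below. The exact byte layout used by the gallery is not specified; this simply packs float32 values little-endian and renders the bytes as space-separated hexadecimal pairs, which yields a compact string of the kind shown in table 2.

```python
import struct

def to_hex(vector: list[float]) -> str:
    raw = struct.pack(f"<{len(vector)}f", *vector)  # little-endian float32
    return raw.hex(" ")                             # space-separated hexadecimal byte pairs

def from_hex(text: str) -> list[float]:
    raw = bytes.fromhex(text.replace(" ", ""))
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

hex_form = to_hex([-0.1023, 0.0171])
print(hex_form)            # a compact hex string: eight byte pairs for two float32 values
print(from_hex(hex_form))  # recovers the original values (within float32 precision)
```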
In some embodiments, the attribute tags of video segment 1 may also be stored in sub-document 1, where the attribute tags of video segment 1 are the attribute tags of the video stored in step S502. Taking video segment 1 as one video segment of a video shot by the user as an example, the attribute tags of video segment 1 may include the shooting time, the shooting place, the storage path in the mobile phone 300, and the like.
In some embodiments, the attribute tags may instead be stored in document 1. It will be appreciated that the attribute tags of a video are fixed, i.e., the attribute tags corresponding to each of the plurality of video segments in a video are identical. The attribute tags of the video can therefore be stored in document 1, and the attribute tags of each video segment of the video can be obtained from document 1, which reduces the storage pressure of the mobile phone 300 and lowers the storage cost.
Illustratively, video D includes video segment 1 as well as video segment 2 and video segment 3. The visual semantic vector of the representative frame and the related information of the representative frame corresponding to video segment 1 are stored in sub-document 1, those corresponding to video segment 2 may be stored in sub-document 2, and those corresponding to video segment 3 may be stored in sub-document 3. The information stored in document 1 can be seen in table 3 below:
TABLE 3
As shown in table 3, file_path represents the storage path of video D in the cell phone 300. The imaging-time indicates the shooting time of the video D. The location indicates the shooting location of the video D. The segments are used for indicating sub-documents respectively corresponding to the plurality of video segments included in the video D. Based on the above tables 2 and 3, the attribute tag of the video segment 1 may be obtained from the document 1, and the related information of the representative frame of the video segment 1 and the like may be obtained from the sub-document 1. Similarly, the attribute tags for video segment 2 and video segment 3 may also be obtained from document 1.
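For illustration, the document / sub-document layout described by tables 2 and 3 could look as follows. The field names follow the tables; the concrete values and the storage path are hypothetical.

```python
video_document = {
    "file_path": "/storage/emulated/0/DCIM/Camera/video_D.mp4",  # storage path of video D (hypothetical)
    "imaging-time": "2023-10-10 09:30:00",                       # shooting time of video D (hypothetical)
    "location": "city A",                                        # shooting location (hypothetical)
    "segments": [
        {   # sub-document 1: index information of video segment 1 (values from table 2)
            "segments.media-vector": "bd de d1 b4 3c 8c 9c",     # hex-serialized visual semantic vector (truncated)
            "segments.startTime": 0,                             # ms, time point of the start frame
            "segments.endTime": 91666,                           # ms, time point of the end frame
            "segments.startFrame": 0,
            "segments.endFrame": 2750,
            "segments.tag-name": ["character", "landscape", "building"],
        },
        # sub-document 2 and sub-document 3 hold video segment 2 and video segment 3 ...
    ],
}
```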
It should be noted that, since video semantic understanding may occupy a large amount of computing resources, steps S503-S510 may be executed when the mobile phone 300 is charging with the screen off, so as to avoid affecting the use of the device (for example, causing lag).
In the video searching method provided by the application, search matching is performed between the search text input by the user and the indexes corresponding to the video to obtain the search result. An index corresponding to the video is the index of one video segment in the video, and the index of a video segment at least comprises the visual semantic vector of the representative frame in that video segment. The visual semantic vector of the representative frame indicates the meaning expressed by the picture content presented by the representative frame, which can represent the video semantics of the video segment, so the visual semantic vector of the representative frame is sufficiently associated with the video semantics of the video segment. Performing search matching based on the visual semantic vector of the representative frame realizes a fused interaction between the search text and the picture content displayed by the representative frame, which further improves the accuracy of the video search result and improves the user experience.
Next, the steps included in the searching stage of the video searching method provided by the present application will be described in further detail with reference to fig. 5 b.
S511, the gallery service module 510 receives a user input operation of the search text.
The user may enter the search text in a search interface provided by the mobile phone 300. Illustratively, the user may enter the search text in the search box 321 included in the album display interface 320 as shown in FIG. 3a. Also illustratively, the user may enter the search text in the search box 421 included in the negative-one-screen interface 420 as shown in FIG. 4a.
The search text is text in which the user describes the characteristics of the video he or she is looking for. Illustratively, the search text may include the video acquisition time, the video acquisition location, the picture content presented by the video, and the like. For example, the search text may be "scenery shot last week", which is not limited by the present application.
S512, the gallery service module 510 sends the search text to the search module 520.
S513, the search module 520 invokes the multimodal understanding module 530 to determine the text semantic vector corresponding to the search text.
The search module 520 invokes the multimodal understanding module 530 to perform text semantic understanding on the search text to obtain text semantic vectors corresponding to the search text.
Text semantic understanding refers to enabling the mobile phone to understand the meaning expressed by text, and is a key technology in natural language processing (NLP).
In some embodiments, the multimodal understanding module 530 provides a CLIP model, and the search text can be input to the text encoder of the CLIP model to obtain the text semantic vector corresponding to the search text.
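By way of illustration only, the following sketch shows how a text semantic vector may be obtained with a CLIP-style text encoder. It uses the open-source Hugging Face CLIP implementation and the "openai/clip-vit-base-patch32" checkpoint as stand-ins; the actual text encoder provided by the multimodal understanding module 530 may differ.

```python
# Minimal sketch: encode a search text into a text semantic vector with a CLIP text encoder.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

search_text = "scenery shot last week"
inputs = tokenizer([search_text], padding=True, return_tensors="pt")
with torch.no_grad():
    text_vector = model.get_text_features(**inputs)              # shape: (1, 512)
text_vector = text_vector / text_vector.norm(dim=-1, keepdim=True)  # normalize for cosine similarity
```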
In some embodiments, the text encoder of the CLIP model may be trained by text training samples. The training process of the text encoder of the CLIP model can be seen in the following embodiments.
S514, the multimodal understanding module 530 returns the text semantic vector corresponding to the search text to the search module 520.
S515, the search module 520 performs vector recall in the index library based on the text semantic vector corresponding to the search text.
Vector recall refers to recalling, from the index library, the indexes that match the text semantic vector corresponding to the search text.
In some embodiments, the vector similarity between the text semantic vector corresponding to the search text and the visual semantic vector included in each of the plurality of indexes in the index library can be calculated to obtain a vector similarity result for each index, and the N indexes with the highest vector similarity are used as the vector recall result. N is an integer greater than 0. Illustratively, N may be a preset number of vector recall results, such as 5, 8, or 10.
The vector similarity refers to the degree of similarity between two vectors and can be calculated in various ways. For example, the degree of similarity may be determined by calculating the cosine similarity of the two vectors; other methods may also be used, which is not limited in this application.
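As an illustrative example only, the following is a minimal sketch of the top-N vector recall described above, assuming the visual semantic vectors of all indexes have been gathered into a NumPy matrix; the variable names are illustrative.

```python
import numpy as np

def top_n_vector_recall(text_vector: np.ndarray, index_vectors: np.ndarray, n: int = 10):
    """Return the positions of the N indexes whose visual semantic vectors are most similar."""
    # Normalize so that the dot product equals cosine similarity.
    t = text_vector / np.linalg.norm(text_vector)
    v = index_vectors / np.linalg.norm(index_vectors, axis=1, keepdims=True)
    similarities = v @ t                      # one similarity score per index
    order = np.argsort(-similarities)[:n]     # N indexes with the highest similarity
    return order, similarities[order]
```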
In some embodiments, as described above, the vector space in the index library includes a plurality of vector regions, and each vector region includes a plurality of indexes with high mutual vector similarity. In the embodiment of the application, the distance between the text semantic vector corresponding to the search text and the cluster center points of the plurality of vector regions in the index library can be calculated, and the closest cluster center point can be determined. The vector similarity between the text semantic vector and the visual semantic vector included in each index in the vector region to which that cluster center point belongs is then calculated, and the indexes are sorted in descending order of vector similarity to obtain the inverted zipper corresponding to the cluster center point. Illustratively, the top N indexes in the inverted zipper may be used as the vector recall result. Also illustratively, the indexes whose vector similarity exceeds a vector similarity threshold may be used as the vector recall result.
The process of vector recall is described in detail below in conjunction with fig. 7.
As shown in fig. 7, the inverted index library includes a plurality of cluster center points, such as cluster center point 1. First, the distances between the text semantic vector corresponding to the search text and the cluster center points of the plurality of vector regions in the index library are calculated, and cluster center point 1 is determined as the closest cluster center point. Then the vector similarity between the text semantic vector and the visual semantic vectors of the multiple indexes in the vector region of cluster center point 1 is calculated, the indexes are sorted in descending order of vector similarity to obtain inverted zipper 1 corresponding to cluster center point 1, and the TopN indexes are selected from inverted zipper 1 as the vector recall result.
In inverted zipper 1, the visual semantic vector corresponding to index 1 is the closest to cluster center point 1, and index 2 and index 3 are progressively farther from cluster center point 1; therefore, the TopN indexes are selected starting from index 1 and moving backwards. N may be any integer greater than 0, which is not limited in the present application.
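For illustration only, the following sketch shows the cluster-based recall of fig. 7, assuming the cluster center points and the visual semantic vectors of each vector region are already available in memory; all names are illustrative and the actual inverted index structure may differ.

```python
import numpy as np

def cluster_vector_recall(text_vector, cluster_centers, region_vectors, n=10):
    """Recall the TopN indexes from the vector region of the closest cluster center point.

    cluster_centers: (C, D) array of cluster center points.
    region_vectors:  list of (M_c, D) arrays, one per cluster, holding the visual
                     semantic vectors of the indexes in that vector region.
    """
    t = text_vector / np.linalg.norm(text_vector)
    # Step 1: find the closest cluster center point.
    distances = np.linalg.norm(cluster_centers - t, axis=1)
    closest = int(np.argmin(distances))
    # Step 2: rank the indexes in that vector region by similarity to the text vector.
    v = region_vectors[closest]
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    similarities = v @ t
    inverted_zipper = np.argsort(-similarities)      # descending similarity
    return closest, inverted_zipper[:n], similarities[inverted_zipper[:n]]
```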
S516, the search module 520 invokes the natural language understanding module 540 to identify entities in the search text.
The search module 520 invokes the natural language understanding module 540 to identify the entities contained in the search text.
For example, entities having a specific meaning in the search text may be identified by named entity recognition (NER). Entities may include, but are not limited to, time, place, person name, organization name, and proper nouns. Taking the search text "scenery shot last week" as an example, the entities in the search text include "last week" and "scenery".
S517, the natural language understanding module 540 returns the entities in the search text to the search module 520.
S518, the search module 520 performs entity recall in the index library based on the entities in the search text.
Entity recall refers to recalling, from the index library, the indexes that match the entities in the search text.
In some embodiments, the index may include the related information of the representative frame of the video segment and the attribute tag of the video segment, and these may contain entities, such as the video acquisition time, the video acquisition location, and the category label of the video segment. Taking the search text "scenery shot in city B last week" as an example, the search text contains the place entity "city B", the time entity "last week", and the entity "scenery" related to the picture content displayed by the video; these can be matched against the entities corresponding to the respective indexes, and the indexes that match the entities in the search text are obtained as the entity recall result.
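For illustration only, the following sketch shows entity recall under the assumption that each index carries a set of entity strings derived from its attribute tags; in practice, a time entity such as "last week" would typically be resolved to a date range before matching, which this sketch omits.

```python
def entity_recall(search_entities, indexes):
    """Return the indexes whose attribute-tag entities match the entities in the search text.

    search_entities: entities extracted from the search text, e.g. {"last week", "city B", "scenery"}.
    indexes: list of dicts; each dict is assumed to hold an "entities" set built from the
             attribute tags (acquisition time, acquisition location, category label, ...).
    """
    results = []
    for idx in indexes:
        matched = search_entities & idx["entities"]   # entities present in both
        if matched:
            results.append((idx, len(matched)))       # keep the number of matched entities
    # Indexes that match more entities in the search text are ranked first.
    results.sort(key=lambda item: item[1], reverse=True)
    return results
```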
S519, the search module 520 ranks the vector recall results and the entity recall results.
In some embodiments, the intersection results or union results of the vector recall results and the entity recall results may be ordered.
Illustratively, the ranking may be based on the vector similarity between each recall result and the search text and on the degree of entity matching between each recall result and the search text. For example, for each recall result (a vector recall result or an entity recall result), the vector similarity between the text semantic vector of the search text and the visual semantic vector of the recall result and the degree of matching between the entities in the search text and the entities in the recall result can be weighted and summed to obtain a comprehensive matching degree, and the recall results (including the vector recall results and the entity recall results) are sorted in descending order of comprehensive matching degree.
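As an illustrative example only, the following sketch shows the weighted-sum ranking described above; the weight values and the score fields are illustrative assumptions.

```python
def rank_recall_results(results, w_vector=0.7, w_entity=0.3):
    """Sort recall results by a weighted sum of vector similarity and entity matching degree.

    results: list of dicts with keys "vector_similarity" (cosine similarity mapped to [0, 1])
             and "entity_match" (fraction of search-text entities matched, in [0, 1]).
    """
    for r in results:
        # Comprehensive matching degree of the recall result.
        r["score"] = w_vector * r["vector_similarity"] + w_entity * r["entity_match"]
    return sorted(results, key=lambda r: r["score"], reverse=True)
```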
Therefore, based on the search text input by the user, entity matching is performed in addition to matching on the text semantic vector, and the final display order is obtained from the comprehensive matching degree of the search results. This ensures that the videos presented to the user better match the user's search text, further improving the use experience.
S520 the search module 520 returns the search results to the gallery service module 510.
The search module 520 returns the ranked search results to the gallery service module 510.
S521, gallery service module 510 presents the search results to the user.
It will be appreciated that, in practice, the mobile phone 300 will typically store both pictures and videos in the gallery application. Therefore, when a search is performed through the search interface provided in the gallery application, picture search results and video search results are displayed at the same time. That is, the index library includes indexes corresponding to pictures in addition to the indexes corresponding to video segments.
In some embodiments, the index corresponding to a picture may include, but is not limited to, a picture semantic vector, the attribute tag of the picture, and the like. Illustratively, as described above, a video frame is an image and a picture is also an image, so the picture semantic vector corresponding to the picture may be generated by the image encoder of the CLIP model provided by the multimodal understanding module 530 and returned to the gallery service module 510. For example, the attribute tag of the picture may be obtained by the gallery service module 510 by receiving and storing a user operation of adding or modifying the attribute tag of the picture. The search module 520 receives the picture semantic vector and the attribute tag of the picture sent by the gallery service module 510 and constructs the index corresponding to the picture based on them.
Illustratively, as in the first search result display interface 330 of FIG. 3a, the displayed first search results include video search results and picture search results.
In some embodiments, the search results may be presented based on a plurality of ordered indexes. As described above, the index to which the video corresponds includes visual semantic vectors representing frames, related information representing frames, and attribute tags for the video segments. The index corresponding to the picture comprises a picture semantic vector, an attribute label of the picture and the like.
Illustratively, as in the first search result display interface 330 of FIG. 3a, the picture search results in the first display area 331 show images, and the video search results show thumbnails and time points. The thumbnail and the time points correspond to information included in the index corresponding to the video search result. Taking "video A" in FIG. 3a as an example, its thumbnail is the start frame of video segment A included in the index, its first time point is the time point "02:18" corresponding to the start frame of video segment A included in the index, and its second time point is the total duration "08:32" of the video included in the index.
Also by way of example, the video search results and the picture search results may also display their respective attribute tags; for example, a video search result may display the video capture time, the video capture location, and the like. As shown in (2) of FIG. 3b, the picture search result includes the shooting time of the picture "October 1, 2023" and the shooting place of the picture "city B".
Still further exemplary, the video search results may also display other content in the corresponding index, such as persona relationships, persona names, and the like. The application is not limited in this regard.
It should be noted that the gallery service module, the search module, the multimodal understanding module, and the natural language understanding module may also be located in a cloud server; that is, the cloud server implements the steps included in the index construction stage through the interaction of the four modules. In the search stage, the steps may be implemented based on the interaction between an electronic device such as the mobile phone 300 and the cloud server.
In some embodiments, the mobile phone 300 may send the search text input by the user to the gallery service module of the cloud server, so that the gallery service module of the cloud server interacts with other modules to implement the steps included in the search phase, and then the gallery service module of the cloud server sends the search result to the mobile phone 300, so that the mobile phone 300 displays the search result to the user.
In some embodiments, assume that the total duration of video E is 2 minutes and it is shot at a frame rate of 30 fps, that is, 30 images are captured per second. When video E is stored in the cloud server, the multimodal understanding module in the cloud server may perform frame-splitting processing on video E to decompose it into 3600 video frames. The multimodal understanding module then performs segmentation processing on video E, dividing it into 120 video segments with 1 s as the time unit, so that each video segment includes 30 video frames (that is, the 30 images captured in one second). The multimodal understanding module scores the 30 video frames in each video segment and takes the video frame with the highest score as the representative frame of that video segment. Illustratively, the scoring may be based on the jitter, sharpness, pixels, and the like of the video frame, which is not limited by the application.
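For illustration only, the following sketch shows the per-second segmentation and representative-frame selection described above, assuming the frames of video E have already been decoded into memory and that score_frame is a placeholder for the jitter/sharpness/pixel scoring, which is assumed to be implemented elsewhere.

```python
def select_representative_frames(frames, score_frame, fps=30):
    """Split decoded frames into 1 s segments and pick one representative frame per segment.

    frames: decoded video frames in order (for a 2-minute video at 30 fps, 3600 frames).
    score_frame: callable returning a quality score for a frame (e.g. based on jitter,
                 sharpness, pixels).
    """
    representatives = []
    for start in range(0, len(frames), fps):             # one segment per second
        segment = frames[start:start + fps]
        best = max(range(len(segment)), key=lambda i: score_frame(segment[i]))
        representatives.append({
            "segment_index": start // fps,                # 0-based segment number
            "time_point_s": start // fps,                 # start time of the segment in seconds
            "frame": segment[best],                       # representative frame of the segment
        })
    return representatives
```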
Then the search module of the cloud server takes the visual semantic vector of the representative frame as the index of the corresponding video segment, so that the representative frame of each video segment of video E can be matched based on the search text. If the search text is successfully matched with the 50th video segment of video E, the thumbnail of the video returned to the user is the representative frame of the 50th video segment, and the returned time point is 50 s. For the specific implementation, reference may be made to fig. 5b and the description of the above embodiments, which will not be repeated here.
It should be noted that the multimodal understanding module 530 of the mobile phone 300 may also segment the videos stored in the gallery with 1 s as the time unit, which is not limited by the present application.
In the above embodiments, the video searching method provided by the present application uses the image encoder and the text encoder of the CLIP model, so the image encoder and the text encoder of the CLIP model need to be trained first. In some embodiments, the image encoder and the text encoder of the CLIP model may be trained separately. In some embodiments, the image encoder and the text encoder of the CLIP model may be trained together based on a contrastive learning training pattern.
The training process of the image encoder and the text encoder of the CLIP model is described below in conjunction with fig. 8 and fig. 9. The following embodiments divide the contrastive-learning-based training of the image encoder and the text encoder of the CLIP model into steps 1-7.
Step 1, acquiring an image training sample and a text training sample corresponding to the image training sample.
Both the image encoder of the CLIP model and the text encoder of the CLIP model need to be trained in advance with a large number of training samples. Therefore, before model training, a training sample of an image encoder of the CLIP model, that is, an image training sample, is acquired, and a training sample of a text encoder of the CLIP model, that is, a text training sample corresponding to the image training sample, is acquired.
As described above, the image training samples may include picture training samples and video frame training samples. The picture training sample can be any picture, and the video frame training sample can be a video frame in any video. The text training sample corresponding to an image training sample refers to text corresponding to the content displayed by the image training sample, that is, the text training sample can express the content displayed by the image training sample. Illustratively, if the image training sample is the thumbnail displayed for "video A" in the first display area 331 in FIG. 3a, then its corresponding text training sample may be "Male student standing at night next to the big tree".
The application does not limit the manner of acquiring the text training samples corresponding to the image training samples.
For example, the text training samples corresponding to the image training samples may be manually labeled, that is, labeled according to a person's semantic understanding of the image training samples. Also for example, the text training samples may be automatically generated by identifying relevant content such as objects, scenes, and actions in the image training samples. Still further, the text training samples may be automatically generated by a text generation model for generating descriptive text of images.
It should be noted that the number of the image training samples is not limited in the present application. It will be appreciated that the text training samples correspond to the image training samples and are therefore the same number.
As shown in fig. 9, N image training samples are acquired, and N text training samples corresponding to the N image training samples one by one are acquired. Illustratively, image training sample 1 corresponds to text training sample 1.
Step 2, inputting the image training samples into the image encoder, and outputting, by the image encoder, the image vectors corresponding to the image training samples.
As shown in fig. 8, for an image training sample, an image encoder may encode it to obtain an image vector for the image training sample.
As shown in fig. 9, N image training samples are input to the image encoder, and image vectors I1, I2, I3, …, IN corresponding to the N image training samples are obtained.
Step 3, inputting the text training samples into the text encoder, and outputting, by the text encoder, the text vectors corresponding to the text training samples.
As shown in fig. 8, for a text training sample, a text encoder may encode it to obtain a text vector for the text training sample.
As shown in fig. 9, N text training samples are input to the text encoder, and text vectors T1, T2, T3, …, TN corresponding to the N text training samples are obtained.
Step 4, respectively combining each image vector with the plurality of text vectors to obtain a plurality of vector pairs, determining, from the plurality of vector pairs, the vector pairs having a corresponding relationship as positive sample vector pairs, and determining the remaining vector pairs as negative sample vector pairs.
Contrastive learning is an unsupervised training approach, so positive samples and negative samples need to be defined from the training samples. In the embodiment of the application, the positive sample vector pairs and the negative sample vector pairs are determined from the plurality of vector pairs.
In some embodiments, assuming that there are N image vectors and N text vectors, combining each image vector with the N text vectors respectively results in N×N vector pairs. It can be understood that the image training samples and the text training samples have a corresponding relationship, so the vectors corresponding to them also have a corresponding relationship; among the N×N vector pairs, the N vector pairs formed by an image vector and a text vector having a corresponding relationship are determined as positive sample vector pairs, and the remaining vector pairs are determined as negative sample vector pairs.
As shown in fig. 9, taking I1 as an example, it is combined with T1, T2, T3, …, TN to obtain N vector pairs I1·T1, I1·T2, I1·T3, …, I1·TN; similarly for I2, I3, …, IN, so that N×N vector pairs are obtained in total. The vector pairs having a corresponding relationship, namely I1·T1, I2·T2, I3·T3, …, IN·TN, are determined as positive sample vector pairs, and the rest are determined as negative sample vector pairs.
Step 5, calculating the vector similarity between the image vector and the text vector in each vector pair.
For example, a vector cosine similarity between the image vector and the text vector in each vector pair may be calculated.
Step 6, adjusting the parameters of the image encoder and the text encoder based on the loss function, the vector similarity corresponding to the positive sample vector pairs, and the vector similarity corresponding to the negative sample vector pairs.
As shown in fig. 8, contrastive learning is performed on the image encoder together with the image vectors it outputs and the text encoder together with the text vectors it outputs, and the parameters of the image encoder and the text encoder are adjusted accordingly.
It will be appreciated that in embodiments of the present application, the model training goal is to maximize the vector similarity of the positive sample vector pair and minimize the vector similarity of the negative sample vector pair.
Illustratively, the loss function may be a cross-entropy loss function, which is specifically as follows:

$$L = -\frac{1}{K}\sum_{i=1}^{K}\left[y_i \log \hat{y}_i + \left(1 - y_i\right)\log\left(1 - \hat{y}_i\right)\right]$$

where $L$ represents the loss value of the loss function, $K$ represents the number of vector pairs, $y_i$ represents the true vector similarity corresponding to the i-th vector pair, and $\hat{y}_i$ represents the predicted vector similarity corresponding to the i-th vector pair.
Furthermore, in some embodiments, the true vector similarity of a positive sample vector pair may be set to 1 and the true vector similarity of a negative sample vector pair may be set to 0, and the parameters of the image encoder and the text encoder are adjusted until the predicted vector similarity of the positive sample vector pairs is as close to 1 as possible and that of the negative sample vector pairs is as close to 0 as possible, that is, until the value of the loss function is minimized.
Step 7, when the training cut-off condition is met, ending the training to obtain the trained image encoder and the trained text encoder. Illustratively, the training cut-off condition may be reaching a preset number of training iterations during model training, or the loss value of the loss function being smaller than a loss value threshold, or the like.
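For illustration only, the following PyTorch-style sketch shows steps 2-6 for one training batch, using the pairwise binary cross-entropy formulation described above; the encoder modules, the optimizer, and the data loading are assumed to be defined elsewhere, and practical CLIP implementations may instead use a symmetric softmax cross-entropy over the similarity matrix.

```python
import torch
import torch.nn.functional as F

def train_step(image_encoder, text_encoder, optimizer, images, texts):
    """One contrastive training step over a batch of N corresponding image/text samples."""
    image_vectors = F.normalize(image_encoder(images), dim=-1)   # Step 2: (N, D) image vectors
    text_vectors = F.normalize(text_encoder(texts), dim=-1)      # Step 3: (N, D) text vectors

    # Steps 4 and 5: all N x N vector pairs and their cosine similarities.
    similarities = image_vectors @ text_vectors.t()              # (N, N)
    # Positive pairs lie on the diagonal (true similarity 1), the rest are negative (0).
    targets = torch.eye(similarities.size(0), device=similarities.device)

    # Map cosine similarity from [-1, 1] to [0, 1] so it can be compared with the 0/1 targets.
    predicted = (similarities + 1) / 2

    # Step 6: cross-entropy between predicted and true similarities, then parameter update.
    loss = F.binary_cross_entropy(predicted, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```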
Embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a computer, is capable of carrying out one or more steps of any one of the video search methods described above.
The computer readable storage medium may be a non-transitory computer readable storage medium, for example, a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Another embodiment of the application also provides a computer program product containing instructions. The computer program product is capable of implementing one or more steps of any of the video search methods described above when executed by a computer.
The electronic device, the computer readable storage medium, and the computer program product provided in this embodiment are used to execute the corresponding video searching method provided above; therefore, for their beneficial effects, reference may be made to the beneficial effects of the corresponding video searching method provided above, which will not be repeated here.
The terms first, second, third and the like in the description and in the claims and in the drawings are used for distinguishing between different objects and not for limiting the specified order.
In embodiments of the application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g." in an embodiment should not be taken as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.