CN118692014B

Movatterモバイル変換

Info

Publication number: CN118692014B
Application number: CN202411181885.XA
Authority: CN
Inventors: 陈世哲
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2024-08-27
Filing date: 2024-08-27
Publication date: 2024-12-13
Anticipated expiration: 2044-08-27
Also published as: CN118692014A

Abstract

The embodiment of the application discloses a method, a device, equipment, a medium and a product for identifying a video tag, wherein the method comprises the steps of obtaining a video to be identified, and extracting multi-modal characteristics of the video to be identified to obtain a first multi-modal characteristic, wherein the first multi-modal characteristic comprises a plurality of modal characteristics corresponding to a plurality of modalities respectively; the method comprises the steps of searching a plurality of reference videos similar to a video to be identified according to first multi-mode features, wherein each reference video carries a video tag, constructing context learning information according to second multi-mode features corresponding to each reference video and the video tags carried by each reference video respectively and the first multi-mode features of the video to be identified, and identifying the video tags of the video to be identified according to the context learning information. The technical scheme of the embodiment of the application can rapidly and accurately identify the video tag of the obtained video.

Description

Video tag identification method, device, equipment, medium and product

Technical Field

The present application relates to the field of video tag technologies, and in particular, to a video tag identification method, a video tag identification apparatus, an electronic device, a computer readable storage medium, and a computer product.

Background

Video tag identification is an important part of video content characteristics, and by automatically generating tags for massive users to generate content (User-GENERATED CONTENT, UGC) videos, different granularity video content characteristics can be provided for downstream content distribution links (such as recommendation systems and content operation), but due to the diversity of UGC video contents, tag libraries used in actual business scenes often contain hundreds of thousands or even millions of tags, and matching corresponding video tags accurately for each video faces a great challenge.

Disclosure of Invention

Embodiments of the present application provide a video tag recognition method, a video tag recognition apparatus, an electronic device, a computer-readable storage medium, and a computer program product, which can quickly and accurately recognize a video tag of a video.

Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.

According to one aspect of the embodiment of the application, a video tag identification method is provided, which comprises the steps of obtaining a video to be identified, extracting multi-modal features of the video to be identified to obtain first multi-modal features, wherein the first multi-modal features comprise multi-modal features corresponding to multiple modes respectively, searching multiple reference videos similar to the video to be identified according to the first multi-modal features, each reference video carries a video tag, constructing context learning information according to second multi-modal features corresponding to each reference video and video tags carried by each reference video, and the first multi-modal features of the video to be identified, and identifying the video tag of the video to be identified according to the context learning information.

According to one aspect of the embodiment of the application, a video tag identification device is provided, which comprises an acquisition module, a retrieval module and a construction module, wherein the acquisition module is used for acquiring a video to be identified, carrying out multi-modal feature extraction on the video to be identified to obtain a first multi-modal feature, the first multi-modal feature comprises a plurality of modal features corresponding to a plurality of modes respectively, the retrieval module is used for retrieving a plurality of reference videos similar to the video to be identified according to the first multi-modal feature, each reference video carries a video tag, the construction module is used for constructing context learning information according to a second multi-modal feature corresponding to each reference video and each video tag carried by each reference video respectively, and the first multi-modal feature of the video to be identified, and the identification module is used for identifying the video tag of the video to be identified according to the context learning information.

In an embodiment of the application, the first multi-modal feature comprises a first video frame feature, the second multi-modal feature comprises a second video frame feature, a construction module is further used for performing feature conversion processing on the first video frame feature to obtain a first visual text mark sequence, performing feature conversion processing on the second video frame feature corresponding to each reference video to obtain a second visual text mark sequence corresponding to each reference video, acquiring first text information corresponding to the video to be identified and second text information corresponding to each reference video, and constructing the context learning information according to the second visual text mark sequence, the second text information and the video tag corresponding to each reference video, and the first visual text mark sequence and the first text information.

In an embodiment of the application, the number of the first video frame features includes a plurality of first video frame features, the construction module is further configured to perform feature fusion processing on the plurality of first video frame features to obtain target video features, and the target video features are aligned to a preset text feature space through a pre-training feature alignment module to obtain the first visual text marker sequence.

In an embodiment of the application, the construction module is further configured to construct a learning example according to a first text label for isolating different information under the same video, the second visual text label sequence corresponding to the same reference video, the second text information and the video label, so as to obtain a plurality of learning examples corresponding to a plurality of reference videos, splice the plurality of learning examples to obtain learning information, wherein the plurality of learning examples in the learning information are separated by the second text label for isolating information of different videos, construct a recognition example according to the first text label, the first visual text label sequence and the first text information, and generate the contextual learning information according to the learning information and the recognition example.

In an embodiment of the application, the acquisition module is further used for acquiring a video title of the video to be identified and an identification text obtained by converting the audio information of the video to be identified, generating a situation description of the video to be identified according to a video scene of the video to be identified, and generating the first text information according to the video title, the identification text and the situation description.

In an embodiment of the application, the plurality of target candidate videos comprise a plurality of target candidate videos retrieved by each retrieval mode, the retrieval module is further used for calculating the similarity between the video to be identified and each target candidate video according to the plurality of target candidate videos retrieved by each retrieval mode, calculating the average similarity between the target candidate videos and the video to be identified in each retrieval mode according to each target candidate video, and selecting the reference video from the plurality of target candidate videos according to the average similarity.

In an embodiment of the application, the acquisition module is further used for extracting a plurality of video frames from the video to be identified, dividing the plurality of video frames into a plurality of fragments, extracting a target video frame from the plurality of fragments, extracting video features of the target video frame to obtain first video frame features, acquiring first text information corresponding to the video to be identified, extracting text features of the first text information to obtain first text features, and obtaining the first multi-mode features according to the first video frame features and the first text features.

In an embodiment of the application, the identification is further used for carrying out sequence feature conversion on the context learning information to obtain a context learning sequence, the sequence feature is an input feature supported by a tag generation model, the context learning sequence is input to the tag generation model, the tag generation model is obtained by keeping frozen original model parameters of a preset language model and adjusting newly-added model parameters of the language model according to a sample context learning sequence, the newly-added model parameters are related to a low-rank adaptive module introduced into the language model, and a target video tag of the video to be identified output by the tag generation model is obtained.

In an embodiment of the application, the device further comprises a training module for acquiring a first sample video feature, a first text information and a first sample video tag corresponding to a first sample video and a second sample video feature, a second sample text information and a second sample video tag corresponding to a second sample video, respectively performing feature alignment processing on the first sample video feature and the second sample video feature according to a pre-trained initial feature alignment module to obtain a first sample visual tag sequence and a second sample visual tag sequence, constructing sample context learning information according to the first sample visual tag sequence, the first sample text information and the first sample video tag and the second sample text information, performing sequence feature conversion on the sample context learning information to obtain the sample context learning sequence, introducing the low-rank adaptive module into the language model to introduce the new model parameters into the language model through the low-rank adaptive module, maintaining the original model parameters and freezing the new model parameters according to the low-rank adaptive module, and generating the new model parameters to the new model and the second model context learning sequence, and generating the new model tag.

In one embodiment of the application, the training module is further used for acquiring a sample image and a sample description text corresponding to the sample image, extracting characteristics of the sample image to obtain sample visual characteristics, inputting the sample visual characteristics to a module to be trained so that the module to be trained aligns the sample visual characteristics to an input text characteristic space of the language model, acquiring sample text characteristics corresponding to an alignment description instruction, inputting the sample text characteristics and target sample visual text characteristics output by the characteristic alignment module to the language model, wherein model parameters of the language model are kept frozen, acquiring sample prediction description text output by the language model, and training the module to be trained according to the sample description text and the sample prediction description text to obtain the initial characteristic alignment module.

In an embodiment of the present application, the training module is further configured to input the sample context learning sequence to a language model that is introduced into the low-rank adaptive module, obtain a predicted sample tag output by the language model, adjust the newly added model parameter of the language model according to a difference between the predicted sample tag and the second sample video tag to obtain the tag generation model, and adjust the module parameter of the initial feature alignment module according to a difference between the predicted sample tag and the second sample video tag.

According to one aspect of the embodiment of the application, the embodiment of the application provides an electronic device, comprising one or more processors, and a storage device for storing one or more computer programs, which when executed by the one or more processors, cause the electronic device to implement the video tag identification method as described above.

According to an aspect of an embodiment of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor of an electronic device, causes the electronic device to perform the video tag identification method as described above.

According to an aspect of the embodiments of the present application, there is provided a computer program product including a computer program stored in a computer-readable storage medium, a processor of an electronic device reading and executing the computer program from the computer-readable storage medium, causing the electronic device to execute the video tag identification method as above.

In the technical scheme provided by the embodiment of the application, the video to be identified is obtained, the multi-modal feature extraction is carried out on the video to be identified, the first multi-modal feature containing multiple modal features is generated, the multiple reference videos similar to the video to be identified are searched according to the first multi-modal feature, each reference video carries a video tag, the reference videos not only provide rich contrast information, but also provide important context basis for tag identification of the video to be identified, context learning information is constructed according to the second multi-modal feature corresponding to each reference video and the video tag carried by each reference video respectively, and the first multi-modal feature of the video to be identified, and the context learning information is utilized to enable tag identification to be limited to single-modal feature analysis, but also to carry out context contrast and learning by utilizing the tags and the multi-modal features of the multiple reference videos, so that references are provided for identifying the video tags of the video to be identified, and the video tags of the video to be identified are identified rapidly and accurately, and more accurately.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings.

FIG. 1 is a schematic diagram of an implementation environment in which the present application is directed.

Fig. 2 is a flowchart illustrating a video tag recognition method according to an exemplary embodiment of the present application.

Fig. 3 is a schematic diagram illustrating another video tag identification method according to an exemplary embodiment of the present application.

Fig. 4-1 is a schematic diagram of a context learning information, which is shown in an exemplary embodiment of the present application.

Fig. 4-2 are diagrams illustrating another context learning information according to an exemplary embodiment of the present application.

Fig. 5 is a flowchart illustrating another video tag recognition method according to an exemplary embodiment of the present application.

Fig. 6 is a flowchart illustrating another video tag recognition method according to an exemplary embodiment of the present application.

Fig. 7 is a flowchart illustrating another video tag recognition method according to an exemplary embodiment of the present application.

Fig. 8 is a flowchart illustrating another video tag recognition method according to an exemplary embodiment of the present application.

FIG. 9 is a training schematic of a feature training module, shown in accordance with an exemplary embodiment of the present application.

Fig. 10 is a schematic diagram of an application chain of a video tag recognition method according to an exemplary embodiment of the present application.

Fig. 11 is a schematic diagram illustrating an overall structure flow of video tag identification according to an exemplary embodiment of the present application.

Fig. 12 is a block diagram showing the construction of a video tag recognition apparatus according to an exemplary embodiment of the present application.

Fig. 13 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.

The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the content and operations/, nor do they necessarily have to be performed in the order described. For example, some operations may be decomposed, and some operations may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.

It should be noted that the term "plurality" as used herein means two or more. "and/or" describes the association relationship of the association object, and indicates that there may be three relationships, for example, a and/or B may indicate that there are three cases of a alone, a and B together, and B alone. The character "/" generally indicates that the context-dependent object is an "or" relationship.

The following describes the technical scheme of the embodiment of the present application in detail.

Referring to fig. 1, fig. 1 is a schematic diagram of an implementation environment according to the present application. The implementation environment includes a terminal 10 and a server 20.

The terminal 10 is used for video to be identified and transmits the video to be identified to the server 20.

The server may also send the video tag of the video to be identified to the terminal, so that the terminal classifies the video or recommends the video according to the video tag.

In some embodiments, the server 20 may also acquire the video to be identified by itself, then perform multi-mode feature extraction, similar video retrieval and context learning information construction, and identify the video tag of the video to be identified according to the context learning information.

In some embodiments, the terminal 10 may also implement the process of identifying the video tag separately, that is, the terminal 10 obtains the video to be identified, so as to perform multi-mode feature extraction, similar video retrieval and context learning information construction, and further identify the video tag of the video to be identified.

The terminal 10 may be any electronic device capable of acquiring the modal data of the target object, such as a smart phone, a tablet, a notebook computer, a computer, an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal, or an aircraft, and the server 20 may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content distribution networks), and basic cloud computing services such as big data and artificial intelligent platforms, which are not limited herein.

The terminal 10 and the server 20 previously establish a communication connection through a network so that the terminal 10 and the server 20 can communicate with each other through the network. The network may be a wired network or a wireless network, and is not limited in this regard.

It should be noted that, in the specific embodiment of the present application, at least one of the video to be identified and the reference video relates to the object, when the embodiment of the present application is applied to the specific product or technology, the permission or consent of the object needs to be obtained, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.

Various implementation details of the technical solutions of the embodiments of the present application are set forth in detail below.

As shown in fig. 2, fig. 2 is a flowchart illustrating a video tag identification method according to an embodiment of the present application, which may be applied to the implementation environment shown in fig. 1, and the method may be performed by a terminal or a server, or may be performed by both the terminal and the server, and in the embodiment of the present application, the method is described by way of example as the method is performed by the server, and the video tag identification method may include S210 to S240, which will be described in detail below.

S210, acquiring a video to be identified, and extracting multi-modal features of the video to be identified to obtain first multi-modal features, wherein the first multi-modal features comprise a plurality of modal features respectively corresponding to a plurality of modalities.

In an embodiment of the present application, the video to be identified refers to a video for which a video tag has not been determined, and the video tag is a tag for identifying and classifying video contents, and the tag may be a specific keyword, phrase, or any descriptive information related to a video subject.

The method comprises the steps of obtaining a video to be identified, and carrying out multi-modal feature extraction on the video to be identified, wherein modes can be understood as different representation or obtaining modes of data, such as images and texts belong to different modes, the multi-modal comprises at least two modes, multi-modal feature extraction is carried out on the video to be identified, and further the obtained first multi-modal feature comprises a plurality of modal features obtained by respectively carrying out feature extraction on a plurality of modes, for example, feature extraction can be carried out on video content of the video to be identified, feature extraction can be carried out on text information of the video to be identified, feature extraction can be carried out on audio information of the video to be identified, and the first multi-modal feature comprises video features corresponding to the video content, text features corresponding to the text information and audio features corresponding to the audio information.

In an example, the video to be identified includes information of multiple modes, and feature extraction can be performed on a specific mode of the video to be identified according to an instruction of a service scene or a terminal to which the video to be identified belongs, for example, if the service scene to which the video to be identified belongs is a short video scene using a template, the template includes unified background music, so that feature extraction can be performed on text and video content of the video to be identified without performing feature extraction on audio information of the video to be identified.

In another example, feature extraction may be performed on a feature modality of a video to be identified, such as a music video, feature extraction may be performed on video content and sound, and feature extraction may be performed on video content, text, and sound of a video in a television/video segment.

S220, searching a plurality of reference videos similar to the video to be identified according to the first multi-mode characteristics, wherein each reference video carries a video tag.

In the embodiment of the application, a video library is pre-stored, the video library includes a plurality of candidate videos carrying video tags, the video tags of the candidate videos can be obtained through identification in steps S210 to S240, or can be determined by marking by related personnel, and the method is not limited herein.

After the first multi-mode feature of the video to be identified is obtained, a plurality of reference videos similar to the video to be identified are searched in a video library according to the first multi-mode feature, wherein the similarity between the searched reference videos and the video to be identified is larger than a preset similarity threshold.

In an example, the process of performing similar video retrieval according to the first multi-mode feature may be to compare each mode feature included in the first multi-mode feature with a corresponding mode feature of a candidate video in a video library, and further select a reference video corresponding to a feature similarity higher than a preset similarity threshold, for example, compare a video feature included in the first multi-mode feature with video features of each candidate video in the video library to select a first reference video set, compare a text feature included in the first multi-mode feature with text features of each candidate video in the video library to select a second reference video set, compare a video feature included in the first multi-mode feature with text features of each candidate video in the video library to select a third reference video set, compare a text feature included in the first multi-mode feature with video features of each candidate video in the video library to select a fourth reference video set, and obtain a final reference video similar to the video to be identified according to the first to fourth reference video set.

In another example, the process of performing similar video retrieval according to the first multi-modal feature may be to perform feature fusion on a plurality of modal features included in the first multi-modal feature to obtain a fused feature, and compare the fused feature with the fused feature of each candidate video in the video library to select a fifth reference video set.

In another example, a final reference video similar to the video to be identified may be obtained according to the first to fourth reference video sets and the fifth reference video set, for example, an intersection video of the first to fourth reference video sets and the fifth reference video set is used as the final reference video, and for example, the first to fourth reference video sets and the fifth reference video set are used as the final reference video.

S230, context learning information is built according to the second multi-mode features corresponding to the reference videos, the video tags carried by the reference videos and the first multi-mode features of the videos to be identified.

In the embodiment of the application, a second multi-mode feature is required to be obtained by extracting multi-mode features from each reference video, wherein the multi-mode feature extraction process of the video to be identified and the reference video is the same, the type of the modal features included in the first multi-mode feature is the same as the type of the modal features included in the second multi-mode feature, for example, the type of the features of the first multi-mode feature includes a video feature and a text feature, and the type of the features of the second multi-mode feature also includes a video feature and a text feature.

It can be understood that the second multi-mode feature of the reference video has a mapping relationship with the video tag, that is, the reference video is a learning example related to the video tag, and further context learning information is constructed according to the second multi-mode feature corresponding to each reference video, the video tag carried by each reference video, and the first multi-mode feature of the video to be identified, where the context learning information includes the context information provided by the learning example and is prompt information with analog learning.

In an example, the second multi-mode feature corresponding to each reference video, the video tag carried by each reference video, and the first multi-mode feature of the video to be identified may be spliced to obtain the context learning information.

S240, identifying the video tag of the video to be identified according to the context learning information.

As described above, the context learning information includes the context information provided by the learning example, and further, the video tag of the video to be identified is identified according to the context learning information, and the identification of the video tag of the video to be identified can be guided according to the context information provided by the learning example, so as to obtain the target video tag of the video to be identified.

In one example, the contextual learning information is input to a pre-trained tag generation model by which video tags for the video to be identified are generated from the contextual learning information.

In the embodiment of the application, the video to be identified is obtained, the multi-modal feature extraction is carried out on the video to be identified, the first multi-modal feature containing multiple modal features is generated, the multiple reference videos similar to the video to be identified are searched according to the first multi-modal feature, each reference video carries a video tag, the reference videos not only provide rich contrast information, but also provide important context basis for tag identification of the video to be identified, the context learning information is constructed according to the second multi-modal feature corresponding to each reference video and the video tag carried by each reference video, and the first multi-modal feature of the video to be identified, and the context learning information is used for carrying out context comparison and learning by utilizing the tags and the multi-modal features of the multiple reference videos, so that references are provided for identifying the video tags of the video to be identified, the bias possibly occurring under the single mode is reduced, the video tags to be identified can be identified rapidly and accurately, and the identified video tags are more accurate and reliable.

In an embodiment of the present application, another video tag identification method is provided, which may be applied to the implementation environment shown in fig. 1, and the method may be performed by a terminal or a server, or may be performed by the terminal and the server together, and in the embodiment of the present application, the method is performed by the server as an example, as shown in fig. 3, and S230 is extended to S310 to S330 on the basis of S210 to S240 shown in fig. 2, where a first multi-mode feature includes a first video frame feature, a second multi-mode feature includes a second video frame feature, and the video frame feature is extracted from a video frame of a video to be identified, and S310 to S330 are described in detail below.

And S310, performing feature conversion processing on the first video frame features to obtain a first visual text marking sequence, and performing feature conversion processing on the second video frame features corresponding to each reference video to obtain a second visual text marking sequence corresponding to each reference video.

In the embodiment of the application, the video tag is text information, the video tag and the video frame feature come from different forms and different modes, and in order to facilitate the subsequent construction of context learning information, the video frame feature can be converted into a visual text tag sequence through feature conversion processing, wherein the visual text tag sequence is a token sequence corresponding to the video frame feature, and the feature conversion processing process of the first video frame feature and the feature conversion processing of the second video frame feature are the same.

In an example, the number of the first video frame features comprises a plurality of first visual text marking sequences for performing feature conversion processing on the first video frame features, wherein the first visual text marking sequences are obtained by performing feature fusion processing on the plurality of first video frame features to obtain target video features and aligning the target video features to a preset text feature space through a pre-training feature alignment module.

The method comprises the steps of obtaining a target video feature by carrying out feature fusion processing on a plurality of first video frame features, wherein the target video feature can contain comprehensive video content, the feature fusion processing can be average pooling processing or weighting processing, the weight of each first video frame feature can be determined according to the position of a video frame corresponding to the first video frame feature in a video to be identified, for example, the closer the position of the video frame in the video to be identified is to the middle position, the larger the weight of the video frame corresponding to the video frame is, the closer the position of the video frame in the video to be identified is to the beginning or ending position, and the smaller the weight of the video frame feature corresponding to the video frame is.

In other embodiments of the present application, if only one first video frame feature exists, the first video frame feature may be enhanced to expand the video frame feature, and then feature fusion processing is performed based on the first video frame feature and the expanded video frame feature to obtain the target video feature.

In the embodiment of the application, the feature alignment module is pre-trained, and the target video features are input to the feature alignment module so as to be aligned to a preset text feature space, wherein the preset text feature space can be the input text feature space of the pre-trained label generation model, namely, the non-text token is aligned to the space of the label generation model input text token, so that the non-text features are translated into the understandable content of the label generation model, and the subsequent label generation model recognizes the video labels according to the context learning information.

S320, acquiring first text information corresponding to the video to be identified and second text information corresponding to each reference video.

According to the method and the device for obtaining the text information, the first text information can be obtained according to the attached text information of the video to be identified, for example, the attached text information comprises a video title and video release time, the video release platform, text content detected and identified from the video, for example, subtitles, texts obtained by converting human voice, and the like, and the text type contained in the second text information corresponding to the reference video is the same as the text type contained in the first text information.

It can be understood that the additional text information of the video to be identified has more and more information content, and in order to construct efficient and accurate context learning information, the additional text information can be screened to obtain the first text information.

In an example, the method comprises the steps of obtaining first text information corresponding to a video to be identified, obtaining a video title of the video to be identified and an identification text obtained by converting audio information of the video to be identified, generating situation description of the video to be identified according to video scenes of the video to be identified, and generating the first text information according to the video title, the identification text and the situation description.

The metadata can be parsed from the video file of the video to be identified to extract the video title, if the extracted video title includes special characters or formats, cleaning and normalization processes, such as special character removal, space processing, case normalization and the like, are required, and if the extracted video title is different from the language of the video tag, the extracted video title can be converted into the language corresponding to the video tag.

It will be appreciated that the video to be identified includes audio information that is not background audio, but rather audio from objects contained in the video to be identified, such as audio conversations, may be converted to identified text by automatic speech recognition techniques.

The video title and the recognition text are basic text information attached to the video, in order to further add descriptive language to provide more context background, the context information brought by the video is further enriched, so that the context description can be determined according to the video scene of the video to be recognized, wherein the scene content in the video can be detected and analyzed through objects, such as the scene corresponding to the video is determined through recognizing objects appearing in the video, the time sequence information of the video to be recognized is further analyzed, the scene change information of the video is determined, the context of the video is helped to be understood, and the context description is generated according to the scene and the scene change information of the video. For example, the detection of beach, sun, etc. elements in the video may generate a context description that resembles "on a sunny beach".

In an example, the context description may be further refined based on identifying keywords in the text and information such as speech emotion (e.g., cheerful, serious). For example, the detection of laughter or cheering in audio may infer that the context of the video may be a pleasant or celebratory scene.

After the video title, the identification text and the context description are acquired, the video title, the identification text and the context description can be spliced to obtain first text information, namely meaningful text information is acquired from a video, and the first text information with context understanding is generated, so that a basis is provided for subsequent tag generation.

In one example, if the content of the first text information is small, if the number of text words contained in the first text information is smaller than a preset number threshold, text data enhancement can be performed on the first text information, and diversified text input can be generated through synonym replacement, sentence reconstruction, text expansion and other methods.

The process of obtaining the second text information is the same as the process of obtaining the first text information, and will not be described here again.

S330, context learning information is constructed according to the second visual text marking sequence, the second text information and the video label corresponding to each reference video, and the first visual text marking sequence and the first text information.

In the embodiment of the application, the second visual text mark sequence, the second text information and the video tag corresponding to each reference video and the first visual text mark sequence and the first text information can be spliced to obtain the context learning information, wherein the second visual text mark sequence and the second text information corresponding to the same reference video are taken as inputs, the video tag is taken as outputs, the inputs and the outputs are spliced to obtain the input-output learning examples, the context information is provided through the learning examples, and the learning examples corresponding to the reference videos are spliced with the first visual text mark sequence and the first text information of the video to be identified to obtain the context learning information.

In an example, context learning information is built according to second visual features, second text information and video tags corresponding to each reference video, a first visual text mark sequence and first text information, the context learning information comprises building learning examples according to first text marks used for isolating different information under the same video and the second visual text mark sequence, the second text information and the video tags corresponding to the same reference video to obtain a plurality of learning examples corresponding to a plurality of reference videos, splicing the plurality of learning examples to obtain learning information, wherein the plurality of learning examples in the learning information are separated through the second text marks used for isolating the information of the different videos, building identification examples according to the first text marks, the first visual text mark sequence and the first text information, and generating the context learning information according to the learning information and the identification examples.

For example, the first text label and the second text label are special token, for example, the first text label is < pad >, and the first text label and the second text label are used for isolating different information under the same video, reference to learning examples corresponding to the video 1, namely the input-output examples, for example, a second visual text label sequence 1< pad > second text information 1< pad > video label 1, reference to learning examples corresponding to the video 2, for example, a second visual text label sequence 2< pad > second text information 2< pad > video label 2, and the like to obtain a plurality of learning examples corresponding to a plurality of reference videos. The second text mark is < eoc >, the second text mark is used for isolating information of different videos, a plurality of learning examples are spliced to obtain learning information, the learning information is a second visual text mark sequence 1< pad > the second text information 1< pad > the video tag 1< eoc >, a second visual text mark sequence 2< pad > the second text information 2< pad > the video tag 2< eoc >, and the like, and the video tag 1 and the second visual text mark sequence are information of different videos, so that the learning information is divided by the < eoc >.

In the embodiment of the application, an identification example is constructed, wherein the identification example is a first text message of a first visual text mark sequence < pad >, and further, learning information and the identification example are spliced together to form context learning information with prompt information, wherein the prompt information can be a mapping relation among the visual text mark sequence, the text information and a video tag, and the learning information and the identification example also belong to information of different videos, so that the learning information and the identification example are spliced through a second text mark to obtain the context learning information shown in the figure 4-1.

It can be appreciated that the mark forms of the first text mark and the second text mark can be flexibly adjusted according to practical situations, and < pad > and < eoc > are examples provided for the embodiment of the present application.

In other embodiments of the present application, when building the context learning information, in addition to combining multiple reference videos similar to the video to be identified, a negative video unrelated to the video to be identified may be introduced as a comparison, for example, a third visual text label sequence of the negative video, the third text information and a video label are spliced to obtain the reverse information through the first text label, and then the learning information, the reverse information and the identification example are spliced to obtain the context learning information as shown in fig. 4-2.

It should be noted that, for other detailed descriptions of S210 to S220 and S240 shown in fig. 3, please refer to S210 to S220 and S240 shown in fig. 2, and the detailed descriptions are omitted here.

According to the embodiment of the application, the multi-modal characteristics and the video labels of the reference video are listed first, the learning examples are constructed, then the multi-modal characteristics of the video to be identified are listed, the identification examples are constructed, and then the learning examples and the identification examples are spliced, so that the input containing similar video context information is constructed, the multi-modal information can be effectively utilized to perform the context learning, and the content labels of the video to be identified can be accurately generated later.

In the embodiment of the present application, the method is described by using the server as an example, as shown in fig. 5, and the video tag identification method extends S220 shown in fig. 2 to S510 to S530 on the basis of the implementation environment shown in fig. 2. The first multi-mode feature includes a target video feature and a first text feature, where the target video feature may be obtained by feature fusion of a plurality of first video frame features, and a feature dimension of the target video feature is the same as a feature dimension of the first text feature, and S510 to S530 are described in detail below.

S510, acquiring a pre-established video feature retrieval library and a text feature retrieval library, wherein the video feature retrieval library comprises mapping relations of each candidate video and video features corresponding to the candidate videos, and the text feature retrieval library comprises mapping relations of each candidate video and text features corresponding to the candidate videos.

In the embodiment of the application, a video feature retrieval library and a text feature retrieval library are pre-established, wherein the video feature retrieval library A comprises mapping relations of candidate videos and video features corresponding to the candidate videos, and the text feature retrieval library B comprises mapping relations of candidate videos and text features corresponding to the candidate videos.

S520, respectively searching a video feature search library and a text feature search library according to the target video features, and respectively searching the video feature search library and the text feature search library according to the first text features to obtain a plurality of target candidate videos similar to the video to be identified.

According to the target video feature and the first text feature, searching is carried out in a search library A and a search library B respectively, wherein four searching modes exist, namely 1) the target video feature searches the search library A, 1) candidate video sets corresponding to video features similar to the target video feature in the search library A are determined, 2) the target video feature searches the search library B, 2) candidate video sets corresponding to text features similar to the target video feature in the search library B are determined, 3) the first text feature searches the search library A, 3) candidate video sets corresponding to video features similar to the first text feature in the search library A are determined, 4) the first text feature searches the search library B, and 4) candidate video sets corresponding to text features similar to the first text feature in the search library B are determined.

When the video is searched, feature similarity between a target video feature and a video feature and between a first text feature and a text feature are calculated, wherein the feature similarity can be calculated through cosine similarity, euclidean distance and the like of feature vectors, and then a plurality of candidate videos with high similarity are selected as candidate video sets, for example, K candidate videos corresponding to the feature similarity top-K are selected to obtain candidate video sets, and the number of the candidate videos included in each candidate video set can be the same or different.

In an example, duplicate videos are deleted from candidate video set 1-candidate video set 4, and a specified number of candidate videos may be selected as a plurality of target candidate videos according to a ranking from high to low of similarity.

In another example, the videos in the candidate video set 1-candidate video set 4 may be all directly used as target candidate videos.

S530, selecting a reference video from the target candidate videos according to the video similarity between the target candidate videos and the video to be identified.

In an example, the video similarity between each target candidate video and the video to be identified can be calculated, and then a plurality of reference videos are selected from a plurality of target candidate videos according to the video similarity.

And selecting reference videos from a plurality of target candidate videos according to the average similarity, and sorting the average similarity from high to low, and further selecting top-N videos with high similarity, for example, selecting target candidate videos a, b and d as the reference videos, wherein N is smaller than K.

In other embodiments of the present application, in calculating the average similarity of each target candidate video, it is not necessary to assign the same weight to the four search modes, and weights may be dynamically assigned to each search mode, for example, the weights corresponding to search mode 1) and search mode 3) are greater than the weights corresponding to search mode 2) and search mode 4), so as to calculate the weighted average similarity of the target candidate video, and select the reference video through the weighted average similarity.

It should be noted that, the detailed description of S210, S230 to S240 shown in fig. 5 is please refer to S210, S230 to S240 shown in fig. 2, and the detailed description is omitted here.

According to the embodiment of the application, the video features and the text features are used for cross search in two different search libraries, so that the target candidate video similar to the video to be identified can be searched to the greatest extent by utilizing multi-mode information, and the accuracy and the reliability of selecting the reference video are improved by calculating the average similarity of the target candidate video and the video to be identified in each search mode, so that the top-N video most similar to the video to be identified and the label result thereof are obtained.

The embodiment of the application also provides another video tag identification method, which can be applied to the implementation environment shown in fig. 1, and can be executed by a terminal or a server, the method may be executed by the terminal and the server together, and in the embodiment of the present application, the method is described by the server as an example, as shown in fig. 6, and S210 shown in fig. 2 is extended to S610 to S640 on the basis of the method shown in fig. 2. Wherein, S610 to S640 are described in detail below.

S610, acquiring a video to be identified, extracting a plurality of video frames from the video to be identified, and dividing the plurality of video frames into a plurality of fragments.

S620, extracting target video frames from the plurality of fragments, and extracting video features of the target video frames to obtain first video frame features.

S630, obtaining first text information corresponding to the video to be identified, and extracting text features of the first text information to obtain first text features.

S640, obtaining a first multi-mode feature according to the first video frame feature and the first text feature.

In the embodiment of the application, a plurality of video frames can be extracted from the video to be identified according to the video duration of the video to be identified, for example, 1 frame of video frame is extracted from the video to be identified every preset time, the longer the video duration of the video to be identified is, the longer the preset time is, for example, the video to be identified is 30S video, one frame of video frame is extracted every 1S, the video to be identified is 1min video, one frame of video frame is extracted every 2S, a plurality of video frames can also be randomly extracted from the video to be identified, the video to be identified can be divided into three sections, and the video frames are extracted from the three sections according to different proportions.

After extracting the plurality of video frames, the plurality of video frames may be divided into a plurality of segments, where the plurality of video frames may be first time-ordered or randomly-ordered, and then the plurality of segments may be divided into the plurality of segments, and then the target video frames may be extracted from the plurality of segments, for example, 1 frame or multiple frames may be randomly extracted from each segment to obtain a plurality of target video frames, and the video feature extraction may be performed on the target video frames by using a pre-trained video feature encoder to obtain the first video frame feature.

The method for obtaining the first text information corresponding to the video to be identified can refer to an embodiment shown in fig. 3, which is not described in detail herein, and the text feature encoder performs text feature extraction on the first text information through the pre-trained text feature encoder.

The first video frame feature and the first text feature are taken as first multi-modal features.

In one example, the text feature encoder and the video feature encoder are trained using the CLIP (Contrastive Language-Image Pre-tracking) method such that the text feature extracted by the text feature encoder feature and the video feature extracted by the video feature encoder feature are similar.

It should be noted that, for other detailed descriptions of S220 to S240 shown in fig. 6, please refer to S220 to S240 shown in fig. 2, and further description is omitted herein.

In the embodiment of the application, a plurality of video frames are extracted from the video, the video frames are divided into fragments, the target video frames are extracted from the fragments for video feature extraction, so that the obtained first video frame features are representative and feature singleness is avoided, and based on the characteristics of the CLIP model, the video text features are extracted by an encoder to be close to each video frame feature under ideal conditions, so that the first multi-mode features are more accurate and reliable.

In an embodiment of the present application, another video tag identification method is provided, where the video tag identification method may be applied to the implementation environment shown in fig. 1, and the method may be performed by a terminal or a server, or may be performed by both the terminal and the server, and in an embodiment of the present application, the method is performed by the server as an example, and as shown in fig. 7, the video tag identification method extends the process of performing video tag identification in S240 in fig. 2 to S710 to S730 on the basis of the process shown in fig. 2 to 6. The steps S710 to S730 are described in detail below.

S710, performing sequence feature conversion on the context learning information to obtain a context learning sequence, wherein the sequence feature is an input feature supported by a label generation model.

In the embodiment of the application, the context learning information is a continuous long text sequence, in order to facilitate the understanding and processing of the label generation model, the context learning information needs to be subjected to sequence feature conversion processing, the continuous long text sequence is converted into a token sequence which can be understood and processed by the model, and the token is a basic input unit of the language model and can be a single character, word or sub-word, so that the model can better understand and process the semantics and structure of the text.

In one example, the context learning information may be sequence feature converted to a context learning sequence by a tokenizer (word segmenter) of a language model.

S720, inputting the context learning sequence into a label generation model, wherein the label generation model is obtained by keeping original model parameters of a preset language model frozen and adjusting newly-added model parameters of the language model according to the sample context learning sequence, and the newly-added model parameters are related to a low-rank self-adaptive module introduced into the language model.

In the embodiment of the application, the label generation model can be obtained through training a preset language model, and the language model is obtained through pre-training the model, so that the language model can be applied to the field of natural language processing (Natural Language Processing, NLP), such as machine translation, voice recognition, text generation and the like, wherein the language model comprises, but is not limited to, a large language model (Large Language Model, LLM), a multi-modal large language model (multimodal large language model, MLLM) and other generation type models, and on the basis, the language model is subjected to fine tuning processing to better fit the requirements of a service scene in order to enable the language model to be optimized for the application scene of the content label. In the embodiment of the application, parameters of a language model are efficiently fine-tuned by a Low-Rank adaptive module (LoRA), and the LoRA generates a model by introducing newly added model parameters, such as a Low-Rank matrix, into the language model, wherein the newly added model parameters are trainable, original model parameters of the language model remain frozen, and the newly added model parameters of the language model are adjusted according to a sample context learning sequence, so that the language model comprising the newly added model parameters and the original model parameters is used as a label generation model.

S730, obtaining a target video tag of the video to be identified, which is output by the tag generation model.

The tag generation model may learn from the analogy based on the context information provided by the learning example in the context learning information, the identification of the video tag of the video to be identified.

It should be noted that, for other detailed descriptions of S210 to S230 shown in fig. 7, please refer to S210 to S230 shown in fig. 2, and further description is omitted herein.

In the embodiment of the application, the context learning information comprises a reference prompt with a video tag generated based on multi-mode characteristics, the reference prompt is converted into a language model to better understand and process the token sequence, the tag generation model is obtained based on the language model and can refer to the prompt, the tag is quickly generated for the video to be identified, and the tag generation model is subjected to parameter efficient fine adjustment based on the language model, so that the task requirement of the video tag can be better met.

It should be noted that in an embodiment of the present application, another video tag identification method is provided, where the video tag identification method may be applied to the implementation environment shown in fig. 1, and the method may be performed by a terminal or a server, or may be performed by the terminal and the server together, and in an embodiment of the present application, the method is performed by the server as an example, and as shown in fig. 8, the video tag identification method adds a training process of a tag generation model on the basis of the training process shown in fig. 7, that is, S810 to S850. S810 to S850 are described in detail below.

S810, acquiring a first sample video feature, first sample text information and a first sample video tag corresponding to the first sample video, and a second sample video feature, second sample text information and a second sample video tag corresponding to the second sample video.

In the embodiment of the application, a model is trained through a training set, the training set comprises a plurality of sample videos, each sample video carries a video tag, a first sample video is a reference sample video in training, a second sample video is a sample video to be identified in training, the first sample video and the second sample video are similar videos, the process of acquiring sample video features of the sample videos can be shown in fig. 6, and the process of acquiring sample text information can be shown in fig. 3 and is not repeated herein.

S820, performing feature alignment processing on the first sample video feature and the second sample video feature according to the pre-trained initial feature alignment module to obtain a first sample visual marker sequence and a second sample visual marker sequence, and constructing sample context learning information according to the first sample visual marker sequence, the first sample text information and the first sample video tag, and the second sample visual marker sequence and the second sample text information.

In the embodiment of the application, an initial feature alignment module is obtained through pre-training, the initial feature alignment module can map video features to a space aligned with text features, so that visual information can be better understood and utilized by a language model, feature alignment processing is respectively carried out on the first sample video features and the second sample video features through the initial feature alignment module, so that the first sample video features and the second sample video features are mapped to the space aligned with the text features of the language model, a first sample visual marker sequence and a second sample visual marker sequence are respectively obtained, and further, sample context learning information is constructed according to the first sample visual marker sequence, the first sample text information and the first sample video tag, and the second sample visual marker sequence and the second sample text information, and the construction process is shown in fig. 3.

In an example, the initial feature alignment module can train through image and text description pairs and is combined with a language model, the training step of the initial feature alignment module comprises the steps of obtaining sample images and sample description texts corresponding to the sample images, extracting features of the sample images to obtain sample visual features, inputting the sample visual features to a module to be trained so that the module to be trained aligns the sample visual features to an input text feature space of the language model, obtaining sample text features corresponding to alignment description instructions, inputting target sample visual text features output by the sample text features and the feature alignment module to the language model, wherein model parameters of the language model are kept frozen, obtaining sample prediction description texts output by the language model, and training the module to be trained according to the sample description texts and the sample prediction description texts to obtain the initial feature alignment module.

The module to be trained can be a network with a Full Connection (FC) layer, a multi-layer perceptron (MLP) or other neural network structures such as q-precursors, and the sample description text is used for describing the image content of the sample image. As shown in fig. 9, the sample image is input to a video feature encoder, and a sample visual feature containing main visual information of the sample image, such as an object, a background, a color, etc., is extracted by the video feature encoder. And meanwhile, an alignment description instruction is acquired, the alignment description instruction is used for indicating a language model to generate a description text corresponding to the image, and the alignment description instruction can be a text instruction, for example, one of the following sentences is adopted, namely, the following pictures are briefly described. "," provides a brief description of a given picture. "," succinctly explain the provided pictures. "etc., so that the alignment description instructions can be processed by the language model, and thus converted to sample text features by Text Tokenizer processing, the sample text features are input to the language model.

Since the visual features and the text features come from different modalities, the direct input of the visual features and the text features into a language model is generally poor in effect, and therefore the sample visual features need to be input into a module to be trained, and the module to be trained is used for mapping the visual features into a space consistent with the text features through learning, so that the target sample visual text features can be aligned with the text features in the same or similar feature space.

It should be noted that, when the module to be trained is trained, model parameters of the language model are kept frozen, after the sample text features and the target sample visual text features are input into the language model, the language model combines the target sample visual text features and the sample text features to generate a sample prediction description text, parameters of the module to be trained can be adjusted according to differences of the sample description text corresponding to the sample prediction description text and sample images until the module to be trained converges to obtain the initial feature alignment module, and through training, the initial feature alignment module can effectively align the visual features to a space consistent with the text features, so that the language model can perform excellent in multi-mode input.

S830, performing sequence feature conversion on the sample context learning information to obtain a sample context learning sequence.

And performing sequence feature conversion on the sample context learning information through tokenizer of the language model to obtain a sample context learning sequence.

S840, introducing the low-rank adaptive module into the language model to introduce new model parameters into the language model through the low-rank adaptive module.

S850, original model parameters of the language model are kept frozen, and newly added model parameters of the language model are adjusted according to the sample context learning sequence and the second sample video label to obtain a label generation model.

In the embodiment of the application, the LoRA module is introduced into the language model, the LoRA module introduces a low-rank matrix in certain layers of the model, the LoRA module transforms input data through the low-rank matrix and then transmits the transformed data to the next layer of the model in the forward propagation process, the transformation is equivalent to adding a layer of additional parameter adjustment on the basis of the language model, namely, introducing new model parameters into the language model without changing the parameters of the original model, only training the new model parameters in the training process, keeping the original model parameters of the language model frozen without updating the original model parameters, and in this way, the fine adjustment of the model can be realized under the condition of keeping the original model parameters unchanged.

In an example, a sample context learning sequence is input to a language model introduced into a low-rank adaptive module, and a predicted sample label output by the language model is obtained, the language model can be analogically learned from the sample context learning sequence to generate a predicted sample label corresponding to a second sample video, and the second sample video corresponds to the second sample video label, so that a label generation model can be obtained by adjusting newly-added model parameters of the language model according to differences between the predicted sample label and the second sample video label, and a loss function, such as a contrast loss function, a mean square error loss function, and the like, is calculated according to the difference between the predicted sample label and the second sample video label, and the newly-added model parameters are further optimized according to the loss function until the difference between the predicted sample label and the second sample video label is smaller than a preset threshold value.

It should be noted that, while training the LoRA modules, the initial feature alignment module may be further optimized, that is, the module parameters of the initial feature alignment module may be adjusted according to the difference between the predicted sample tag and the second sample video tag, for example, the module parameters of the initial feature alignment module may be further optimized according to the loss function until the difference between the predicted sample tag and the second sample video tag is less than the preset threshold, so as to obtain the final feature alignment module.

In the embodiment of the application, the feature alignment module is initially trained by combining with the language model, parameters of the language model are kept frozen, the feature alignment module learns to align visual features with text features through training, so that effective conversion and fusion among multi-mode features are realized, then LoRA is introduced into the language model, loRA achieves effective adjustment of model output with minimum parameter change in a low-rank decomposition mode, the task requirement of a video tag is adapted, the LoRA is trained, and meanwhile, the adapter is continuously trained, so that the model can generate content tags for videos to be identified according to multi-mode context examples, and the feature alignment module is more accurate in feature alignment.

In order to facilitate understanding, the embodiment of the present application further provides a method for identifying a video tag, as shown in fig. 10, where a video enters a content processing link from a content production link, obtains corresponding content features through a man-machine collaboration manner, and enters a downstream content distribution link. The embodiment of the application provides a video tag identification method, which belongs to a machine marking link.

The whole process of video tag recognition can be seen in fig. 11, which includes video multi-modal feature extraction, similar video retrieval, context learning instruction construction and model recognition processes, wherein LLM is exemplified by LLM model, and LLM can select open-source chinese LLM, such as chinese-llama, BLOOM, etc.

The video multi-modal feature extraction comprises the steps of extracting two modalities, namely a video frame and a video text from an original video. Firstly, 1 second of a video 1 frame is firstly extracted, then the video is divided into M segments in an equant mode, 1 frame is randomly taken from each segment in a training mode, the most middle 1 frame is taken from each segment in a test mode, M frames are obtained altogether, video texts are extracted, namely, the title of the video is extracted, the texts obtained through ASR (automatic generation) conversion of the video are combined into a sentence.

The reason why the original audio features are not used in the embodiment of the application is that based on the service scene, the background sounds in the short video are unified popular background music, the judgment is not very good, and important oral playing information is extracted through ASR and combined into video text information, and the original audio signals are not used as one of the inputs.

The embodiment of the application uses a CLIP method to train a video feature encoder (cv encoder) and a text feature encoder (text encoder) to extract multi-mode features of the video. Specifically, the video frame is composed of M images, video frame characteristics are obtained through calculation of a cv encoder, and are marked as f= { f^v₁, f^v₂,…, f^v_M }, video titles and other texts such as ASR are spliced together, video text characteristics are obtained through calculation of a text encoder, and are marked as f^t. Based on the characteristics of the CLIP model alignment feature, the video text feature is ideally relatively close to each video frame feature.

Before performing similar video retrieval, a retrieval library needs to be constructed. For each video, the video frame feature f= { F^v₁, f^v₂,…, f^v_M } and the video text feature F^t can be obtained according to the above method, in order to simplify the calculation flow, the video feature F is subjected to average pooling (average pooling processing) to obtain a feature F^v with the same dimension as the text feature, and each video in the search library has a labeling result of a label, at this time, two search libraries can be built, namely a search library A { F^v = > video }, a search library B { F^t = > video }, and a search library B is built by using the video text feature.

The above step of constructing the search pool may be accomplished by offline computing.

And for the video to be identified, after the visual feature f^v and the text feature f^t are extracted, cross search is carried out on the search libraries A and B, and a top-K result of 4 paths of search is returned. Based on the description, the characteristics of the CLIP are aligned with the visual and characteristic characteristics in an ideal condition, besides the visual characteristic search library A and the text characteristic search library B, the visual characteristic search text characteristic search library B and the text characteristic search library A can be used for cross search, videos in the top-K results of the 4 paths are respectively calculated to be similar to the videos to be identified, then the average similarity (if a certain video is not in a certain top-K path, the similarity is regarded as 0) of the videos obtained by search in the 4 paths is calculated, top-N videos and corresponding label results are obtained according to the average similarity and are used as subsequent context reference samples. Where K may be set to a value greater than N, such as k=2×n.

In the embodiment of the application, the basic thought of contextual learning (In Context Learning, ICL) is used for learning from analogy and expanding the learning to multi-modal application scenes, and in the embodiment of the application, top-N samples closest to the video content to be identified are recalled by the similar video retrieval method described above, and the input of the LLM is constructed by using the following templates.

Specifically, the template firstly lists the multi-modal characteristics of top-N videos and corresponding label results, and then gives out the characteristics of the videos to be identified, so that the LLM generates the content labels thereof. Wherein < pad > and < eoc > are both special token for separating different modalities and resulting token sequences within one video, and < eoc > represents end of chunk for separating token sequences of different videos. The method comprises the steps of obtaining a template video text i which is information of a title of the video and text such as ASR (automatic generation) and the like (the original text is not text features f^t), obtaining a video label i which is label marking result text corresponding to the video, obtaining a token sequence of the visual features through conversion of the video frame features f^v through an adapter (feature alignment module) to be trained in the template, generating the token sequence (namely the context learning sequence) through tokenizer of the LLM by the whole input sequence, and thus constructing an input similar to the context demonstration example to the LLM by the method, and referring to the graph of FIG. 4-1.

The model identification process, LLM, receives a token sequence (i.e., the context learning sequence described above) and generates a tag based on this sequence.

It should be noted that, the embodiment of the present application is different from the traditional LLM without training, and considering that the task of content tagging has a certain subjectivity, appropriate fine tuning of LLM can be better fit to the requirements of service scene, so that a LoRA is used to perform parameter efficient fine tuning on LLM. The LLM parameters remain totally frozen throughout the training process, training only two modules LoRA and adapter.

The training method of LLM includes two key phases.

Stage one, training the adapter to align the visual features. Only adapter was trained at this stage, without LoRA added, using the picture-text description pair.

At this stage, no other information is provided to the LLM, but only the visual token can be received, so that the adapter is forced to successfully "translate" the visual information to the LLM to generate the correct picture description text, so that the adapter can train better, and particularly, see fig. 9.

Stage two, training LoRA. At this stage, LLM adds LoRA while training adapters and LoRA to allow the model to generate label results for the video to be identified according to the context instance, as shown in fig. 11, at which stage the input token sequence is generated according to the template as described above, without requiring manual design instructions.

In the embodiment of the application, if the LLM model generates a label result for the video to be identified in the training process, the selected context example combination and template structure can be optimized according to the quality feedback of the generated label. For example, if the model has a low quality of labels generated for certain categories, then the context example combinations associated with those categories may be preferentially selected for secondary learning.

It can be understood that after the model training is completed, a new video predictive label can be used, and after the visual characteristics and the text characteristics of the video are extracted, by searching similar videos and constructing a context enhanced input token sequence according to a template, the LLM can directly generate a corresponding content label result.

The video tag identification method provided by the application utilizes the inherent strong reasoning capability of the large model, also considers the standard difference under the service characteristic scene, provides a multi-mode large model based on context enhancement, combines the thought of context learning, introduces a similar video search, outputs the top-N recalled similar video of the video to be identified and the corresponding tag result to the large model in the form of a reference sample to be inferred together, enhances the characteristics of the input model, and realizes the training target with fewer training samples.

Embodiments of the apparatus of the present application are described herein as being used to perform the video tag identification method of the above embodiments of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the video tag identification method of the present application.

An embodiment of the present application provides a video tag recognition apparatus, as shown in fig. 12, including.

The obtaining module 1210 is configured to obtain a video to be identified, and perform multi-modal feature extraction on the video to be identified to obtain a first multi-modal feature, where the first multi-modal feature includes a plurality of modal features corresponding to a plurality of modes respectively.

The retrieving module 1220 is configured to retrieve a plurality of reference videos similar to the video to be identified according to the first multimodal feature, where each reference video carries a video tag.

The construction module 1230 is configured to construct context learning information according to the second multi-modal feature corresponding to each of the reference videos, the video tag carried by each of the reference videos, and the first multi-modal feature of the video to be identified.

And an identifying module 1240, configured to identify the video tag of the video to be identified according to the context learning information.

In one embodiment of the application, based on the foregoing scheme, the first multi-modal feature includes a first video frame feature, the second multi-modal feature includes a second video frame feature, the construction module is further configured to perform feature conversion processing on the first video frame feature to obtain a first visual text tag sequence, perform feature conversion processing on the second video frame feature corresponding to each reference video to obtain a second visual text tag sequence corresponding to each reference video, obtain first text information corresponding to the video to be identified and second text information corresponding to each reference video, and construct the context learning information according to the second visual text tag sequence, the second text information and the video tag corresponding to each reference video, and the first visual text tag sequence and the first text information.

In one embodiment of the application, based on the above scheme, the number of the first video frame features includes a plurality of first video frame features, the construction module is further configured to perform feature fusion processing on the plurality of first video frame features to obtain target video features, and align the target video features to a preset text feature space through a pre-training feature alignment module to obtain the first visual text marker sequence.

In one embodiment of the application, based on the scheme, the construction module is further used for constructing a learning example according to a first text mark used for isolating different information under the same video and the second visual text mark sequence, the second text information and the video tag corresponding to the same reference video to obtain a plurality of learning examples corresponding to a plurality of reference videos, splicing the learning examples to obtain learning information, wherein the learning examples in the learning information are separated through the second text mark used for isolating the information of the different videos, constructing an identification example according to the first text mark, the first visual text mark sequence and the first text information, and generating the contextual learning information according to the learning information and the identification example.

In one embodiment of the application, based on the scheme, the acquisition module is further used for acquiring a video title of the video to be identified and an identification text obtained by converting audio information of the video to be identified, generating a situation description of the video to be identified according to a video scene of the video to be identified, and generating the first text information according to the video title, the identification text and the situation description.

In one embodiment of the application, based on the scheme, the plurality of target candidate videos comprise a plurality of target candidate videos retrieved by each retrieval mode, the retrieval module is further used for calculating the similarity between the video to be identified and each target candidate video according to the plurality of target candidate videos retrieved by each retrieval mode, calculating the average similarity between the target candidate videos and the video to be identified in each retrieval mode according to each target candidate video, and selecting the reference video from the plurality of target candidate videos according to the average similarity.

In one embodiment of the present application, based on the foregoing scheme, the obtaining module is further configured to extract a plurality of video frames from the video to be identified, divide the plurality of video frames into a plurality of segments, extract a target video frame from the plurality of segments, perform video feature extraction on the target video frame to obtain a first video frame feature, obtain first text information corresponding to the video to be identified, and perform text feature extraction on the first text information to obtain a first text feature, and obtain the first multi-modal feature according to the first video frame feature and the first text feature.

In one embodiment of the application, based on the scheme, the method further comprises the steps of performing sequence feature conversion on the context learning information to obtain a context learning sequence, wherein the sequence feature is an input feature supported by a tag generation model, inputting the context learning sequence into the tag generation model, the tag generation model is obtained by keeping original model parameters of a preset language model frozen and adjusting newly-added model parameters of the language model according to a sample context learning sequence, the newly-added model parameters are related to a low-rank adaptive module introduced into the language model, and acquiring a target video tag of the video to be identified, which is output by the tag generation model.

In one embodiment of the present application, based on the foregoing scheme, the apparatus further includes a training module configured to obtain a first sample video feature, a first sample text information, and a first sample video tag corresponding to a first sample video, and a second sample video feature, a second sample text information, and a second sample video tag corresponding to a second sample video, perform feature alignment processing on the first sample video feature and the second sample video feature according to a pre-trained initial feature alignment module, respectively, to obtain a first sample visual tag sequence and a second sample visual tag sequence, and construct sample context learning information according to the first sample visual tag sequence, the first sample text information, and the first sample video tag, and the second sample context learning information, perform sequence feature conversion on the sample context learning information, to obtain the sample context learning sequence, introduce the low-adaptive module into the language model, introduce the new enhancement parameters into the language model according to the pre-trained initial feature alignment module, hold the new enhancement parameters into the language model, and generate the new language model according to the enhancement parameters, and perform the enhancement parameters adjustment on the language model according to the new model.

In one embodiment of the application, based on the scheme, the training module is further used for acquiring a sample image and a sample description text corresponding to the sample image, extracting characteristics of the sample image to obtain sample visual characteristics, inputting the sample visual characteristics to a module to be trained so that the module to be trained aligns the sample visual characteristics to an input text characteristic space of the language model, acquiring sample text characteristics corresponding to an alignment description instruction, inputting the sample text characteristics and target sample visual text characteristics output by the characteristic alignment module to the language model, wherein model parameters of the language model are kept frozen, acquiring sample prediction description text output by the language model, and training the module to be trained according to the sample description text and the sample prediction description text to obtain the initial characteristic alignment module.

In one embodiment of the present application, based on the foregoing scheme, the training module is further configured to input the sample context learning sequence to a language model that is introduced into the low-rank adaptive module, and obtain a predicted sample tag that is output by the language model, adjust the newly added model parameter of the language model according to a difference between the predicted sample tag and the second sample video tag to obtain the tag generation model, and adjust the module parameter of the initial feature alignment module according to a difference between the predicted sample tag and the second sample video tag.

It should be noted that, the apparatus provided in the foregoing embodiments and the method provided in the foregoing embodiments belong to the same concept, and the specific manner in which each module and unit perform the operation has been described in detail in the method embodiments, which is not repeated herein.

The device provided in the above embodiment may be provided in the terminal or in the server.

The embodiment of the application also provides electronic equipment, which comprises one or more processors and a storage device, wherein the storage device is used for storing one or more computer programs, and the electronic equipment is enabled to realize the video tag identification method when the one or more computer programs are executed by the one or more processors.

It should be noted that, the computer system 1300 of the electronic device shown in fig. 13 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.

As shown in fig. 13, the computer system 1300 includes a processor (Central Processing Unit, CPU) 1301 that can perform various appropriate actions and processes, such as performing the methods in the above-described embodiments, according to a program stored in a Read-Only Memory (ROM) 1302 or a program loaded from a storage portion 1308 into a random access Memory (Random Access Memory, RAM) 1303. In the RAM 1303, various programs and data required for the system operation are also stored. The CPU 1301, ROM 1302, and RAM 1303 are connected to each other through a bus 1304. An Input/Output (I/O) interface 1305 is also connected to bus 1304.

In some embodiments, the following components are connected to I/O interface 1305, including an input portion 1306 including a keyboard, a mouse, etc., an output portion 1307 including a Cathode Ray Tube (CRT), a Liquid crystal display (Liquid CRYSTAL DISPLAY, LCD), etc., and a speaker, etc., a storage portion 1308 including a hard disk, etc., and a communication portion 1309 including a network interface card such as a LAN (Local Area Network ) card, modem, etc. The communication section 1309 performs a communication process via a network such as the internet. The drive 1310 is also connected to the I/O interface 1305 as needed. Removable media 1311, such as magnetic disks, optical disks, magneto-optical disks, semiconductor memory, and the like, is mounted on drive 1310 as needed so that a computer program read therefrom is mounted into storage portion 1308 as needed.

In particular, according to embodiments of the present application, the process described above with reference to the flowcharts may be implemented as a computer program. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising a computer program for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication portion 1309 and/or installed from the removable medium 1311. When executed by a processor (CPU) 1301, performs various functions defined in the system of the present application.

It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory), a flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with a computer-readable computer program embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. A computer program embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, etc., or any suitable combination of the foregoing.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. Where each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer programs.

The units or modules involved in the embodiments of the present application may be implemented in software, or may be implemented in hardware, and the described units or modules may also be disposed in a processor. Where the names of the units or modules do not in some way constitute a limitation of the units or modules themselves.

Another aspect of the application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements a video tag identification method as described above. The computer-readable storage medium may be included in the electronic device described in the above embodiment or may exist alone without being incorporated in the electronic device.

Another aspect of the present application also provides a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the electronic device reads the computer program from the computer-readable storage medium, and the processor executes the computer program to cause the electronic device to execute the video tag identification method provided in the above-described respective embodiments.

It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.

Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.

The foregoing is merely illustrative of the preferred embodiments of the present application and is not intended to limit the embodiments of the present application, and those skilled in the art can easily make corresponding variations or modifications according to the main concept and spirit of the present application, so that the protection scope of the present application shall be defined by the claims.

Claims

1. A method for identifying a video tag, comprising:

Acquiring a video to be identified, and extracting multi-modal features of the video to be identified to obtain a first multi-modal feature, wherein the first multi-modal feature comprises a plurality of modal features corresponding to a plurality of modes respectively, and the first multi-modal feature comprises a first video frame feature;

Searching a plurality of reference videos similar to the video to be identified according to the first multi-modal characteristics, and extracting multi-modal characteristics of each reference video to obtain second multi-modal characteristics, wherein the second multi-modal characteristics comprise second video frame characteristics, and each reference video carries a video tag;

performing feature conversion processing on the first video frame features to obtain a first visual text marking sequence, and performing feature conversion processing on the second video frame features corresponding to each reference video to obtain a second visual text marking sequence corresponding to each reference video;

Acquiring first text information corresponding to the video to be identified and second text information corresponding to each reference video;

Taking a second visual text marking sequence and second text information corresponding to the same reference video as input, taking a video tag as output, and splicing the input and the output to obtain an input-output learning example;

splicing the learning examples corresponding to the multiple reference videos with a first visual text marking sequence and first text information of the video to be identified to obtain context learning information, wherein the context learning information comprises context information provided by the learning examples and is prompt information with analog learning;

And identifying the video tag of the video to be identified according to the context learning information.

2. The method of claim 1, wherein the number of first video frame features comprises a plurality of, wherein performing feature conversion processing on the first video frame features to obtain a first sequence of visual text labels comprises:

Performing feature fusion processing on the plurality of first video frame features to obtain target video features;

And aligning the target video features to a preset text feature space through a pre-training feature alignment module to obtain the first visual text marking sequence.

3. The method of claim 1, wherein the splicing the learning examples corresponding to the plurality of reference videos with the first visual text tag sequence and the first text information of the video to be identified to obtain the context learning information comprises:

According to a first text mark used for isolating different information under the same video, and the second visual text mark sequence, the second text information and the video tag corresponding to the same reference video, a learning example is built, so that a plurality of learning examples corresponding to a plurality of reference videos are obtained;

splicing the plurality of learning examples to obtain learning information, wherein the plurality of learning examples in the learning information are separated by a second text mark for isolating information of different videos;

Constructing an identification example according to the first text mark, the first visual text mark sequence and the first text information;

The contextual learning information is generated from the learning information and the recognition example.

4. The method according to claim 1, wherein the obtaining the first text information corresponding to the video to be identified and the second text information corresponding to each of the reference videos includes:

Acquiring a video title of the video to be identified and an identification text obtained by converting the audio information of the video to be identified;

generating a situation description of the video to be identified according to the video scene of the video to be identified;

generating the first text information according to the video title, the identification text and the context description.

5. The method of claim 1, wherein the first multimodal feature comprises a target video feature and a first text feature, wherein retrieving a plurality of reference videos similar to the video to be identified based on the first multimodal feature comprises:

acquiring a pre-established video feature retrieval library and a text feature retrieval library, wherein the video feature retrieval library comprises mapping relations of each candidate video and video features corresponding to the candidate videos, and the text feature retrieval library comprises mapping relations of each candidate video and text features corresponding to the candidate videos;

searching the video feature searching library and the text feature searching library according to the target video features respectively, and searching the video feature searching library and the text feature searching library according to the first text features respectively to obtain a plurality of target candidate videos similar to the video to be identified;

And selecting the reference video from the target candidate videos according to the video similarity between the target candidate videos and the video to be identified.

6. The method of claim 5, wherein the plurality of target candidate videos includes a plurality of target candidate videos retrieved in each retrieval manner, wherein the selecting the reference video from the plurality of target candidate videos according to video similarities between the plurality of target candidate videos and the video to be identified includes:

the multiple target candidate videos are retrieved according to each retrieval mode, and the similarity between the video to be identified and each target candidate video is calculated;

for each target candidate video, calculating the average similarity between the target candidate video and the video to be identified in each retrieval mode;

and selecting the reference video from the target candidate videos according to the average similarity.

7. The method according to claim 1, wherein the performing multi-modal feature extraction on the video to be identified to obtain a first multi-modal feature includes:

Extracting a plurality of video frames from the video to be identified, and dividing the plurality of video frames into a plurality of fragments;

extracting target video frames from the plurality of fragments, and extracting video features of the target video frames to obtain first video frame features;

Acquiring first text information corresponding to the video to be identified, and extracting text features of the first text information to obtain first text features;

and obtaining the first multi-mode feature according to the first video frame feature and the first text feature.

8. The method according to any one of claims 1 to 7, wherein the identifying the video tag of the video to be identified based on the contextual learning information comprises:

Performing sequence feature conversion on the context learning information to obtain a context learning sequence, wherein the sequence feature is an input feature supported by a label generation model;

Inputting the context learning sequence into the label generation model, wherein the label generation model is obtained by keeping original model parameters of a preset language model frozen and adjusting newly-added model parameters of the language model according to the sample context learning sequence, and the newly-added model parameters are related to a low-rank self-adaptive module introduced into the language model;

and acquiring a target video tag of the video to be identified, which is output by the tag generation model.

9. The method of claim 8, wherein the training step of the tag generation model comprises:

acquiring a first sample video feature, first sample text information and a first sample video tag corresponding to a first sample video, and a second sample video feature, second sample text information and a second sample video tag corresponding to a second sample video;

Performing feature alignment processing on the first sample video feature and the second sample video feature according to a pre-trained initial feature alignment module to obtain a first sample visual marker sequence and a second sample visual marker sequence, and constructing sample context learning information according to the first sample visual marker sequence, the first sample text information and the first sample video tag, and the second sample visual marker sequence and the second sample text information;

Performing sequence feature conversion on the sample context learning information to obtain a sample context learning sequence;

introducing the low-rank adaptive module into the language model to introduce the newly added model parameters into the language model through the low-rank adaptive module;

And keeping the original model parameters of the language model frozen, and adjusting the newly added model parameters of the language model according to the sample context learning sequence and the second sample video label to obtain the label generation model.

10. The method of claim 9, wherein prior to feature alignment processing of the video features according to the pre-trained initial feature alignment module, the method further comprises:

acquiring a sample image and a sample description text corresponding to the sample image;

extracting features from the sample image to obtain sample visual features, and inputting the sample visual features to a module to be trained so that the module to be trained aligns the sample visual features to an input text feature space of the language model;

Acquiring sample text features corresponding to the alignment description instruction, and inputting the sample text features and target sample visual text features output by the feature alignment module into the language model, wherein model parameters of the language model are kept frozen;

And acquiring a sample prediction description text output by the language model, and training the module to be trained according to the sample description text and the sample prediction description text to obtain the initial feature alignment module.

11. The method of claim 9, wherein said adjusting the model parameters of the language model based on the sample context learning sequence to obtain the tag generation model comprises:

inputting the sample context learning sequence into a language model which is introduced into the low-rank self-adaptive module, and acquiring a prediction sample label which is output by the language model;

According to the difference between the prediction sample label and the second sample video label, the newly added model parameters of the language model are adjusted to obtain the label generation model;

The method further comprises the steps of:

And adjusting the module parameters of the initial feature alignment module according to the difference between the prediction sample label and the second sample video label.

12. A video processing apparatus, comprising:

The system comprises an acquisition module, a first video frame feature and a second video frame feature, wherein the acquisition module is used for acquiring a video to be identified, and carrying out multi-mode feature extraction on the video to be identified to obtain a first multi-mode feature, the first multi-mode feature comprises a plurality of mode features corresponding to a plurality of modes respectively, and the first multi-mode feature comprises the first video frame feature;

The retrieval module is used for retrieving a plurality of reference videos similar to the video to be identified according to the first multi-modal characteristics, carrying out multi-modal characteristic extraction on each reference video to obtain second multi-modal characteristics, wherein the second multi-modal characteristics comprise second video frame characteristics, and each reference video carries a video tag;

The construction module is used for carrying out feature conversion processing on the first video frame features to obtain a first visual text marking sequence, and carrying out feature conversion processing on the second video frame features corresponding to the reference videos to obtain a second visual text marking sequence corresponding to the reference videos; the method comprises the steps of obtaining first text information corresponding to a video to be identified and second text information corresponding to each reference video respectively, taking a second visual text marking sequence corresponding to the same reference video and the second text information as input, taking a video tag as output, and splicing the input and the output to obtain an input-output learning example;

and the identification module is used for identifying the video tag of the video to be identified according to the context learning information.

13. An electronic device, comprising:

One or more processors;

storage means for storing one or more programs that, when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-11.

14. A computer readable storage medium, having stored thereon a computer program which, when executed by a processor of an electronic device, causes the electronic device to perform the method of any of claims 1 to 11.

15. A computer program product, characterized in that it comprises a computer program stored in a computer readable storage medium, from which a processor of an electronic device reads and executes the computer program, causing the electronic device to perform the method of any one of claims 1 to 11.