Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments can be embodied in many different forms and should not be construed as limited to the embodiments set forth herein, but rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the exemplary embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted.
The described features, structures, or characteristics of the disclosure may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. However, those skilled in the art will recognize that the aspects of the present disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The drawings are merely schematic illustrations of the present disclosure, in which like reference numerals denote like or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in at least one hardware module or integrated circuit or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and not necessarily all of the elements or steps are included or performed in the order described. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
In this specification, the terms "a," "an," "the," "said" and "at least one" are used to indicate the presence of at least one element/component/etc., the terms "comprising," "including" and "having" are intended to be open-ended, meaning that there may be additional elements/components/etc., in addition to the listed elements/components/etc., and the terms "first," "second," and "third," etc., are used merely as labels, and are not limiting in number of their objects.
FIG. 1 shows a schematic diagram of an exemplary system architecture to which the knowledge extraction method of embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture may include a server 101, a network 102, a terminal device 103, a terminal device 104, and a terminal device 105. Network 102 is the medium used to provide communication links between terminal device 103, terminal device 104, or terminal device 105 and server 101. Network 102 may include various connection types such as wired, wireless communication links, or fiber optic cables, among others.
The server 101 may be a server providing various services, such as a background management server providing support for operations performed by a user using the terminal device 103, the terminal device 104, or the terminal device 105. The background management server may perform analysis and other processing on received data such as requests, and feed the processing result back to the terminal device 103, the terminal device 104, or the terminal device 105.
The terminal device 103, the terminal device 104, and the terminal device 105 may be, but are not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a wearable smart device, a virtual reality device, an augmented reality device, and the like.
In the embodiment of the disclosure, for example, a video producer may upload a video to be processed and text information of the video to be processed through the terminal device 103; the server 101 may obtain the video to be processed and the text information of the video to be processed, extract a plurality of candidate knowledge texts from the video to be processed and the text information, obtain video features of the video to be processed and text features of each candidate knowledge text, determine the similarity between the video features and each text feature through a first model, determine a plurality of target candidate knowledge texts from the plurality of candidate knowledge texts according to the similarity, determine the matching degree between each target candidate knowledge text and the video content of the video to be processed through a second model, and determine the target knowledge text from the plurality of target candidate knowledge texts according to the matching degree. The server 101 may send the video to be processed and its target knowledge text to the terminal device 104 and the terminal device 105 of video viewers, who may learn about the video to be processed in advance through the target knowledge text of the video to be processed and choose whether to view the video to be processed.
It should be understood that the numbers of the terminal device 103, the terminal device 104, the terminal device 105, the network 102, and the server 101 in fig. 1 are only illustrative; the server 101 may be a single physical server, a server cluster formed by a plurality of servers, or a cloud server, and there may be any number of terminal devices, networks, and servers according to actual needs.
The steps of the knowledge extraction method in the exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings and embodiments. The method provided by the embodiments of the present disclosure may be performed by any electronic device, for example, the server and/or the terminal device in fig. 1 described above, but the present disclosure is not limited thereto.
FIG. 2 is a flow chart illustrating a knowledge extraction method, according to an example embodiment.
As shown in fig. 2, the method provided by the embodiments of the present disclosure may include the following steps.
In step S210, a video to be processed and text information of the video to be processed are acquired.
The video to be processed may be any video uploaded by a user; for example, it may be a short video recorded by the user, a video produced by the user through an application program such as editing software, or a movie video, a television series video, or a variety show video on a network, which is not particularly limited in this disclosure.
For example, the video to be processed may be a knowledge-class short video on a short-video platform, where a knowledge-class short video refers to a short video for explaining knowledge; unlike an entertainment-class short video, a knowledge-class short video focuses on aspects such as life skills, popular science knowledge, learning skills, and experience sharing. Users can acquire knowledge by watching knowledge-class short videos.
The text information of the video to be processed may be information such as a video title, a video subtitle, a video description text, etc. edited by a user for the video to be processed, or information such as a video title, a video subtitle, a video description text, etc. generated by the server according to automatic identification of video frames of the video to be processed.
In the embodiment of the disclosure, the server may automatically obtain a plurality of to-be-processed videos in batches and text information of each to-be-processed video, and the processing procedure of the to-be-processed video is described below by taking one to-be-processed video as an example, but the disclosure is not limited thereto.
In step S220, a plurality of candidate knowledge texts are extracted from the video to be processed and the text information of the video to be processed.
In the embodiment of the disclosure, a plurality of candidate knowledge texts can be extracted from the video to be processed, and a plurality of candidate knowledge texts can be extracted from text information of the video to be processed.
The knowledge text refers to text that can represent the knowledge content of the video to be processed, and the candidate knowledge text refers to text that is preliminarily determined as a candidate to become the target knowledge text of the video to be processed.
In the embodiment of the disclosure, the knowledge text may include a term and a knowledge point of the term (i.e., a "term-knowledge point" pair), where the term may be a concept (e.g., plant) and/or an entity (e.g., rose) representing the knowledge content conveyed by the video to be processed, and the knowledge point may refer to a knowledge point of a certain aspect corresponding to the concept or entity; for example, the knowledge point corresponding to the term "plant" is "history" (i.e., "plant-history"), and the knowledge point corresponding to the term "rose" is "cuttage" (i.e., "rose-cuttage"). The terms and knowledge points can be in one-to-one correspondence, and a term and the knowledge point corresponding to the term can form a term-knowledge point pair.
In the embodiment of the disclosure, the candidate knowledge text may include candidate terms and candidate knowledge points of the candidate terms (i.e., "candidate term-candidate knowledge point" pairs), where a candidate term refers to a term preliminarily determined as a candidate to become a target term of the video to be processed, and a candidate knowledge point refers to a knowledge point preliminarily determined as a candidate to become a target knowledge point of the video to be processed.
In the embodiment of the disclosure, a plurality of candidate terms and candidate knowledge points corresponding to the candidate terms are extracted from the video to be processed and the text information of the video to be processed. For example, a plurality of candidate knowledge texts can be extracted from a video to be processed and text information describing "the nutritional value of apples", where one candidate term is, for example, "apple", and the candidate knowledge point corresponding to the candidate term is "nutritional value".
In an exemplary embodiment, extracting a plurality of candidate knowledge texts from text information of a video to be processed comprises extracting a plurality of candidate entries from the text information of the video to be processed, and obtaining candidate knowledge points of each candidate entry according to the position of each candidate entry in the text information.
Specifically, a keyword recall mode may be used to extract a plurality of candidate terms from the text information of the video to be processed, for example, keywords mentioned in the text information are obtained as candidate terms through TF-IDF (Term Frequency-Inverse Document Frequency) or TextRank (a text ranking algorithm); alternatively, a Hashtag (topic label) recall mode may be used to extract a plurality of candidate terms from the text information of the video to be processed, for example, an entity is obtained from the topic label (Hashtag) of the video as a candidate term.
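By way of a non-limiting illustration, the keyword recall mode may be sketched as scoring words of the text information by TF-IDF over a batch of videos; the tokenizer, the toy corpus, and the top-k value in the sketch below are assumptions made for this sketch only.

```python
# Hypothetical sketch of TF-IDF keyword recall for candidate terms.
# Assumes whitespace-tokenizable text; a real system would plug in its own tokenizer.
from sklearn.feature_extraction.text import TfidfVectorizer

def recall_candidate_terms(video_texts, top_k=5):
    """Return the top_k highest-TF-IDF words of each video's text information."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(video_texts)      # shape: (num_videos, vocab_size)
    vocab = vectorizer.get_feature_names_out()
    candidates = []
    for row in tfidf:                                   # one row per video
        scores = row.toarray().ravel()
        top_idx = scores.argsort()[::-1][:top_k]
        candidates.append([vocab[i] for i in top_idx if scores[i] > 0])
    return candidates

# Example: text information of two videos to be processed (illustrative only).
texts = ["apple nutrition value vitamins apple fiber",
         "rose cuttage propagation rose soil watering"]
print(recall_candidate_terms(texts, top_k=3))
```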
Specifically, a plurality of candidate entries can be extracted from the text information of the video to be processed, and then candidate knowledge points corresponding to the candidate entries can be obtained in a keyword recall mode; for example, the sentence in which a candidate entry is located, together with its previous sentence and next sentence, is selected from the text information as a candidate text, and keywords of the candidate text are selected as the candidate knowledge points corresponding to the candidate entry. Alternatively, the candidate knowledge points corresponding to the candidate entries can be obtained by means of a longest common subsequence; for example, the sentences in which a candidate entry is located are selected, and the longest common subsequence of the word segmentation results of these sentences is calculated as the candidate knowledge point corresponding to the candidate entry, as sketched below.
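A minimal sketch of the longest-common-subsequence recall follows; whitespace word segmentation and the helper names are assumptions of the sketch, not requirements of the embodiment.

```python
# Hypothetical sketch: derive a candidate knowledge point for a candidate entry as
# the longest common subsequence of the word-segmentation results of the sentences
# in which the entry appears.
def longest_common_subsequence(a, b):
    """Classic dynamic-programming LCS over two token lists."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack to recover the subsequence itself.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return list(reversed(out))

def candidate_knowledge_point(entry, sentences):
    """Take the sentences containing the entry and return the LCS of their tokens."""
    hits = [s.split() for s in sentences if entry in s]   # whitespace segmentation is an assumption
    if not hits:
        return []
    lcs = hits[0]
    for tokens in hits[1:]:
        lcs = longest_common_subsequence(lcs, tokens)
    return lcs

sentences = ["the nutrition value of apple is high",
             "apple has great nutrition value for breakfast"]
print(candidate_knowledge_point("apple", sentences))   # ['nutrition', 'value']
```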
In the embodiment of the disclosure, a sequence annotation model may be trained in advance, so that a plurality of candidate knowledge texts can be automatically extracted from the video to be processed and the text information of the video to be processed by using the trained sequence annotation model.
Specifically, a batch of training videos (e.g., knowledge-class short videos) and modal information such as training texts corresponding to the training videos may be obtained from a short video platform, together with training knowledge texts corresponding to the training videos (e.g., training vocabulary entries and training knowledge points). The training knowledge texts can be obtained by manually annotating the training videos. The sequence annotation model is trained with the training texts and the training knowledge texts corresponding to the training videos to obtain a trained sequence annotation model.
Specifically, the text information of the video to be processed may include a caption (description) text of the video, an OCR (Optical Character Recognition) text, an ASR (Automatic Speech Recognition) text, etc., and the vocabulary entries in the text information may be automatically recognized as candidate vocabulary entries through the trained sequence annotation model. Similarly, the knowledge points corresponding to the vocabulary entries can be automatically identified through the trained sequence annotation model and taken as candidate knowledge points.
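For illustration, applying such a sequence annotation model might look like the sketch below; the checkpoint name and the BIO label set {O, B-TERM, I-TERM, B-KP, I-KP} are assumptions, and the classification head in this sketch is freshly initialized, so it would have to be fine-tuned on the manually annotated training data before its outputs become meaningful.

```python
# Hypothetical sketch of a sequence annotation (BIO tagging) model: a BERT token
# classifier over the caption/OCR/ASR text.  Checkpoint and label set are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-TERM", "I-TERM", "B-KP", "I-KP"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForTokenClassification.from_pretrained("bert-base-chinese",
                                                        num_labels=len(LABELS))

@torch.no_grad()
def annotate(text):
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    pred = model(**enc).logits.argmax(-1)[0].tolist()        # one label index per token
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return [(tok, LABELS[p]) for tok, p in zip(tokens, pred)]

# Tokens tagged *-TERM become candidate entries; tokens tagged *-KP become
# candidate knowledge points (meaningful only after fine-tuning).
print(annotate("the nutrition value of apple")[:5])
```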
In the embodiment of the disclosure, after the plurality of candidate knowledge texts are extracted, additional candidate knowledge texts can be recalled by means of keyword mining and named entity recognition, so that the obtained candidate knowledge texts are richer.
In step S230, video features of the video to be processed and text features of each candidate knowledge text are acquired.
In the embodiment of the disclosure, the video to be processed can be subjected to feature extraction to obtain the video features of the video to be processed, and the feature extraction can be performed on each candidate knowledge text to obtain the text features of each candidate knowledge text.
Step S230 may be performed by the first model or may be performed by another neural network model.
In an exemplary embodiment, obtaining video features of a video to be processed includes performing encoding processing on each video frame in the video to be processed to obtain video frame features of each video frame of the video to be processed, performing encoding processing on text information of the video to be processed to obtain video text features of the video to be processed, and determining the video frame features and the video text features of each video frame as video features of the video to be processed.
Specifically, the ith video to be processed may be represented by V(i) = [v(i), t(i)], where v(i) represents the video frames of the video to be processed and t(i) represents the video text information of the video to be processed; a candidate term corresponding to the ith video to be processed may be represented by T(i), and the candidate knowledge point corresponding to the ith video to be processed may be represented by C(i).
In particular, the video frames v(i) may be represented as a sequence of picture frames [v1(i), ..., vK1(i)]; each picture frame is encoded by a ResNet to obtain K1 picture frame vectors, and the K1 picture frame vectors are averaged to obtain the video frame feature Vv(i) of the video to be processed.
In particular, the video text information t(i) of the video to be processed can be represented as a character sequence [t1(i), ..., tK2(i)], which is encoded by BERT (Bidirectional Encoder Representations from Transformers) to obtain K2 character vectors; the K2 character vectors are averaged to obtain the video text feature Vt(i) of the video to be processed, and the video frame feature Vv(i) and the video text feature Vt(i) can be combined to obtain the video feature V(i) = [Vv(i), Vt(i)].
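As a non-limiting illustration, this multi-modal feature extraction might be sketched as follows; the checkpoint names (resnet50, bert-base-chinese), the frame count, and the concatenation at the end are assumptions of the sketch rather than requirements of the embodiment.

```python
# Hypothetical sketch of obtaining V(i) = [Vv(i), Vt(i)]: average ResNet frame vectors
# for the video frame feature, average BERT token vectors for the video text feature,
# then concatenate.  Checkpoint names and dimensions are assumptions.
import torch
from torchvision.models import resnet50
from transformers import BertModel, BertTokenizer

resnet = resnet50(weights="IMAGENET1K_V1")
resnet.fc = torch.nn.Identity()                 # keep the 2048-d pooled frame vector
resnet.eval()

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

@torch.no_grad()
def video_feature(frames, text):
    """frames: tensor (K1, 3, 224, 224); text: the video's text information t(i)."""
    frame_vecs = resnet(frames)                      # (K1, 2048)
    v_frame = frame_vecs.mean(dim=0)                 # Vv(i)
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    char_vecs = bert(**enc).last_hidden_state[0]     # (K2, 768)
    v_text = char_vecs.mean(dim=0)                   # Vt(i)
    return torch.cat([v_frame, v_text])              # V(i)

feat = video_feature(torch.rand(8, 3, 224, 224), "how to grow roses by cuttage")
print(feat.shape)   # torch.Size([2816])
```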
In an exemplary embodiment, the candidate knowledge texts comprise candidate entries and candidate knowledge points of the candidate entries, and the obtaining of text features of each candidate knowledge text comprises the steps of performing splicing processing on the candidate entries of the candidate knowledge texts and the knowledge points of the candidate entries to obtain spliced knowledge texts, and performing encoding processing on the spliced knowledge texts to obtain text features of the candidate knowledge texts.
Specifically, each candidate term and the knowledge point of each candidate term may be spliced, for example, into the BERT input form "[CLS] T(i) [SEP] C(i)", which is input into a pre-trained BERT model, and the representation vector at the [CLS] position is taken as the text feature of the candidate knowledge text (i.e., the candidate term-candidate knowledge point representation).
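A corresponding sketch of computing this candidate term-candidate knowledge point representation is given below; the pre-trained checkpoint is again an assumption made only for illustration.

```python
# Hypothetical sketch: splice a candidate term T(i) and its candidate knowledge point
# C(i) into "[CLS] T(i) [SEP] C(i) [SEP]" and take the [CLS] vector as the text feature
# of the candidate knowledge text.  The checkpoint is an assumption.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

@torch.no_grad()
def knowledge_text_feature(term, knowledge_point):
    # text_pair makes the tokenizer build "[CLS] term [SEP] knowledge_point [SEP]"
    enc = tokenizer(term, text_pair=knowledge_point, return_tensors="pt")
    cls_vec = bert(**enc).last_hidden_state[0, 0]    # representation at the [CLS] position
    return cls_vec                                   # (768,)

print(knowledge_text_feature("apple", "nutritional value").shape)
```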
In step S240, a similarity between the video feature and each text feature is determined by the first model, and a plurality of target candidate knowledge texts are determined from the plurality of candidate knowledge texts according to the similarity.
In general, after a plurality of candidate knowledge texts have been extracted from the video to be processed and the text information of the video to be processed and their features have been obtained (i.e., after step S230), the candidate knowledge texts can be filtered through the first model, and target candidate knowledge texts can be screened out from the candidate knowledge texts.
In the embodiment of the disclosure, the first model may be a fast coarse-ranking model for screening a plurality of target candidate knowledge texts from a plurality of candidate knowledge texts, where the number of candidate knowledge texts is greater than the number of target candidate knowledge texts; for example, the number of candidate knowledge texts is 200, and the number of target candidate knowledge texts is 20.
In the embodiment of the disclosure, the first model may be a multi-modal feature similarity matching model based on metric learning; only the video feature and each text feature (for example, the features of a plurality of entry-knowledge point pairs) need to be acquired, and coarse ranking can then be performed according to the similarity between the video feature and each text feature.
For example, the text features of 200 candidate knowledge texts of the video to be processed are obtained through the above steps, and the similarity between the video feature V(i) of the video to be processed and each of the 200 text features can then be determined by the first model.
In the embodiment of the disclosure, a similarity threshold may be set, and candidate knowledge texts whose similarity is greater than the similarity threshold are determined to be target candidate knowledge texts; alternatively, a preset number may be set, and the preset number of target candidate knowledge texts may be determined from the plurality of candidate knowledge texts; or a preset proportion may be set, and a preset proportion of the candidate knowledge texts may be determined to be target candidate knowledge texts, as sketched below.
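Any of these screening strategies amounts to a simple ranking over the similarities. The following sketch assumes cosine similarity over features that have already been projected into a shared dimension, and the numbers 200 and 20 are merely illustrative.

```python
# Hypothetical sketch of the coarse-ranking step: score every candidate knowledge text
# against the video feature and keep the top-N (or those above a threshold).
import torch
import torch.nn.functional as F

def coarse_rank(video_feat, text_feats, top_n=20, threshold=None):
    """video_feat: (D,); text_feats: (num_candidates, D).
    Assumes both features were projected into a shared D-dim space beforehand."""
    sims = F.cosine_similarity(video_feat.unsqueeze(0), text_feats, dim=-1)
    if threshold is not None:
        keep = (sims > threshold).nonzero(as_tuple=True)[0]
    else:
        keep = sims.argsort(descending=True)[:top_n]
    return keep.tolist(), sims[keep].tolist()

video_feat = torch.randn(768)
text_feats = torch.randn(200, 768)          # e.g. 200 candidate knowledge texts
idx, scores = coarse_rank(video_feat, text_feats, top_n=20)
print(len(idx))                             # 20 target candidate knowledge texts remain
```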
The target candidate knowledge text may include target candidate entries and target candidate knowledge points of the target candidate entries; the target candidate entries and the target candidate knowledge points of the target candidate entries may be determined from the candidate entries and the candidate knowledge points of the candidate entries according to the similarity between the video feature and the text feature of each candidate entry and its candidate knowledge point.
In step S250, a degree of matching between each target candidate knowledge text and video content of the video to be processed is determined by the second model, and the target knowledge text is determined from the plurality of target candidate knowledge texts according to the degree of matching.
In the embodiment of the disclosure, the target candidate knowledge text is obtained by screening from the candidate knowledge text according to the similarity by the first model, so that the matching efficiency and recall rate of the video and the knowledge text can be improved.
When the semantic information of the video is taken into account, a candidate knowledge text having a high similarity to the video to be processed is not necessarily the knowledge text that best expresses the content of the video to be processed.
For example, the video to be processed is a video about the development history of the automobile, and the candidate knowledge texts include "automobile model"; although the similarity between the video feature of this video and the text feature of "automobile model" may be high, the matching degree between the video content and "automobile model" is not high, and if "automobile model" were used as the target knowledge text of the video, the content conveyed by the video could not be expressed very accurately.
Therefore, the embodiment of the disclosure further screens, through the second model (a fine-ranking model), the target candidate knowledge texts screened out by the first model, and selects from them the target knowledge text that better matches the video content of the video to be processed.
In the embodiment of the disclosure, the matching degree of each target candidate knowledge text and the video content of the video to be processed can be determined through the trained second model.
Specifically, whether a training video and the training knowledge text corresponding to the training video match can be used as a label, and the training video and its corresponding training knowledge text are input into the second model to be trained so as to train the second model. Whether the video content of the training video matches the training knowledge text is determined by manual judgment.
In the embodiment of the present disclosure, the second model may be a BERT model based on a prompt (prompt) template, or may be another neural network model, which is not limited in this disclosure.
In an exemplary embodiment, the target candidate knowledge text comprises target candidate entries and target candidate knowledge points of the target candidate entries, the target knowledge text comprises target entries and target knowledge points of the target knowledge text, determining a matching degree between each target candidate knowledge text and video content of the video to be processed comprises determining a first matching degree between each target candidate entry and video content of the video to be processed, determining a second matching degree between each target candidate knowledge point and video content of the video to be processed, and determining the target knowledge text from the plurality of target candidate knowledge texts according to the matching degree comprises determining target entries and target knowledge points of the target entries from the plurality of target candidate entries and target candidate knowledge points of the target candidate entries according to the first matching degree and the second matching degree.
That is, a first matching degree between each target candidate entry and the video content of the video to be processed and a second matching degree between each target candidate knowledge point and the video content of the video to be processed are determined, so that the target entries and the target knowledge points of the target entries can be determined from the plurality of target candidate entries and their target candidate knowledge points according to the first matching degree and the second matching degree.
The target knowledge text is a knowledge text that is determined from the plurality of target candidate knowledge texts and used for describing the content of the video to be processed, and there may be one or more target knowledge texts. The target knowledge text may include target vocabulary entries and target knowledge points of the target vocabulary entries; for example, the entry "apple" and the knowledge point "nutritional value" may be used as a target vocabulary entry and its target knowledge point for a video that explains the nutritional value of apples.
In the embodiment of the disclosure, a matching degree threshold may be set, and target candidate knowledge texts whose matching degree is greater than the matching degree threshold are determined to be target knowledge texts; alternatively, a preset number of target knowledge texts may be set (the preset number of target knowledge texts being smaller than the number of target candidate knowledge texts), and the preset number of target knowledge texts may be determined from the plurality of target candidate knowledge texts; or a preset proportion may be set, and a preset proportion of the target candidate knowledge texts may be determined to be target knowledge texts, as illustrated in the sketch below.
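The fine-ranking selection can likewise be sketched as a simple filter over the matching degrees; the threshold value and the requirement that both the entry-level and knowledge-point-level matching degrees pass are assumptions of this sketch.

```python
# Hypothetical sketch of the fine-ranking selection: keep a target candidate knowledge
# text as a target knowledge text only if both the entry-level and the
# knowledge-point-level matching degrees with the video content are high enough.
def select_target_knowledge_texts(candidates, threshold=0.5, max_keep=None):
    """candidates: list of (entry, knowledge_point, match_entry, match_kp)."""
    kept = [c for c in candidates if c[2] > threshold and c[3] > threshold]
    kept.sort(key=lambda c: c[2] + c[3], reverse=True)     # rank by combined matching degree
    return kept[:max_keep] if max_keep else kept

cands = [("apple", "nutritional value", 0.92, 0.88),
         ("automobile", "model", 0.81, 0.20)]
print(select_target_knowledge_texts(cands))   # only ("apple", "nutritional value", ...) survives
```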
According to the knowledge extraction method provided by the embodiment of the disclosure, on one hand, a plurality of candidate knowledge texts are extracted from the video to be processed and the text information of the video to be processed, the similarity between the video feature of the video to be processed and the text feature of each candidate knowledge text is determined through the first model, and a plurality of target candidate knowledge texts are determined from the plurality of candidate knowledge texts according to the similarity, so that the multi-modal information of the video to be processed is comprehensively considered and the matching efficiency and recall rate of the video and the knowledge texts are improved; on the other hand, the matching degree between each target candidate knowledge text and the video content of the video to be processed is determined through the second model, and the target knowledge text is determined from the plurality of target candidate knowledge texts according to the matching degree, so that the matching between the semantic information of the video and the knowledge text is considered and the matching accuracy of the video and the knowledge text is improved. In addition, the method obtains candidate knowledge texts by extraction from the video to be processed and the text information of the video to be processed, without requiring knowledge texts (also called tags) to be defined in advance, so that the method can be applied both to knowledge extraction scenes in closed domains and to knowledge extraction scenes in open domains.
In addition, the knowledge extraction method provided by the embodiment of the disclosure can generate more conceptual labels related to videos for video creators, thereby attracting more traffic, and can represent the videos browsed by video consumers more conceptually, so that the representation of their interests is more accurate and more relevant videos can be recommended to video consumers.
FIG. 3 is a flow chart illustrating another knowledge extraction method, according to an example embodiment.
As shown in fig. 3, the first model in the knowledge extraction method shown in fig. 2 can be obtained by training in the following steps.
In step S310, a first training sample is obtained, where the first training sample includes a plurality of first training video text groups, and training videos in the first training video text groups correspond to training knowledge texts.
In the embodiment of the disclosure, a plurality of training videos and training knowledge texts corresponding to the training videos may be obtained, each training video and the training knowledge text corresponding to the training video are taken as positive examples (i.e., a first training sample), each training video and the training knowledge text not corresponding to the training video are taken as negative examples (i.e., a second training sample described below), and the first initial model is trained by using the first training sample and/or the second training sample, so that when input data is positive examples, an output similarity value is maximized, and when input data is negative examples, an output similarity value is minimized.
In a batch of training data, for example, there are N training videos and training knowledge texts corresponding to the training videos, then the N training videos and training knowledge texts corresponding to the training videos may be formed into N groups of positive examples, and the N training videos and other (N-1) training knowledge texts not corresponding to the N training videos may be formed into N (N-1) groups of negative examples.
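For illustration, the construction of the N positive and N(N-1) negative training video text groups from one batch can be sketched as follows; the pairing logic only is shown, not the actual training code.

```python
# Hypothetical sketch: build the positive and negative training video text groups of a
# batch from N training videos and their corresponding training knowledge texts.
def build_training_groups(videos, knowledge_texts):
    """videos[i] corresponds to knowledge_texts[i]; returns (positives, negatives)."""
    n = len(videos)
    positives = [(videos[i], knowledge_texts[i]) for i in range(n)]          # N groups
    negatives = [(videos[i], knowledge_texts[j])
                 for i in range(n) for j in range(n) if i != j]              # N(N-1) groups
    return positives, negatives

pos, neg = build_training_groups(["video_0", "video_1", "video_2"],
                                 ["knowledge_text_0", "knowledge_text_1", "knowledge_text_2"])
print(len(pos), len(neg))   # 3 6
```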
In the embodiment of the disclosure, the first initial model may be trained using the first training sample alone to obtain the first model (i.e., steps S310 to S330), the second initial model may be trained using the second training sample alone to obtain the first model (i.e., steps S340 to S360), or the first initial model may be trained sequentially (or alternately) using the first training sample and the second training sample. When the first initial model is trained sequentially (or alternately) using the first training sample and the second training sample, the first initial model may be trained with the first training sample and then with the second training sample, or trained with the second training sample and then with the first training sample, which is not limited in this disclosure.
The training knowledge text may include training vocabulary entries and training knowledge points of the training vocabulary entries.
Specifically, training data in the format of (training video, training vocabulary entry, training knowledge point) may be obtained. In the training process of the first initial model, the ith training video can be represented by V(i) = [v(i), t(i)], where v(i) represents the video frames of the training video and t(i) represents the video text information of the training video; the training vocabulary entry corresponding to the ith training video can be represented by T(i), and the training knowledge point corresponding to the ith training video can be represented by C(i). The training vocabulary entry T(i) can be expanded, for example, the subtitle, description information, category information, and the like of the entry can be obtained and added to T(i); data enhancement can be performed on the training knowledge point C(i) of the training vocabulary entry, for example, aliases, synonyms, and the like of the knowledge point can be selected and added to C(i).
In particular, the video frames v(i) may be represented as a sequence of picture frames [v1(i), ..., vK1(i)]; each picture frame is encoded by a ResNet to obtain K1 picture frame vectors, and the K1 picture frame vectors are averaged to obtain the video frame feature Vv(i) of the training video.
In particular, the video text information t(i) of the training video may be represented as a character sequence [t1(i), ..., tK2(i)], which is encoded by BERT to obtain K2 character vectors; the K2 character vectors are averaged to obtain the video text feature Vt(i) of the training video, and the video frame feature Vv(i) and the video text feature Vt(i) can be combined to obtain the video feature V(i) = [Vv(i), Vt(i)].
Specifically, each training vocabulary entry and the training knowledge point of each training vocabulary entry can be spliced, for example, into the BERT input form "[CLS] T(i) [SEP] C(i)", which is input into a pre-trained BERT model, and the representation vector at the [CLS] position is taken as the text feature of the training vocabulary entry and its training knowledge point (i.e., the training vocabulary entry-training knowledge point representation).
In step S320, the first initial model is trained by the first training video text set, and a first similarity between the training video and the training knowledge text in the first training video text set is output.
In this embodiment of the present disclosure, the first initial model may be a neural network model, where the training videos and training knowledge texts in the first training video text group are corresponding, the training videos and training knowledge texts in the first training video text group are input as positive examples into the first initial model to perform training, and a first similarity between the training videos and training knowledge texts in the first training video text group is output.
Specifically, the similarity between the ith training video and the ith training knowledge text can be calculated through the first initial model
In step S330, if the first similarity is smaller than or equal to the first preset value, the model parameters of the first initial model are adjusted, and the adjusted first initial model is trained again through the first training video text set until the first similarity output by the adjusted first initial model is greater than the first preset value, and the first model is determined according to the model parameters of the adjusted first initial model.
When the first training video text group is input into the first initial model, the similarity between the training video and the training knowledge text in the first training video text group is obtained as the first similarity through calculation by the first initial model; the model parameters of the first initial model are adjusted so as to maximize the first similarity output by the adjusted first initial model, and the adjusted first initial model is taken as the first model.
For example, the first preset value may be used as a label, and if the first similarity is smaller than or equal to the first preset value, the model parameters of the first initial model are adjusted, so that the first similarity output by the adjusted first initial model is larger than the first preset value.
The first preset value may be a value close to the right end point of the similarity interval, for example, the similarity interval is [0,1], and the first preset value may be set to 0.9, for example.
In an exemplary embodiment, the first model in the knowledge extraction method shown in fig. 2 may also be trained by the following steps.
In step S340, a second training sample is obtained, where the second training sample includes a plurality of second training video text groups, and training videos in the second training video text groups do not correspond to training knowledge texts.
In the embodiment of the disclosure, a plurality of training videos and training knowledge texts corresponding to the training videos may be acquired, the training knowledge texts not corresponding to the training videos and the training videos are taken as negative examples (i.e., the second training samples), and the second initial model is trained by using the second training samples, so that when input data is negative examples, the output similarity value is minimized.
In a batch of training data, for example, there are N training videos and training knowledge texts corresponding to the training videos, and then N training videos and other (N-1) training knowledge texts not corresponding to the N training videos can be formed into N (N-1) negative examples.
In step S350, the second initial model is trained by the second training video text set, and a second similarity between the training video and the training knowledge text in the second training video text set is output.
In this embodiment of the present disclosure, the second initial model may be a neural network model, where the training videos and training knowledge texts in the second training video text group do not correspond to each other; the training videos and training knowledge texts in the second training video text group are input as negative examples into the second initial model for training, and the second similarity between the training video and the training knowledge text in the second training video text group is output.
Specifically, the similarity between the ith training video and the jth training knowledge text, where i ≠ j, can be calculated through the second initial model.
It should be noted that in the embodiment of the present disclosure, the second training sample may be used alone to train the second initial model to obtain the first model (i.e., steps S340 to S360), or the first training sample and the second training sample may be used to train the first initial model sequentially (or alternately); when the first training sample and the second training sample are used to train the first initial model sequentially (or alternately), the second initial model in steps S340 to S360 is the first initial model.
In step S360, if the second similarity is greater than or equal to the second preset value, the model parameters of the second initial model are adjusted, and the adjusted second initial model is trained again through the second training video text set until the second similarity output by the adjusted second initial model is less than the second preset value, and the first model is determined according to the adjusted model parameters of the second initial model.
In the embodiment of the disclosure, when the second training video text group is input into the second initial model, the similarity between the training video and the training knowledge text in the second training video text group is obtained as the second similarity through calculation by the second initial model; the model parameters of the second initial model are adjusted so as to minimize the second similarity output by the adjusted second initial model, and the adjusted second initial model is taken as the first model.
For example, the second preset value may be used as a label, and if the second similarity is greater than or equal to the second preset value, the model parameters of the second initial model are adjusted, so that the second similarity output by the adjusted second initial model is smaller than the second preset value.
The second preset value may be a value close to the left end point of the similarity interval, for example, the similarity interval is [0,1], and the second preset value may be set to 0.1, for example.
It should be noted that the present disclosure does not limit the order of the step of inputting the positive examples into the first initial model for training (i.e., steps S310 to S330) and the step of inputting the negative examples into the second initial model for training (i.e., steps S340 to S360); the positive examples may be input into the first initial model for training and then the negative examples may be input into the second initial model for training, the negative examples may be input into the second initial model for training and then the positive examples may be input into the first initial model for training, or the training may be performed alternately.
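One possible way to realize the parameter adjustment described in steps S310 to S360 is sketched below; the projection layers, the cosine scoring, and the binary cross-entropy objective are assumptions made for this sketch and are not asserted to be the exact training procedure of the disclosure.

```python
# Hypothetical training-step sketch for the first model: push the similarity of
# corresponding (positive) video/knowledge-text groups toward 1 and of non-corresponding
# (negative) groups toward 0.  Scoring function, dimensions, and loss are assumptions.
import torch
import torch.nn.functional as F

class SimilarityModel(torch.nn.Module):
    """Stand-in for the first initial model: projects video and text features into a
    shared space and scores them with a cosine similarity mapped to [0, 1]."""
    def __init__(self, video_dim=2816, text_dim=768, hidden=256):
        super().__init__()
        self.video_proj = torch.nn.Linear(video_dim, hidden)
        self.text_proj = torch.nn.Linear(text_dim, hidden)

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return (F.cosine_similarity(v, t, dim=-1) + 1) / 2    # map [-1, 1] to [0, 1]

model = SimilarityModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(video_feats, text_feats, labels):
    """labels: 1.0 for positive groups, 0.0 for negative groups."""
    sims = model(video_feats, text_feats)
    loss = F.binary_cross_entropy(sims, labels)    # drives positives up, negatives down
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item(), sims.detach()

# Toy batch: 2 positive groups followed by 2 negative groups (random features).
v = torch.randn(4, 2816); t = torch.randn(4, 768)
labels = torch.tensor([1.0, 1.0, 0.0, 0.0])
print(train_step(v, t, labels)[0])
```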
According to the knowledge extraction method, the first initial model is trained by inputting the corresponding training video and training knowledge texts, so that the first similarity determined by the first initial model is maximized, the second initial model is trained by inputting the non-corresponding training video and training knowledge texts, so that the second similarity determined by the second initial model is minimized, and the similarity between the video to be processed and the candidate knowledge texts can be determined more accurately through the first model obtained through the training process, and meanwhile, the matching efficiency of the video to be processed and the candidate knowledge texts is improved.
FIG. 4 is a flowchart illustrating another knowledge extraction method, according to an example embodiment.
As shown in fig. 4, the second model in the knowledge extraction method shown in fig. 2 can be trained by the following steps.
In step S410, a third training sample is obtained, where the third training sample includes a plurality of third training video text groups and matching degree labels between training videos and training knowledge texts in the third training video text groups.
The training knowledge text comprises training vocabulary entries and training knowledge points of the training vocabulary entries.
In the embodiment of the present disclosure, the training video and training knowledge text in the third training sample, and the training video and training knowledge text in the first training sample and the second training sample may be the same batch of training video and training knowledge text, or may be different training video and training knowledge text, which is not limited in this disclosure.
In the embodiment of the disclosure, training data in the format of (training video, training vocabulary entry, training knowledge point) may be acquired. The training video, training vocabulary entry, and training knowledge point in each group of training data here may or may not match, similarly to the first training video text groups and the second training video text groups of the embodiment of fig. 3.
In an embodiment of the disclosure, each third training video text group may include a training video and a training knowledge text, and the matching degree label of the training video and the training knowledge text in each third training video text group may be obtained through manual labeling.
Specifically, the input data of the third initial model may be expressed as [V(i), T(i), C(i), Y(i)], where Y(i) represents the matching degree label of the training video and the training knowledge text (i.e., the training vocabulary entry-training knowledge point pair).
In an exemplary embodiment, the match-level tags include a first match-level tag between the training video and the training vocabulary entry, and a second match-level tag between the training video and the training knowledge point.
For example, Y(i) = [ya(i), yb(i)], where ya(i) represents a first matching degree label between the training video and the training vocabulary entry, and yb(i) represents a second matching degree label between the training video and the training knowledge point.
In step S420, the third initial model is trained by the third training video text set, and the prediction matching degree between the training video and the training knowledge text in the third training video text set is output.
In an exemplary embodiment, the predicted match level includes a first predicted match level between the training video and the training vocabulary entry, and a second predicted match level between the training video and the training knowledge point.
That is, the training video and the training vocabulary entry are input into the third initial model, and the first prediction matching degree between the training video and the training vocabulary entry is output; the training video and the training knowledge point are input into the third initial model, and the second prediction matching degree between the training video and the training knowledge point is output.
In step S430, if the predicted matching degree and the matching degree label are inconsistent, the model parameters of the third initial model are adjusted, and the adjusted third initial model is trained again through the third training video text set until the predicted matching degree and the matching degree label output by the adjusted third initial model are consistent, and the second model is determined according to the adjusted model parameters of the third initial model.
In the embodiment of the disclosure, a training video and a training knowledge text can be input into the third initial model to obtain the prediction matching degree between the training video and the training knowledge text, with the matching degree label between the training video and the training knowledge text used as the supervision label; the parameters of the third initial model are adjusted according to the prediction matching degree and its corresponding matching degree label, so that the prediction matching degree output by the adjusted third initial model is consistent with (or close to) the corresponding matching degree label, and the adjusted third initial model is taken as the second model.
In an exemplary embodiment, training a third initial model through a third training video text group, outputting a prediction matching degree between training videos and training knowledge texts in the third training video text group, wherein the method comprises the steps of obtaining a prompt template, encoding the training videos in the third training video text group, filling the encoded training videos into the first to-be-filled part, filling the training knowledge texts in the third training video text group into the text part, filling a matching degree label into the second to-be-filled part to generate prompt data, and processing the prompt data through the third initial model to obtain the prediction matching degree between the training videos and the training knowledge texts in the third training video text group.
For example, for input data, a prompt input template is constructed:
[CLS] [MASK] t(i) [SEP] T(i) [SEP] C(i) [SEP] => [MASK] [MASK]
The first [MASK] on the left is the first to-be-filled part, and the two [MASK] tokens on the right are the second to-be-filled parts, which can be respectively filled with the matching degree label between the training video and the training vocabulary entry and the matching degree label between the training video and the training knowledge point. t(i), T(i), and C(i) constitute the text part, where t(i) represents the video text of the training video, T(i) represents the text of the training vocabulary entry, and C(i) represents the text of the training knowledge point.
Similar to the multi-modal representation of the first-model training phase, the video frames v(i) may be encoded by ResNet, and the result is substituted for the representation of the first [MASK]; the entire sample (i.e., the prompt data) is then input into the BERT encoder to predict the values of the last two [MASK] positions, which respectively represent whether the training video matches the training vocabulary entry and whether the training video matches the training knowledge point.
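A sketch of this prompt-based matching is given below; the checkpoint, the projection from the ResNet vector into the BERT embedding space, and the linear matching head are assumptions, and both the projection and the head would of course have to be trained as described above before the outputs are meaningful.

```python
# Hypothetical sketch of the second model: build the prompt
# "[CLS] [MASK] t(i) [SEP] T(i) [SEP] C(i) [SEP] [MASK] [MASK]", replace the first
# [MASK] embedding with a projected video vector, run BERT, and read the last two
# [MASK] positions as "video-entry match" and "video-knowledge-point match" scores.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
video_proj = torch.nn.Linear(2048, bert.config.hidden_size)   # ResNet vector -> BERT embedding space
match_head = torch.nn.Linear(bert.config.hidden_size, 2)      # match / no-match per [MASK]

def prompt_match(video_vec, video_text, entry, knowledge_point):
    prompt = f"[MASK] {video_text} [SEP] {entry} [SEP] {knowledge_point} [SEP] [MASK] [MASK]"
    enc = tokenizer(prompt, return_tensors="pt")                # adds [CLS] ... [SEP]
    embeds = bert.get_input_embeddings()(enc["input_ids"]).clone()   # (1, L, H)
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    embeds[0, mask_pos[0]] = video_proj(video_vec)              # substitute the video representation
    out = bert(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).last_hidden_state
    entry_logits = match_head(out[0, mask_pos[-2]])             # does the video match the entry?
    kp_logits = match_head(out[0, mask_pos[-1]])                # does the video match the knowledge point?
    return entry_logits.softmax(-1)[1], kp_logits.softmax(-1)[1]   # matching degrees in [0, 1]

m1, m2 = prompt_match(torch.randn(2048), "history of the automobile",
                      "automobile", "development history")
print(float(m1), float(m2))
```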
According to the knowledge extraction method, the training video and the training knowledge text are input into the third initial model to be trained, so that the predicted matching degree between the training video and the training knowledge text determined through the third initial model is consistent with (or similar to) the matching degree label corresponding to the training video and the training knowledge text, and the target knowledge text with higher matching degree with the video content of the video to be processed can be more accurately determined through the second model obtained through the training process, so that the matching accuracy of the video to be processed and the target knowledge text is improved.
It should also be understood that the above is only intended to assist those skilled in the art in better understanding the embodiments of the present disclosure, and is not intended to limit the scope of the embodiments of the present disclosure. It will be apparent to those skilled in the art from the foregoing examples that various equivalent modifications or variations can be made; for example, some steps of the methods described above may not be necessary, some steps may be newly added, or any two or more of the above embodiments may be combined. Such modifications, variations, or combinations are also within the scope of the embodiments of the present disclosure.
It should also be understood that the foregoing description of the embodiments of the present disclosure focuses on highlighting differences between the various embodiments and that the same or similar elements not mentioned may be referred to each other and are not repeated here for brevity.
It should also be understood that the sequence numbers of the above processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present disclosure.
It is also to be understood that in the various embodiments of the disclosure, terms and/or descriptions of the various embodiments are consistent and may be referenced to one another in the absence of a particular explanation or logic conflict, and that the features of the various embodiments may be combined to form new embodiments in accordance with their inherent logic relationships.
Examples of knowledge extraction methods provided by the present disclosure are described in detail above. It will be appreciated that the computer device, in order to carry out the functions described above, comprises corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
FIG. 5 is a block diagram illustrating a knowledge extraction apparatus, according to an example embodiment. Referring to FIG. 5, the knowledge extraction apparatus 500 may include an acquisition module 510, an extraction module 520, and a determination module 530.
Wherein the acquisition module 510 is configured to perform acquiring a video to be processed and text information of the video to be processed, the extraction module 520 is configured to perform extracting a plurality of candidate knowledge texts from the video to be processed and the text information thereof, the acquisition module 510 is further configured to perform acquiring video features of the video to be processed and text features of respective candidate knowledge texts, the determination module 530 is configured to perform determining a similarity between the video features and the respective text features by a first model and determining a plurality of target candidate knowledge texts from the plurality of candidate knowledge texts according to the similarity, the determination module 530 is further configured to perform determining a degree of matching between the respective target candidate knowledge texts and video content of the video to be processed by a second model and determining a target knowledge text from the plurality of target candidate knowledge texts according to the degree of matching.
In some exemplary embodiments of the present disclosure, the obtaining module 510 is further configured to perform encoding processing on each video frame in the to-be-processed video to obtain a video frame feature of each video frame of the to-be-processed video, encoding processing on text information of the to-be-processed video to obtain a video text feature of the to-be-processed video, and determining the video frame feature of each video frame and the video text feature as the video feature of the to-be-processed video.
In some exemplary embodiments of the present disclosure, the candidate knowledge text includes candidate terms and candidate knowledge points of the candidate terms, the obtaining module 510 is further configured to perform, for each candidate knowledge text, a stitching process on the candidate terms and the knowledge points of the candidate terms of the candidate knowledge text to obtain a stitched knowledge text, and an encoding process on the stitched knowledge text to obtain text features of the candidate knowledge text.
In some exemplary embodiments of the present disclosure, the first model is obtained by obtaining a first training sample, where the first training sample includes a plurality of first training video text groups, training a first initial model through the first training video text groups, and outputting a first similarity between the training video and the training knowledge text in the first training video text groups, if the first similarity is less than or equal to a first preset value, adjusting model parameters of the first initial model, and training the adjusted first initial model again through the first training video text groups until the first similarity output by the adjusted first initial model is greater than the first preset value, and determining the first model according to the model parameters of the adjusted first initial model.
In some exemplary embodiments of the present disclosure, the first model is trained by obtaining a second training sample, where the second training sample includes a plurality of second training video text groups, training a second initial model through the second training video text groups, outputting a second similarity between the training video and the training knowledge text in the second training video text groups, if the second similarity is greater than or equal to a second preset value, adjusting model parameters of the second initial model, and training the adjusted second initial model again through the second training video text groups until the second similarity output by the adjusted second initial model is less than the second preset value, and determining the first model according to the model parameters of the adjusted second initial model.
In some exemplary embodiments of the present disclosure, the target candidate knowledge text includes target candidate entries and target candidate knowledge points of the target candidate entries, the target knowledge text includes target entries and target knowledge points of the target knowledge text, the determining module 530 is further configured to perform determining a first degree of matching between each target candidate entry and video content of the video to be processed, determining a second degree of matching between each target candidate knowledge point and video content of the video to be processed, and the determining module 530 is further configured to perform determining target knowledge points of target entries and target entries from among the plurality of target candidate entries and target candidate knowledge points of the target candidate entries according to the first degree of matching and the second degree of matching.
In some exemplary embodiments of the present disclosure, the second model is obtained by obtaining a third training sample, where the third training sample includes a plurality of third training video text groups and matching degree labels between training videos and training knowledge texts in the third training video text groups, training a third initial model through the third training video text groups, outputting a predicted matching degree between training videos and training knowledge texts in the third training video text groups, if the predicted matching degree is inconsistent with the matching degree labels, adjusting model parameters of the third initial model, and training the adjusted third initial model again through the third training video text groups until the predicted matching degree outputted by the adjusted third initial model is consistent with the matching degree labels, and determining the second model according to model parameters of the adjusted third initial model.
In some exemplary embodiments of the present disclosure, the training knowledge text includes a training vocabulary item and training knowledge points of the training vocabulary item, the matching degree label includes a first matching degree label between the training video and the training vocabulary item, and a second matching degree label between the training video and the training knowledge points, and the predicted matching degree includes a first predicted matching degree between the training video and the training vocabulary item, and a second predicted matching degree between the training video and the training knowledge points.
In some exemplary embodiments of the present disclosure, the candidate knowledge text includes candidate terms and candidate knowledge points of the candidate terms, the extraction module 520 is further configured to perform extraction of a plurality of candidate terms from text information of the video to be processed, and obtain the candidate knowledge points of the candidate terms according to the locations of the candidate terms in the text information.
It should be noted that the block diagrams shown in the above figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor terminals and/or microcontroller terminals.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
An electronic device 600 according to such an embodiment of the present disclosure is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is merely an example and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in fig. 6, the electronic device 600 is in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to, at least one processing unit 610, at least one storage unit 620, a bus 630 connecting the different system components (including the storage unit 620 and the processing unit 610), and a display unit 640.
Wherein the storage unit stores program code that is executable by the processing unit 610 such that the processing unit 610 performs steps according to various exemplary embodiments of the present disclosure described in the above-described "exemplary methods" section of the present specification. For example, the processing unit 610 may perform various steps as shown in fig. 2.
As another example, the electronic device may implement the various steps shown in fig. 2.
The storage unit 620 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 621 and/or cache memory 622, and may further include Read Only Memory (ROM) 623.
The storage unit 620 may also include a program/utility 624 having a set (at least one) of program modules 625, such program modules 625 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 630 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 670 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 600, and/or any devices (e.g., routers, modems, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 650. Also, electronic device 600 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 660. As shown, network adapter 660 communicates with other modules of electronic device 600 over bus 630. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 600, including, but not limited to, microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment, a computer readable storage medium is also provided, e.g., a memory, comprising instructions executable by a processor of an apparatus to perform the above method. Alternatively, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
In an exemplary embodiment, a computer program product is also provided, comprising a computer program/instruction which, when executed by a processor, implements the knowledge extraction method in the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any adaptations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.