Disclosure of Invention
Aiming at the problems of single-mode preference, low cross-scene generalization capability, and low locating precision in video semantic locating, the invention aims to provide a training framework that prevents a model from overfitting to preference information in video clips, so that the model truly understands both modalities, the video and the sentence, simultaneously and performs semantic locating in the video according to the semantic information of both modalities. The core of the invention is to use two twin models at the same time: one model learns the preference information of video clips from the training data and is used to adaptively eliminate the video unimodal preference problem of the other model, thereby improving the cross-scene generalization capability and the locating capability for video clips.
Different from existing semantic locating models, which are trained with only a single model, the invention simultaneously uses two twin models sharing the same backbone network structure and uses them to eliminate the unimodal preference information (e.g., about visual content and time intervals) learned during training. The two twin models differ in that the first model reads only the video input and not the sentence, while the other model normally reads both the complete video input and the queried sentence. The first model learns the preference information and predicts a locating result from the video modality alone; the weights of the training samples are then adjusted according to the learned preference information, so that the training samples received by the second model no longer carry data preference information and cannot be guessed from the video modality alone, forcing the second model to understand the semantic information shared by the video and language modalities.
Specifically, the technical scheme adopted by the invention is as follows:
A cross-scene video semantic locating method based on sample weight adjustment comprises the following steps:
Extracting, with a video encoder, visual feature representations of video candidate windows from an input video;
Encoding the sentence by using a language encoder to obtain the feature representation of the sentence;
Fusing the visual feature representation of the video candidate window and the feature representation of the sentence to obtain the visual-semantic feature representation of the video candidate window;
Utilizing a visual locator to predict a locating result only according to the visual features of the video candidate window, learning the preference information of video clips from the training samples, and adjusting the weights of the training samples according to the learned preference information;
Utilizing a visual-semantic locator to predict a locating result according to the visual-semantic feature representation of the video candidate window, and training the visual-semantic locator with the weight-adjusted training samples to obtain a de-biased visual-semantic locator;
And for a video and a sentence to be located, performing video semantic locating by using the trained visual-semantic locator.
Further, the extracting, with the video encoder, visual feature representations of video candidate windows from the input video, includes:
The video encoder divides an input video into a plurality of short video clips, samples the clips at fixed intervals to obtain N video base segments, and extracts an I3D base feature from each video base segment by using a pre-trained I3D model;
and applying a boundary matching operator to all the I3D base features contained in the video candidate window to obtain the visual feature representation of the video candidate window.
Further, the applying the boundary matching operator to obtain the visual feature representation of the video candidate window includes:
performing bilinear interpolation on all the I3D base features covered by the video candidate window (a, b) with start time a and end time b, and sampling to obtain K base feature vectors, wherein K is a preset hyper-parameter;
passing the K base feature vectors through a convolution layer with kernel size K and a nonlinear ReLU layer to obtain one feature vector f^V_ab as the visual feature representation of the video candidate window;
repeating the above process for all (a, b) with 1 ≤ a ≤ b ≤ N to obtain the feature vectors {f^V_ab} of all video candidate windows.
Further, the encoding the sentence by the language encoder to obtain the feature representation of the sentence includes:
and taking a sentence sequence formed by a plurality of word features as input, sending it to a long short-term memory network (LSTM), and extracting the feature representation of the sentence.
Further, the training process of the visual localizer and the visual-semantic localizer includes:
In the visual locator, the visual feature representations of the video candidate windows are passed directly to a fully connected layer and a sigmoid layer to generate the predicted values of a visual score map; in the visual-semantic locator, the visual-semantic feature representations of the video candidate windows are input into a fully connected layer and a sigmoid layer to generate the predicted values of a visual-semantic score map of the candidate windows;
And calculating the loss functions of the visual locator and the visual-semantic locator respectively, adjusting the weights of the training samples according to the locating result of the visual locator, and training the visual locator and the visual-semantic locator end to end to obtain the de-biased visual-semantic locator.
Further, the calculating the loss functions of the visual locator and the visual-semantic locator respectively, adjusting the weights of the training samples according to the locating result of the visual locator, and training the visual locator and the visual-semantic locator end to end comprises:
for each candidate window in the visual and visual-semantic score maps, calculating the IoU score IoU_ab between its time boundary (a, b) and the annotation T, and then assigning a soft label gt_ab to IoU_ab according to the preset hyper-parameters μmin and μmax;
training the visual locator and the visual-semantic locator according to the obtained soft labels gt_ab, and adjusting the weights of the training samples in the visual-semantic locator according to the locating result of the visual locator, which comprises the following steps:
training the visual locator using the cross-entropy loss L_v, defined as L_v = −Σ_(a,b) [gt_ab·log p′_ab + (1 − gt_ab)·log(1 − p′_ab)];
calculating the un-weighted loss of the visual-semantic locator with the same cross-entropy form: L_vs = −Σ_(a,b) [gt_ab·log p_ab + (1 − gt_ab)·log(1 − p_ab)];
calculating the cosine similarity s between the predicted values p′ = {p′_ab} of the visual locator and the true values gt = {gt_ab}, estimating the importance 1 − s^α of the training sample from s, and adjusting the loss function of the visual-semantic locator to L′_vs = (1 − s^α)·L_vs, where α is a hyper-parameter controlling the weight variation;
composing the final loss function L = L_v + L′_vs from the two parts L_v and L′_vs, and training the visual locator and the visual-semantic locator simultaneously in an end-to-end fashion according to L, so as to mitigate the video unimodal preference in the model.
A sample weight adjustment-based cross-scene video semantic locating device employing the above method, comprising:
A video encoder for extracting visual feature representations of video candidate windows from an input video;
a language encoder for encoding the sentence to obtain the feature representation of the sentence;
a visual locator for predicting a locating result only according to the visual feature representation of the video candidate window and learning the preference information of video clips from the training samples;
a visual-semantic locator for fusing the visual feature representation of the video candidate window with the feature representation of the sentence to obtain the visual-semantic feature representation of the video candidate window, and then predicting the locating result according to this visual-semantic feature representation;
and a sample weight adjustment module for adjusting the weights of the training samples according to the preference information learned by the visual locator, and training the visual-semantic locator with the weight-adjusted training samples to obtain the de-biased visual-semantic locator.
Compared with the prior art, the invention has the advantages that:
(1) Enhanced generalization capability under cross-scene conditions. Conventional video semantic locating models suffer from the unimodal preference problem: because the data preference of the training set does not exist in a cross-scene data set, their generalization capability under the cross-scene setting is seriously damaged. After training with sample weight adjustment, the unimodal preference of the model is corrected, so its generalization advantage under cross-scene conditions is obvious.
(2) Annotating additional labels to balance the training data can help alleviate the unimodal preference of a video semantic locating model and improve its cross-scene generalization; however, such annotation consumes a large amount of extra human resources, and balanced video semantic locating data are difficult to obtain. The present method adaptively adjusts the weights and distribution of the training samples according to the distribution characteristics of the training data, correcting the unimodal preference problem of the video semantic locating model at the algorithm level rather than the data level.
(3) The training procedure is flexible and simple. The dual-model training framework eliminates the video unimodal preference problem and can be carried out flexibly end to end: the importance of the training samples is learned during training to adaptively rebalance the training data, and the influence of video clip preference on the video semantic locating model is removed automatically.
(4) Compared with existing semantic locating techniques, the test stage requires no extra computation or storage resources. Although the invention trains two semantic locating models simultaneously in the training stage, the visual locator is deleted in the test stage and only the single visual-semantic locator is kept; therefore, although the unimodal preference of the model has been eliminated, the computation in the test stage is the same as that of existing semantic locating techniques, and no extra computation or storage resources are needed.
Detailed Description
The invention will now be described in further detail by means of specific examples and the accompanying drawings.
The specific flow of the cross-scene video semantic locating method based on sample weight adjustment is shown in fig. 1, and the method comprises the following steps:
Step 1: the video encoder divides the input video into a plurality of short video clips and samples them at fixed intervals to obtain N video base segments. For each sampled video base segment, an I3D base feature is extracted using a pre-trained I3D model.
Step 2: the language encoder encodes each word in the sentence and sends the resulting sequence of word features as input to a long short-term memory network (LSTM) to extract the sentence feature.
Step 3: the visual feature representation of each video candidate window is constructed from the I3D base features, i.e., a boundary matching operator is applied to all the I3D base features contained in the candidate window to obtain the visual feature representation of the video candidate window.
Step 4: the visual feature representation of the video candidate window and the feature representation of the sentence are fused, and the visual-semantic feature of the video candidate window is constructed from their interaction.
Step 5: in the visual locator, the visual features of each video candidate window are passed directly to a fully connected layer and a sigmoid layer to generate the predicted values of the visual score map. Meanwhile, the visual-semantic features of the video candidate windows are input into another fully connected layer and sigmoid layer to generate the predicted values of the visual-semantic score map of the candidate windows. The locating results of the two locators are obtained from the visual score map and the visual-semantic score map, respectively.
Step 6: the loss functions of the visual locator and the visual-semantic locator are calculated respectively, and the weights of the training samples are adjusted according to the locating result of the visual locator. The two locators are trained end to end with the Adam algorithm, and the de-biased visual-semantic locator is obtained using the weight-adjusted training samples.
Step 7: the visual locator is discarded in the test stage, and semantic locating is performed using only the visual-semantic locator that was de-biased during training.
As shown in FIG. 1, the method learns the importance of the training samples based on 5 basic modules, thereby rebalancing the training data and eliminating the influence of video clip preference on the video semantic locating model. The names and functions of these 5 basic modules are as follows:
1. Language encoder: encodes the sentence to be located and obtains a feature representation of the sentence, which is used to effectively retrieve the video segments of interest described by the sentence in the video.
2. Video encoder: extracts the temporal and spatial visual features of the portions covered by the video candidate windows from the original video frames, obtaining feature representations that characterize the visual structure of the different video candidate windows.
3. Visual locator: predicts a locating result based only on the candidate-window visual features extracted by the video encoder, without the sentence to be queried. The visual locator learns the video clip preference information from the training data and is used to further adjust the loss function of the visual-semantic locator.
4. Visual-semantic locator: takes both modalities, the video and the sentence to be queried, as complete input and predicts the locating result according to the visual-semantic feature representation of the video candidate window.
5. Sample weight adjustment module: predicts the importance of each training sample from the prediction output of the visual locator and adjusts the weight of that training sample in the loss function of the visual-semantic locator accordingly, so that the data preference in the training samples received by the visual-semantic locator is alleviated, i.e., the de-biasing is realized.
The implementation of the steps of the present invention is specifically described below.
1. Video preprocessing
This step preprocesses the long, untrimmed video to be located: it cuts the long video into a plurality of short video clips, samples them, and extracts their visual features for the subsequent semantic understanding and locating. An I3D model pre-trained on the Kinetics dataset is used to extract the visual features of the video. The preprocessing process is shown in fig. 2 and comprises the following steps:
(Step 1) dividing the input video into a plurality of video clips, each containing T frames.
(Step 2) sampling the video clips at fixed intervals to obtain N video base segments.
(Step 3) extracting an I3D base feature vector from each sampled video base segment using the pre-trained I3D model. In total, N I3D base features V = {v_i} (i = 1..N) are obtained, representing the visual content of video base segments 1 to N.
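As an illustration of the preprocessing above, the following is a minimal sketch in PyTorch; it assumes the pre-trained I3D network is available as a callable i3d_model that maps a clip tensor of shape (1, 3, T, H, W) to a single feature vector, and the function name and dimensions are hypothetical rather than part of the invention.

```python
import torch

def extract_base_segment_features(frames: torch.Tensor, i3d_model, T: int, N: int) -> torch.Tensor:
    """frames: (num_frames, 3, H, W) decoded video frames. Returns V = {v_i}, i = 1..N, shape (N, D)."""
    num_frames = frames.shape[0]
    # Step 1: divide the video into consecutive clips of T frames each.
    clips = [frames[s:s + T] for s in range(0, num_frames - T + 1, T)]
    # Step 2: sample the clips at a fixed interval so that (at most) N base segments remain.
    stride = max(len(clips) // N, 1)
    sampled = clips[::stride][:N]
    # Step 3: run the pre-trained I3D model on each sampled base segment.
    feats = []
    with torch.no_grad():
        for clip in sampled:
            inp = clip.permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, T, H, W), as I3D expects
            feats.append(i3d_model(inp).squeeze(0))
    return torch.stack(feats)  # base features V, shape (N, D)
```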
2. Sentence feature extraction
To locate the semantic content described by a sentence in a video, the sentence to be queried must first have its semantic features extracted and vectorized. The method adopts a word vector model pre-trained on large-scale text data and a long short-term memory (LSTM) network to extract the semantic features of the sentence. Let the number of words in the sentence be Vs; the feature extraction process is shown in fig. 3 and comprises the following steps:
(Step 1) a word vector (word embedding) model pre-trained on large-scale text data encodes each word in the sentence, yielding one word feature vector w_i per word. In total, Vs word features are extracted from the sentence, which can be expressed as the sentence sequence {w_i} (i = 1..Vs).
(Step 2) the sentence sequence {w_i} (i = 1..Vs) is fed sequentially into the LSTM network, and the last hidden state of the LSTM network is taken as output.
(Step 3) the last hidden state of the LSTM network is input into a fully connected layer to extract the final sentence feature f_S.
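A minimal sketch of the language encoder above, assuming the sentence has already been tokenized into word indices and the pre-trained word vectors are loaded into a tensor pretrained_vectors (vocab_size × embed_dim); the hidden and output dimensions are illustrative, not those used by the invention.

```python
import torch
import torch.nn as nn

class LanguageEncoder(nn.Module):
    def __init__(self, pretrained_vectors: torch.Tensor, hidden_dim: int = 512, out_dim: int = 512):
        super().__init__()
        # Step 1: pre-trained word embedding, one vector w_i per word.
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        # Step 2: LSTM over the word sequence {w_i}, i = 1..Vs.
        self.lstm = nn.LSTM(pretrained_vectors.shape[1], hidden_dim, batch_first=True)
        # Step 3: fully connected layer producing the sentence feature f_S.
        self.fc = nn.Linear(hidden_dim, out_dim)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        w = self.embedding(word_ids)       # (batch, Vs, embed_dim)
        _, (h_last, _) = self.lstm(w)      # last hidden state of the LSTM
        return self.fc(h_last[-1])         # f_S: (batch, out_dim)
```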
3. Generating visual features for candidate windows
Starting from the I3D features of the video base segments obtained by preprocessing, the visual features of the candidate windows can be generated; these features describe and represent the video visual content covered by each candidate window. The method applies a boundary matching operator (Boundary Matching operation, BM) to all the I3D base features contained within a candidate window to obtain the visual feature representation of the video candidate window.
The BM operator efficiently generates candidate-window features from the video base segments through a series of bilinear sampling and convolution operations. For a video candidate window (a, b) with start time a and end time b, the specific steps of the BM operator are as follows:
(Step 1) performing bilinear interpolation on all the I3D base features covered by (a, b) and sampling K base feature vectors, where K is a preset hyper-parameter.
(Step 2) passing the K base feature vectors through a convolution layer with kernel size K and a ReLU nonlinearity to obtain one feature vector f^V_ab as the visual feature representation of this video candidate window.
(Step 3) repeating the above process for all (a, b) with 1 ≤ a ≤ b ≤ N to obtain the feature vectors {f^V_ab} of all video candidate windows, where N is the number of video base segments.
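A possible sketch of the BM operator under the above description, assuming the base-segment features V have shape (N, D); the output dimension, the 1-based window indices, and the use of 1-D linear interpolation along the temporal axis to realize the bilinear sampling are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryMatching(nn.Module):
    def __init__(self, feat_dim: int, K: int, out_dim: int = 512):
        super().__init__()
        self.K = K
        # Step 2: a convolution with kernel size K plus ReLU collapses the K sampled vectors into one.
        self.conv = nn.Conv1d(feat_dim, out_dim, kernel_size=K)

    def window_feature(self, V: torch.Tensor, a: int, b: int) -> torch.Tensor:
        # Step 1: interpolate the base features covered by window (a, b) to exactly K vectors.
        span = V[a - 1:b].t().unsqueeze(0)                               # (1, D, b-a+1)
        sampled = F.interpolate(span, size=self.K, mode='linear', align_corners=True)
        return F.relu(self.conv(sampled)).squeeze(-1).squeeze(0)         # f^V_ab: (out_dim,)

    def forward(self, V: torch.Tensor) -> torch.Tensor:
        # Step 3: repeat for all windows with 1 <= a <= b <= N.
        N = V.shape[0]
        feats = torch.zeros(N, N, self.conv.out_channels)
        for a in range(1, N + 1):
            for b in range(a, N + 1):
                feats[a - 1, b - 1] = self.window_feature(V, a, b)
        return feats  # the map {f^V_ab}
```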
4. Generating visual-semantic features for candidate windows
The visual-semantic feature of each candidate window is generated by fusing the features of the visual and language modalities, as follows:
(Step 1) for candidate window (a, b), the visual feature f^V_ab of the candidate window interacts with the language feature f_S, i.e., f_S is multiplied point-wise with f^V_ab to obtain the visual-semantic feature M′_ab.
(Step 2) the visual-semantic feature M′_ab is normalized by its L2 norm, M_ab = M′_ab / ||M′_ab||_2, to obtain the visual-semantic feature M_ab of the candidate window.
(Step 3) the above process is repeated for all candidate windows (a, b) to obtain their visual-semantic features {M_ab}.
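The fusion of Steps 1-3 can be written compactly as below; this is a sketch that assumes the window feature map has shape (N, N, D) and the sentence feature f_S has shape (D,).

```python
import torch
import torch.nn.functional as F

def fuse_visual_semantic(window_feats: torch.Tensor, f_S: torch.Tensor) -> torch.Tensor:
    # Step 1: point-wise multiplication of f_S with every window feature f^V_ab.
    fused = window_feats * f_S               # broadcasts over all (a, b)
    # Steps 2-3: L2-normalize each fused vector to obtain the full map {M_ab}.
    return F.normalize(fused, p=2, dim=-1)   # (N, N, D)
```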
5. Visual locator and visual-semantic locator for locating
In the method, two twin locator models, a visual locator and a visual-semantic locator, are used simultaneously to locate the video clip. The visual locator guesses the most likely segment directly from the visual features {f^V_ab} of all candidate windows according to the unimodal preference information of the video, without the sentence to be queried, and thus learns the annotation's unimodal preference from the training data. The visual-semantic locator takes the visual-semantic features of the candidate windows as input and locates the segment described by the sentence by understanding the content of both modalities, video and sentence, simultaneously.
The process of locating with a visual locator and a visual-semantic locator, respectively, is described as follows:
For the visual locator, the visual feature f^V_ab of each video candidate window is input to a fully connected layer and a sigmoid layer, directly generating the predicted value p′_ab of the visual score map for each candidate window (a, b). The candidate window argmax_(a,b) p′_ab corresponding to the maximum value is used as the final visual locating result. The visual-semantic locator locates the video clip in a similar manner and differs from the visual locator only in its input features: it takes the visual-semantic feature M_ab of each video candidate window, which is passed to another fully connected layer and sigmoid layer, generating the predicted value p_ab of the visual-semantic score map for each candidate window (a, b). The candidate window argmax_(a,b) p_ab corresponding to the maximum value is used as the final visual-semantic locating result.
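A minimal sketch of the two twin locator heads and the arg-max selection described above; both heads are a fully connected layer followed by a sigmoid and differ only in their input features. The feature dimension and the 1-based (a, b) convention are assumptions.

```python
import torch
import torch.nn as nn

class LocatorHead(nn.Module):
    """Shared head structure used by both the visual and the visual-semantic locator."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, N, D) -> score map (N, N), one score per candidate window (a, b).
        return torch.sigmoid(self.fc(feats)).squeeze(-1)

def best_window(score_map: torch.Tensor):
    # argmax_(a,b) of the score map, returned as 1-based (a, b) indices.
    idx = torch.argmax(score_map).item()
    a, b = divmod(idx, score_map.shape[1])
    return a + 1, b + 1
```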
6. Sample weight adjustment in the training process
6.1 Determining the labels of training samples
In the training phase, each training sample contains an input video V, a sentence S, and the time segment annotation T corresponding to the sentence. During training, it is necessary to determine which candidate windows in the visual-semantic score map correspond to the true annotation and to train the model accordingly. First, the IoU score between each candidate window and the annotated time segment is calculated, and a soft label designates, according to this IoU score, the extent to which the candidate window belongs to the true result. For each candidate window in the visual and visual-semantic score maps, the IoU score IoU_ab is calculated between its time boundary (a, b) and the annotation T. Then, according to the preset hyper-parameters μmin and μmax, a soft label gt_ab is assigned to IoU_ab; the assignment process is shown in fig. 4 and is as follows:
(i) when IoU_ab ≤ μmin, gt_ab = 0;
(ii) when μmin ≤ IoU_ab ≤ μmax, gt_ab = (IoU_ab − μmin) / (μmax − μmin);
(iii) when μmax ≤ IoU_ab, gt_ab = 1.
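A sketch of the soft-label assignment (i)-(iii) above, assuming the IoU map IoU_ab is precomputed as a tensor and that the transition between μmin and μmax is the linear rescaling implied by the boundary conditions.

```python
import torch

def soft_labels(iou: torch.Tensor, mu_min: float, mu_max: float) -> torch.Tensor:
    # Rescale IoU_ab linearly between mu_min and mu_max, then clamp so that
    # gt_ab = 0 below mu_min and gt_ab = 1 above mu_max.
    gt = (iou - mu_min) / (mu_max - mu_min)
    return gt.clamp(min=0.0, max=1.0)
```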
6.2 Calculating the loss function and adjusting the sample weights
In the sample weight adjustment module, the visual locator and the visual-semantic locator are trained according to the obtained soft labels gt_ab, while the weight of the training sample in the visual-semantic locator is adjusted according to the locating result of the visual locator. The process is shown in fig. 5 and includes the following steps:
(Step 1) training the visual locator with a conventional cross-entropy loss L_v, defined for each training sample as L_v = −Σ_(a,b) [gt_ab·log p′_ab + (1 − gt_ab)·log(1 − p′_ab)].
(Step 2) similarly to Step 1, calculating the un-weighted loss of the visual-semantic locator with the conventional cross-entropy function: L_vs = −Σ_(a,b) [gt_ab·log p_ab + (1 − gt_ab)·log(1 − p_ab)].
(Step 3) calculating the cosine similarity between the visual locator prediction p′ = {p′_ab} and the true value gt = {gt_ab}: s = ⟨p′, gt⟩ / (||p′||·||gt||).
(Step 4) estimating the importance 1 − s^α of the training sample from the cosine similarity, and adjusting the loss function of the visual-semantic locator to L′_vs = (1 − s^α)·L_vs, where α is a hyper-parameter that controls the weight change.
(Step 5) composing the final loss function from the two balanced parts, the visual locator loss L_v and the adjusted visual-semantic locator loss L′_vs: L = L_v + L′_vs.
(Step 6) using the Adam algorithm as the optimizer, training the visual locator and the visual-semantic locator simultaneously in an end-to-end fashion according to the loss function L, so as to mitigate the video unimodal preference in the model.
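The loss computation and sample weight adjustment of Steps 1-6 may be sketched as follows for one training sample; the (N, N) score maps p_vis and p_vs, the value of α, the Adam settings, and the choice to detach the weight from the gradient are illustrative assumptions rather than prescriptions of the invention.

```python
import torch
import torch.nn.functional as F

def dual_locator_loss(p_vis: torch.Tensor, p_vs: torch.Tensor,
                      gt: torch.Tensor, alpha: float) -> torch.Tensor:
    # Steps 1-2: cross-entropy losses of the two locators against the soft labels gt_ab.
    loss_v = F.binary_cross_entropy(p_vis, gt)
    loss_vs = F.binary_cross_entropy(p_vs, gt)
    # Step 3: cosine similarity s between the visual locator prediction and gt.
    s = F.cosine_similarity(p_vis.flatten(), gt.flatten(), dim=0)
    # Step 4: sample importance 1 - s^alpha re-weights the visual-semantic loss
    # (detaching s here is an implementation choice, not specified in the text).
    weight = 1.0 - s.detach().clamp(min=0.0) ** alpha
    # Step 5: final loss L = L_v + L'_vs.
    return loss_v + weight * loss_vs

# Step 6 (usage sketch): end-to-end optimization with Adam.
# optimizer = torch.optim.Adam(list(visual_locator.parameters()) +
#                              list(vs_locator.parameters()), lr=1e-4)
# loss = dual_locator_loss(p_vis, p_vs, gt, alpha=2.0)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```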
7. Test phase
The training phase described above involves two locators, the visual locator and the visual-semantic locator, whereas the test phase omits the visual locator and involves only the visual-semantic locator. After training with the method, the visual-semantic locator has been adjusted through the training-sample weights, so its video unimodal preference is reduced. Therefore, the biased visual locator can be deleted in the test stage, and only the de-biased visual-semantic locator is used for video semantic locating. The visual-semantic locator outputs the prediction scores p_ab for all candidate windows; the maximum max_(a,b) p_ab is taken, and the corresponding time segment argmax_(a,b) p_ab is used as the final locating result.
The invention evaluates the locating precision of the model under the cross-scene setting; the most common evaluation index is R@K, θ, which is the fraction of queries for which at least one of the top-K locating results has a region overlap (IoU) with the true value exceeding θ. The commonly used video semantic locating datasets are ActivityNet Captions, Charades-STA, and DiDeMo. The invention trains and evaluates models on datasets of two different scenarios. Given the large scale of ActivityNet Captions and the variety of scenes and activities it covers, ActivityNet Captions is used as the training set, after which models are tested on Charades-STA and DiDeMo (referred to as AcNet2Charades and AcNet2DiDeMo, respectively). Tables 1 and 2 compare the locating accuracy with other existing methods under the cross-scene settings AcNet2Charades and AcNet2DiDeMo.
Table 1. Comparison of model performance under the cross-scene setting AcNet2Charades
| Method | R@1,IoU=0.5 | R@1,IoU=0.7 | R@5,IoU=0.5 | R@5,IoU=0.7 |
| PFGA | 5.75 | 1.53 | - | - |
| SCDM | 15.91 | 6.19 | 54.04 | 30.39 |
| 2D-TAN | 15.81 | 6.30 | 59.06 | 31.53 |
| The invention | 21.45 | 10.38 | 62.34 | 32.90 |
Table 2. Comparison of model performance under the cross-scene setting AcNet2DiDeMo
| Method | R@1,IoU=0.5 | R@1,IoU=0.7 | R@5,IoU=0.5 | R@5,IoU=0.7 |
| PFGA | 6.24 | 2.01 | - | - |
| SCDM | 10.88 | 4.34 | 43.30 | 18.40 |
| 2D-TAN | 12.50 | 5.50 | 44.88 | 20.73 |
| The invention | 13.11 | 7.70 | 44.98 | 21.32 |
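For reference, a sketch of how the R@K, θ metric reported in Tables 1 and 2 may be computed; the input format (per query, a list of predicted (start, end) windows ranked by score, plus the ground-truth segment) is an assumption.

```python
from typing import List, Tuple

def temporal_iou(p: Tuple[float, float], g: Tuple[float, float]) -> float:
    inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
    union = (p[1] - p[0]) + (g[1] - g[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(preds: List[List[Tuple[float, float]]],
                gts: List[Tuple[float, float]], k: int, theta: float) -> float:
    # Fraction of queries whose top-K predictions contain at least one window
    # with IoU >= theta against the ground-truth segment.
    hits = sum(1 for ranked, gt in zip(preds, gts)
               if any(temporal_iou(p, gt) >= theta for p in ranked[:k]))
    return hits / len(gts)
```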
The invention adjusts the weights of different training samples in the loss function according to the cosine similarity s, here to 1 − s^α. In fact, the weight adjustment in the loss function is not limited to this form: the training data distribution can be balanced as long as training samples with low similarity are given a higher weight in the loss function and samples with high similarity are given a lower weight.
Based on the same inventive concept, another embodiment of the present invention provides a cross-scene video semantic locating device based on sample weight adjustment using the above method, which includes:
A video encoder for extracting visual feature representations of video candidate windows from an input video;
a language encoder for encoding the sentence to obtain the feature representation of the sentence;
a visual locator for predicting a locating result only according to the visual feature representation of the video candidate window and learning the preference information of video clips from the training samples;
a visual-semantic locator for fusing the visual feature representation of the video candidate window with the feature representation of the sentence to obtain the visual-semantic feature representation of the video candidate window, and then predicting the locating result according to this visual-semantic feature representation;
and a sample weight adjustment module for adjusting the weights of the training samples according to the preference information learned by the visual locator, and training the visual-semantic locator with the weight-adjusted training samples to obtain the de-biased visual-semantic locator.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (a computer, server, smart phone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the inventive method.
Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, implements the steps of the inventive method.
The above-disclosed embodiments of the present invention are intended to aid the understanding of the invention and to enable it to be carried into practice, and it will be understood by those of ordinary skill in the art that various alternatives, variations and modifications are possible without departing from the spirit and scope of the invention. The invention should not be limited to what is disclosed in the embodiments of the specification; its scope of protection is defined by the claims.