CN115761560B - A cross-scene video semantic localization method and device based on sample weight adjustment - Google Patents

A cross-scene video semantic localization method and device based on sample weight adjustment

Info

Publication number
CN115761560B
Authority
CN
China
Prior art keywords
visual
video
semantic
locator
localizer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111026168.6A
Other languages
Chinese (zh)
Other versions
CN115761560A (en)
Inventor
包培钧
穆亚东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202111026168.6A
Publication of CN115761560A
Application granted
Publication of CN115761560B
Status: Active
Anticipated expiration

Abstract

Translated from Chinese


The present invention relates to a cross-scene video semantic localization method and device based on sample weight adjustment. The invention uses two twin models with the same backbone network structure at the same time: the first model reads only the video input and not the sentence, while the second reads both the complete video input and the sentence. The first model learns preference information by predicting the localization result from the video modality alone, and the weights of the training samples are adjusted according to this learned preference, so that the training samples received by the second model carry no data preference, forcing the second model to understand the semantic information shared by the video and language modalities. The invention provides a training framework that prevents the model from overfitting to the preference information in video clips, so that it can truly understand both the video and the sentence and localize semantics in the video according to the semantic information of both. The generalization ability of the invention under cross-scene conditions shows a clear advantage.

Description

Cross-scene video semantic locating method and device based on sample weight adjustment
Technical Field
The invention relates to cross-scene video semantic localization, and in particular to a method and device that use sample weight adjustment to improve the localization accuracy and generalization capability of a cross-scene video semantic localization model. It belongs to the field of computer vision.
Background Art
Video semantic localization is one of the important problems in computer vision and has received increasing attention in recent years. Video semantic localization models have great application potential in fields such as video surveillance, robotics, and multimedia retrieval. Given a long, untrimmed video and a natural-language sentence, the goal of video semantic localization is to locate the start and end times of the event described by the sentence in the video. Using natural language not only frees the content to be located from a predefined list of action labels, but also allows flexible description of object attributes and relationships. For example, natural language can be used to locate a video clip with complex semantic information such as "a man in red takes a cup out of the refrigerator and drinks water from it".
Video semantic localization is a typical multimodal understanding task: a model must understand the semantics of both the video and the language to give a correct localization result. However, existing video semantic localization models exhibit a pronounced single-modality preference, that is, they can make predictions directly from the video modality alone without understanding the content of the sentence. This reduces localization accuracy on cross-scene test data. The cause of the single-modality preference is an obvious data bias in the training annotations: the visual content and the time intervals of the annotated segments follow long-tailed distributions, i.e., some visual content and some interval lengths appear with high frequency among the segments to be localized. For example, in the training data the visual content "stand" is localized far more frequently than "cut", and there is a similar bias in the time-interval distribution: segments with longer intervals appear in the annotations far more often than shorter ones. When such data bias is strong enough, the model also develops a single-modality preference, i.e., it tends to guess the segment to be localized using only the visual content and time intervals of the video modality, without understanding the sentence. The resulting biased model cannot understand the video and sentence modalities simultaneously and therefore cannot infer the semantic information shared by the two modalities. Although biased models perform well on test sets drawn from the same scenes as the training data, whose data bias is similar to that of the training set, once the model is applied to real cross-scene data the distributions of visual content and time intervals change, and the data bias is no longer the same as in training. Because the training-stage bias no longer holds in cross-scene applications and the biased model cannot truly localize semantics across the video and language modalities, its generalization capability and localization accuracy are harmed. This single-modality preference severely limits the potential of video semantic localization models for industrial application.
Disclosure of Invention
To address the problems of single-modality preference, weak cross-scene generalization, and low localization accuracy in video semantic localization, the invention provides a training framework that prevents the model from overfitting to the preference information in video clips, so that it can truly understand both the video and the sentence and localize semantics in the video according to the semantic information of the two modalities. The core of the invention is the simultaneous use of two twin models, one of which learns the preference information of video clips from the training data and is used to adaptively eliminate the video single-modality preference of the other, thereby improving cross-scene generalization and localization capability.
Unlike previous semantic localization methods that train a single model, the invention trains two twin models with the same backbone network structure at the same time and uses them to eliminate the single-modality preference, concerning visual content, time intervals, and so on, that is learned during training. The two twin models differ in that the first reads only the video input and not the sentence, while the second normally reads both the complete video input and the query sentence. The first model learns the preference information by predicting the localization result from the video modality alone; the weights of the training samples are then adjusted according to this learned preference, so that the training samples received by the second model no longer carry data preference and cannot be guessed from the video modality alone, forcing the second model to understand the semantic information shared by the video and language modalities.
Specifically, the technical scheme adopted by the invention is as follows:
A cross-scene video semantic locating method based on sample weight adjustment comprises the following steps:
Extracting, with a video encoder, visual feature representations of video candidate windows from an input video;
Encoding the sentence with a language encoder to obtain the feature representation of the sentence;
Fusing the visual feature representation of each video candidate window with the feature representation of the sentence to obtain the visual-semantic feature representation of the video candidate window;
Using a visual locator to predict a localization result from the visual features of the video candidate windows only, learning the preference information of video clips from the training samples, and adjusting the weights of the training samples according to the learned preference information;
Using a visual-semantic locator to predict a localization result from the visual-semantic feature representations of the video candidate windows, and training the visual-semantic locator with the reweighted training samples to obtain a de-biased (preference-removed) visual-semantic locator;
For the video and sentence to be localized, performing video semantic localization with the trained visual-semantic locator.
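As an illustration only, the steps above can be wired together as in the following sketch. It assumes PyTorch; the encoders are replaced by random stand-in features, and all dimensions, module names and the fusion form (element-wise product plus L2 normalization, as described above) are illustrative rather than the patented implementation.

```python
# Wiring sketch of the claimed pipeline (PyTorch assumed). Random tensors stand
# in for the video and language encoders; dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

N, D = 16, 512                                   # base segments, feature width

class LocatorHead(nn.Module):
    """Fully connected layer + sigmoid producing one score per candidate window."""
    def __init__(self, d=D):
        super().__init__()
        self.fc = nn.Linear(d, 1)
    def forward(self, feats):                    # feats: (N, N, D)
        return torch.sigmoid(self.fc(feats)).squeeze(-1)   # score map (N, N)

visual_locator = LocatorHead()                   # reads video features only
visual_semantic_locator = LocatorHead()          # reads fused features

window_feats = torch.randn(N, N, D)              # stand-in for BM window features
sentence_feat = torch.randn(D)                   # stand-in for the LSTM sentence feature

# Visual locator: video modality only -> learns the annotation preference.
p_visual = visual_locator(window_feats)

# Visual-semantic locator: fuse window features with the sentence feature, then score.
fused = F.normalize(window_feats * sentence_feat, dim=-1)
p_visual_semantic = visual_semantic_locator(fused)

# p_visual is used only to re-weight training samples for the visual-semantic
# locator (see the loss sketch in the detailed description); it is dropped at test time.
```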
Further, extracting the visual feature representations of the video candidate windows from the input video with the video encoder includes:
The video encoder divides the input video into a number of short video clips, samples these clips at fixed intervals to obtain N video base segments, and extracts a series of I3D basic features from each base segment with a pre-trained I3D model;
A boundary matching operator is applied to all the I3D basic features contained in a video candidate window to obtain the visual feature representation of that window.
Further, applying the boundary matching operator to obtain the visual feature representation of a video candidate window includes:
performing bilinear interpolation over all the I3D basic features covered by the video candidate window (a, b), whose start time is a and end time is b, and sampling K basic feature vectors, where K is a preset hyper-parameter;
passing the K basic feature vectors through a convolution layer with kernel size K and a nonlinear ReLU layer to obtain one feature vector, denoted here F_ab, as the visual feature representation of the video candidate window;
repeating the above process for all a, b with 1 ≤ a ≤ b ≤ N to obtain the feature vectors of all video candidate windows.
Further, encoding the sentence with the language encoder to obtain its feature representation includes:
feeding a sentence sequence composed of word features into a long short-term memory network (LSTM) and extracting the feature representation of the sentence.
Further, the training process of the visual locator and the visual-semantic locator includes:
In the visual locator, the visual feature representations of the video candidate windows are passed directly to a fully connected layer and a sigmoid layer to generate the predicted values of a visual score map; in the visual-semantic locator, the visual-semantic feature representations of the video candidate windows are input to a fully connected layer and a sigmoid layer to generate the predicted values of a visual-semantic score map for the candidate windows;
The loss functions of the visual locator and the visual-semantic locator are calculated separately, the weights of the training samples are adjusted according to the localization result of the visual locator, and the visual locator and the visual-semantic locator are trained end to end to obtain the de-biased visual-semantic locator.
Further, calculating the loss functions of the visual locator and the visual-semantic locator respectively, adjusting the weights of the training samples according to the localization result of the visual locator, and training the visual locator and the visual-semantic locator end to end comprises:
For each candidate window in the visual and visual-semantic score maps, calculating the IoU score IoU_ab between its time boundary (a, b) and the annotation T, and then assigning a soft label gt_ab to IoU_ab according to the preset hyper-parameters μ_min and μ_max;
Training the visual locator and the visual-semantic locator with the obtained soft labels gt_ab, and adjusting the weights of the training samples in the visual-semantic locator according to the localization result of the visual locator, which comprises:
training the visual locator using a cross-entropy function as its loss function;
calculating, with the cross-entropy function, the loss function of the visual-semantic locator before weight adjustment;
calculating the cosine similarity s between the predicted values p' = {p'_ab} of the visual locator and the ground truth gt = {gt_ab}, estimating the importance of the training sample as 1 - s^α from s, and reweighting the loss function of the visual-semantic locator accordingly, where α is a hyper-parameter controlling the weight variation;
composing the final loss function from the two parts, the loss of the visual locator and the adjusted loss of the visual-semantic locator, and training the two locators simultaneously, in an end-to-end fashion, according to this final loss to mitigate the video single-modality preference in the model.
A sample weight adjustment-based cross-scene video semantic locating device employing the above method, comprising:
A video encoder for extracting visual feature representations of video candidate windows from an input video;
A language encoder for encoding the sentence to obtain its feature representation;
A visual locator for predicting a localization result from the visual feature representations of the video candidate windows only, and for learning the preference information of video clips from the training samples;
A visual-semantic locator for fusing the visual feature representation of each video candidate window with the feature representation of the sentence to obtain the visual-semantic feature representation of the window, and then predicting a localization result from it;
A sample weight adjustment module for adjusting the weights of the training samples according to the preference information learned by the visual locator, and for training the visual-semantic locator with the reweighted training samples to obtain the de-biased visual-semantic locator.
Compared with the prior art, the invention has the following advantages:
(1) Enhanced generalization under cross-scene conditions. Conventional video semantic localization models suffer from the single-modality preference problem: when they are applied to a cross-scene dataset, the data preference present in the training set no longer holds and generalization is severely harmed. After training with sample weight adjustment, the single-modality preference of the model is corrected, giving it a clear advantage in cross-scene generalization.
(2) No additional annotation is required. Adding extra, balanced training labels would also help alleviate the single-modality preference of a video semantic localization model and improve its cross-scene generalization, but such annotation consumes a great deal of human effort, and balanced video semantic localization data is difficult to obtain. The present method instead adaptively adjusts the weights and distribution of the training samples according to the characteristics of the training data, correcting the single-modality preference at the algorithm level rather than the data level.
(3) The training procedure is flexible and simple. The dual-model training framework eliminates the video single-modality preference and can be run end to end: the importance of each training sample is learned during training, the training data is adaptively rebalanced, and the influence of video-clip preference on the video semantic localization model is removed automatically.
(4) No extra computation or storage is needed at test time. Although two localization models are trained simultaneously in the training stage, the visual locator is deleted at test time and only the single visual-semantic locator is used. The single-modality preference of the model is eliminated, yet the computation in the test stage is the same as in existing semantic localization methods, and no additional computing or storage resources are required.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a video processing flow diagram of a video encoder;
FIG. 3 is a sentence feature extraction flow chart of the language encoder;
FIG. 4 is a diagram of the soft label assignment strategy;
FIG. 5 is a diagram of sample weight adjustment.
Detailed Description
The invention will now be described in further detail by means of specific examples and the accompanying drawings.
The specific flow of the cross-scene video semantic locating method based on sample weight adjustment is shown in fig. 1, and the method comprises the following steps:
Step 1: the video encoder divides the input video into a number of short video clips and samples them at fixed intervals to obtain N video base segments. For each sampled base segment, I3D basic features are extracted with a pre-trained I3D model.
Step 2: the language encoder encodes each word in the sentence, and the resulting sequence of word features is fed into a long short-term memory network (LSTM) to extract the sentence feature.
Step 3: the visual feature representation of each video candidate window is constructed from the I3D basic features, i.e., a boundary matching operator is applied to all the I3D basic features contained in the candidate window.
Step 4: the visual feature representation of each video candidate window is fused with the feature representation of the sentence; their interaction yields the visual-semantic features of the candidate window.
Step 5: in the visual locator, the visual features of each video candidate window are passed directly to a fully connected layer and a sigmoid layer to generate the predicted values of the visual score map. Meanwhile, the visual-semantic features of each candidate window are input to another fully connected layer and sigmoid layer to generate the predicted values of the visual-semantic score map. The localization results of the two locators are obtained from the visual score map and the visual-semantic score map, respectively.
Step 6: the loss functions of the visual locator and the visual-semantic locator are calculated separately, and the weights of the training samples are adjusted according to the localization result of the visual locator. The two locators are trained end to end with the Adam algorithm, and the de-biased visual-semantic locator is obtained using the reweighted training samples.
Step 7: in the test stage, the visual locator is discarded, and only the de-biased visual-semantic locator obtained during training is used for semantic localization.
As shown in FIG. 1, the method learns the importance of the training samples on top of five basic modules, thereby rebalancing the training data and removing the influence of video-clip preference on the video semantic localization model. The five modules and their functions are:
1. Language encoder: encodes the sentence to be localized and produces a feature representation of the sentence, which is used to retrieve the video segments described by the sentence.
2. Video encoder: extracts the temporal and spatial visual features of the portion of the original video frames covered by each video candidate window, producing feature representations that characterize the visual structure of the different candidate windows.
3. Visual locator: predicts a localization result from the candidate-window visual features extracted by the video encoder only, without reading the query sentence. The visual locator learns the video-clip preference information from the training data and is used to further adjust the loss function of the visual-semantic locator.
4. Visual-semantic locator: reads both modalities, the video and the query sentence, in full, and predicts a localization result from the visual-semantic feature representations of the video candidate windows.
5. Sample weight adjustment module: estimates the importance of each training sample from the prediction output of the visual locator and adjusts the weight of that sample in the loss function of the visual-semantic locator accordingly, thereby alleviating the data preference in the training samples received by the visual-semantic locator, i.e., realizing the de-biasing.
The implementation of the steps of the present invention is specifically described below.
1. Video preprocessing
This step preprocesses the long, untrimmed video to be localized: it cuts the video into a number of short clips, samples them, and extracts visual features from the sampled clips for subsequent semantic understanding and localization. An I3D model pre-trained on the Kinetics dataset is used to extract the visual features. The preprocessing, shown in FIG. 2, comprises the following steps:
(Step 1) The input video is divided into a number of short clips, each containing T frames.
(Step 2) The clips are sampled at fixed intervals to obtain N video base segments.
(Step 3) An I3D basic feature vector is extracted from each sampled base segment with the pre-trained I3D model. In total, N I3D basic features V = {v_i} (i = 1..N) are obtained, representing the visual content of video segments 1 to N.
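A minimal sketch of this preprocessing follows, assuming PyTorch. The pre-trained I3D network is abstracted as a callable `i3d(clip) -> feature` and replaced here by a random stand-in; T, N, the frame size and the feature width are illustrative values only.

```python
# Preprocessing sketch (Steps 1-3): cut into clips of T frames, sample N of them
# at a fixed interval, and extract one feature vector per sampled clip.
import torch

T, N, D = 16, 8, 1024          # frames per clip, sampled base segments, feature width

def extract_base_features(frames, i3d, t=T, n=N):
    """frames: (num_frames, 3, H, W) decoded video frames."""
    clips = frames.split(t)                          # Step 1: clips of T frames
    stride = max(len(clips) // n, 1)
    sampled = clips[::stride][:n]                    # Step 2: fixed-interval sampling
    return torch.stack([i3d(c) for c in sampled])    # Step 3: (N, D) base features

fake_i3d = lambda clip: torch.randn(D)               # stand-in for the pre-trained I3D model
frames = torch.randn(128, 3, 112, 112)               # a short decoded video
V = extract_base_features(frames, fake_i3d)          # V = {v_i}, i = 1..N
print(V.shape)                                       # torch.Size([8, 1024])
```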
2. Sentence feature extraction
To localize the semantic content described by a sentence in the video, the semantic features of the query sentence must be extracted and vectorized. The method uses a word-vector model pre-trained on large-scale text data and a long short-term memory network (long short-term memory, LSTM) to extract the semantic features of the sentence. Let the number of words in the sentence be V_s; the feature extraction, shown in FIG. 3, comprises the following steps:
(Step 1) A word-vector (word embedding) model pre-trained on large-scale text data encodes each word in the sentence, yielding one word feature vector w_i per word. In total V_s word features are extracted, forming the sentence sequence {w_i} (i = 1..V_s).
(Step 2) The sentence sequence {w_i} (i = 1..V_s) is fed into the LSTM network in order, and the last hidden state of the LSTM is taken as its output.
(Step 3) The last hidden state of the LSTM is passed through a fully connected layer to extract the final sentence feature f_S.
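A sketch of this encoder, assuming PyTorch, is shown below. A trainable nn.Embedding stands in for the pre-trained word-vector model; the vocabulary size, dimensions and output width are illustrative.

```python
# Sentence encoder sketch (Steps 1-3): word embeddings -> LSTM -> last hidden
# state -> fully connected layer -> sentence feature f_S.
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size=10000, word_dim=300, hidden=512, out_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)   # Step 1: word features w_i
        self.lstm = nn.LSTM(word_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, out_dim)              # Step 3: final feature f_S

    def forward(self, token_ids):                         # (batch, V_s) word indices
        w = self.embed(token_ids)                         # (batch, V_s, word_dim)
        _, (h_last, _) = self.lstm(w)                     # Step 2: last hidden state
        return self.fc(h_last[-1])                        # (batch, out_dim)

encoder = SentenceEncoder()
tokens = torch.randint(0, 10000, (1, 9))                  # a 9-word query sentence
f_S = encoder(tokens)
print(f_S.shape)                                          # torch.Size([1, 512])
```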
3. Generating visual features for candidate windows
Starting from the I3D features of the video base segments obtained by preprocessing, the features of the candidate windows can be generated; the visual feature of a candidate window describes and represents the visual content of the video covered by that window. The method applies a boundary matching operator (Boundary Matching Operation, BM) to all the I3D basic features contained in a candidate window to obtain the visual feature representation of the window.
The BM operator efficiently generates candidate-window features from the video base segments through a series of bilinear sampling and convolution operations. For the video candidate window (a, b) with start time a and end time b, the BM operator proceeds as follows:
(Step 1) Bilinear interpolation is performed over all the I3D basic features covered by (a, b), and K basic feature vectors are obtained by sampling, where K is a preset hyper-parameter.
(Step 2) The K basic feature vectors are passed through a convolution layer with kernel size K and a nonlinear ReLU layer to obtain one feature vector F_ab as the visual feature representation of this video candidate window.
(Step 3) The above process is repeated for all a, b with 1 ≤ a ≤ b ≤ N to obtain the feature vectors of all video candidate windows, where N is the number of video base segments.
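A sketch of the BM operator follows, assuming PyTorch. Linear interpolation over the window's base features stands in for the bilinear sampling; K, D and the output width are illustrative hyper-parameters, not the patented values.

```python
# Boundary matching sketch: sample K vectors per window, then conv (kernel K) + ReLU.
import torch
import torch.nn as nn
import torch.nn.functional as F

N, D, K, D_OUT = 8, 1024, 4, 512

class BMOperator(nn.Module):
    def __init__(self, d=D, k=K, d_out=D_OUT):
        super().__init__()
        self.k = k
        self.conv = nn.Conv1d(d, d_out, kernel_size=k)     # Step 2: conv (kernel K) + ReLU

    def window_feature(self, V, a, b):
        """V: (N, D) base features; returns F_ab for window [a, b] (1-indexed, inclusive)."""
        seg = V[a - 1:b].t().unsqueeze(0)                  # (1, D, b-a+1)
        samples = F.interpolate(seg, size=self.k, mode='linear',
                                align_corners=True)        # Step 1: K sampled vectors
        return F.relu(self.conv(samples)).squeeze()        # (D_OUT,)

    def forward(self, V):
        n = V.size(0)
        F_map = torch.zeros(n, n, self.conv.out_channels)
        for a in range(1, n + 1):                          # Step 3: all 1 <= a <= b <= N
            for b in range(a, n + 1):
                F_map[a - 1, b - 1] = self.window_feature(V, a, b)
        return F_map                                       # (N, N, D_OUT), upper triangle

V = torch.randn(N, D)                                      # I3D base features
bm = BMOperator()
print(bm(V).shape)                                         # torch.Size([8, 8, 512])
```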
4. Generating visual-semantic features for candidate windows
After the visual features of the candidate windows have been generated, the visual-semantic features of each candidate window are produced by fusing the features of the visual and language modalities:
(Step 1) For candidate window (a, b), the visual feature F_ab of the candidate window interacts with the language feature f_S, i.e., f_S and F_ab are multiplied point-wise to obtain the visual-semantic feature.
(Step 2) The visual-semantic feature is normalized by its L2 norm, giving the visual-semantic feature M_ab of the candidate window.
(Step 3) The above process is repeated for all candidate windows (a, b) to obtain their visual-semantic features {M_ab}.
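A short sketch of this fusion, assuming PyTorch; the feature widths are illustrative.

```python
# Fusion sketch (Steps 1-3): point-wise product with the sentence feature,
# followed by L2 normalisation, for every candidate window at once.
import torch
import torch.nn.functional as F

N, D = 8, 512
F_map = torch.randn(N, N, D)               # visual features F_ab of all candidate windows
f_S = torch.randn(D)                       # sentence feature from the language encoder

M = F.normalize(F_map * f_S, p=2, dim=-1)  # visual-semantic features M_ab, shape (N, N, D)
```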
5. Localization with the visual locator and the visual-semantic locator
The method uses two twin locator models, a visual locator and a visual-semantic locator, to localize the video clip. The visual locator guesses the most likely segment directly from the visual features of all candidate windows according to the single-modality preference of the video, without reading the query sentence, and thereby learns the single-modality annotation preference from the training data. The visual-semantic locator takes the visual-semantic features of the candidate windows as input and localizes the segment described by the sentence by understanding the visual and semantic content of both the video and the sentence.
The process of locating with a visual locator and a visual-semantic locator, respectively, is described as follows:
For the visual locator, the visual feature F_ab of each video candidate window is input to a fully connected layer and a sigmoid layer, directly generating the predicted value p'_ab of the visual score map for each candidate window (a, b). The candidate window argmax_(a,b) p'_ab corresponding to the maximum value is taken as the final visual localization result. The visual-semantic locator localizes the video clip in the same manner; it differs from the visual locator only in its input features. Its input is the visual-semantic feature {M_ab} of each video candidate window, which is passed to another fully connected layer and a sigmoid layer to generate the predicted value p_ab of the visual-semantic score map for each candidate window (a, b). The candidate window argmax_(a,b) p_ab corresponding to the maximum value is taken as the final visual-semantic localization result.
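A sketch of the two heads, assuming PyTorch: one fully connected layer plus a sigmoid per head, producing one score per candidate window. Dimensions are illustrative, and only windows with a ≤ b are kept as valid.

```python
# Locator-head sketch: p'_ab from visual features F_ab, p_ab from fused features M_ab.
import torch
import torch.nn as nn

N, D = 8, 512
visual_head = nn.Linear(D, 1)                 # head of the visual locator
vs_head = nn.Linear(D, 1)                     # head of the visual-semantic locator

F_map = torch.randn(N, N, D)                  # visual features F_ab
M_map = torch.randn(N, N, D)                  # visual-semantic features M_ab

p_prime = torch.triu(torch.sigmoid(visual_head(F_map)).squeeze(-1))   # p'_ab, (N, N)
p = torch.triu(torch.sigmoid(vs_head(M_map)).squeeze(-1))             # p_ab, (N, N)

# Each locator's result is the window (a, b) with the highest score in its map.
a, b = divmod(torch.argmax(p).item(), N)      # 0-indexed boundaries of argmax_(a,b) p_ab
```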
6. Sample weight adjustment during training
6.1 Determining the labels of training samples
In the training phase, each training sample contains an input video V, a sentence S, and the time-segment annotation T corresponding to the sentence. During training, it must be determined which entries of the visual-semantic score map correspond to the ground truth so that the model can be trained accordingly. First, the IoU score between each candidate window and the annotated time segment is computed, and a soft label is used to indicate which candidate windows count as true results according to the IoU score. For each candidate window in the visual and visual-semantic score maps, the IoU score IoU_ab between its time boundary (a, b) and the annotation T is computed. Then a soft label gt_ab is assigned to IoU_ab according to the preset hyper-parameters μ_min and μ_max. The assignment, shown in FIG. 4, is as follows:
(i) When IoU_ab ≤ μ_min, gt_ab = 0;
(ii) When μ_min ≤ IoU_ab ≤ μ_max, gt_ab is assigned an intermediate soft value between 0 and 1;
(iii) When μ_max ≤ IoU_ab, gt_ab = 1.
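A sketch of this assignment in Python follows. The temporal IoU is standard; the intermediate expression for case (ii) is not reproduced in this text, so a linear ramp is assumed here, and the values 0.3/0.7 for μ_min/μ_max are illustrative.

```python
# Soft-label sketch: IoU against the annotated segment T, then gt_ab per the three cases.
import torch

def temporal_iou(start, end, t_start, t_end):
    inter = max(0.0, min(end, t_end) - max(start, t_start))
    union = max(end, t_end) - min(start, t_start)
    return inter / union if union > 0 else 0.0

def soft_label(iou, mu_min=0.3, mu_max=0.7):
    if iou <= mu_min:
        return 0.0                                   # case (i)
    if iou >= mu_max:
        return 1.0                                   # case (iii)
    return (iou - mu_min) / (mu_max - mu_min)        # case (ii): assumed linear ramp

# Example: soft labels gt_ab for N base segments of 1 s each and annotation T = (2 s, 5 s).
N, T_ann = 8, (2.0, 5.0)
gt = torch.zeros(N, N)
for a in range(N):
    for b in range(a, N):
        gt[a, b] = soft_label(temporal_iou(float(a), float(b + 1), *T_ann))
```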
6.2 Calculating the loss function and adjusting the sample weights
In the sample weight adjustment module, the visual locator and the visual-semantic locator are trained with the obtained soft labels gt_ab, and at the same time the weights of the training samples in the visual-semantic locator are adjusted according to the localization result of the visual locator. The process, shown in FIG. 5, comprises the following steps:
(Step 1) A standard cross-entropy function is used as the loss function to train the visual locator on each training sample.
(Step 2) As in Step 1, the loss function of the visual-semantic locator before weight adjustment is computed with the standard cross-entropy function.
(Step 3) The cosine similarity s between the predicted values p' = {p'_ab} of the visual locator and the ground truth gt = {gt_ab} is computed.
(Step 4) The importance of the training sample is estimated as 1 - s^α from the cosine similarity, and the loss function of the visual-semantic locator is reweighted accordingly, where α is a hyper-parameter controlling the weight variation.
(Step 5) The two losses are balanced: the final loss function is composed of the loss of the visual locator and the adjusted loss of the visual-semantic locator.
(Step 6) Using the Adam algorithm as the optimizer, the visual locator and the visual-semantic locator are trained simultaneously, end to end, according to the final loss, to mitigate the video single-modality preference in the model.
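A sketch of Steps 1-6, assuming PyTorch. Binary cross-entropy over the score maps stands in for the cross-entropy losses whose exact expressions are given as formulas in the original; the simple sum of the two parts, the detached similarity, and α = 2 are assumptions made for illustration only.

```python
# Loss and weight-adjustment sketch (Steps 1-6). p_prime: visual-locator scores,
# p: visual-semantic scores, gt: soft labels; all are (N, N) maps.
import torch
import torch.nn.functional as F

def dual_locator_loss(p_prime, p, gt, alpha=2.0):
    loss_visual = F.binary_cross_entropy(p_prime, gt)                 # Step 1
    loss_vs = F.binary_cross_entropy(p, gt)                           # Step 2
    s = F.cosine_similarity(p_prime.detach().flatten(),               # Step 3 (detach: assumption)
                            gt.flatten(), dim=0)
    importance = 1.0 - s.clamp(min=0.0) ** alpha                      # Step 4: sample importance
    return loss_visual + importance * loss_vs                         # Step 5: assumed simple sum

# Step 6: optimise both locators end to end with Adam (toy leaf tensors here).
N = 8
p_prime = torch.rand(N, N, requires_grad=True)
p = torch.rand(N, N, requires_grad=True)
gt = torch.rand(N, N)
optimiser = torch.optim.Adam([p_prime, p], lr=1e-3)
dual_locator_loss(p_prime, p, gt).backward()
optimiser.step()
```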
7. Test phase
The training phase described above involves two locators, the visual locator and the visual-semantic locator, whereas the test phase omits the visual locator and uses only the visual-semantic locator. After training with this method, the visual-semantic locator has been adjusted by the training-sample weights, so its video single-modality preference is reduced. The biased visual locator can therefore be deleted in the test stage, and only the de-biased visual-semantic locator is used for video semantic localization. The visual-semantic locator outputs the prediction scores p_ab for all candidate windows; the maximum max_(a,b) p_ab is taken and the corresponding time segment argmax_(a,b) p_ab is returned as the final localization result.
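A sketch of the test-time decoding, assuming PyTorch; the duration of a base segment is an illustrative value.

```python
# Test-time sketch: the visual locator is dropped; only the visual-semantic score
# map p_ab is decoded into a time segment.
import torch

N, seconds_per_segment = 8, 4.0
p = torch.triu(torch.rand(N, N))              # p_ab for valid windows (a <= b)

flat = torch.argmax(p).item()                 # argmax_(a,b) p_ab
a, b = divmod(flat, N)                        # 0-indexed window boundaries
start = a * seconds_per_segment
end = (b + 1) * seconds_per_segment
print(f"predicted segment: [{start:.1f} s, {end:.1f} s], score {p[a, b].item():.3f}")
```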
The invention evaluates the localization accuracy of the model in the cross-scene setting. The most common evaluation metric is R@K, θ, which denotes the proportion of queries for which at least one of the top K localization results has a temporal overlap (IoU) with the ground truth exceeding θ. Commonly used video semantic localization datasets are ActivityNet Captions, Charades-STA, and DiDeMo. The invention trains and evaluates models on datasets from two different scenarios. Given the large scale of ActivityNet Captions and the variety of scenes and activities it covers, ActivityNet Captions is used as the training set, and the models are then tested on Charades-STA and DiDeMo (denoted AcNet2Charades and AcNet2DiDeMo, respectively). Tables 1 and 2 compare the localization accuracy with other existing methods under the cross-scene settings AcNet2Charades and AcNet2DiDeMo.
TABLE 1. Comparison of model performance under the cross-scene setting AcNet2Charades

Method | R@1,IoU=0.5 | R@1,IoU=0.7 | R@5,IoU=0.5 | R@5,IoU=0.7
PFGA | 5.75 | 1.53 | - | -
SCDM | 15.91 | 6.19 | 54.04 | 30.39
2D-TAN | 15.81 | 6.30 | 59.06 | 31.53
The invention | 21.45 | 10.38 | 62.34 | 32.90
TABLE 2. Comparison of model performance under the cross-scene setting AcNet2DiDeMo

Method | R@1,IoU=0.5 | R@1,IoU=0.7 | R@5,IoU=0.5 | R@5,IoU=0.7
PFGA | 6.24 | 2.01 | - | -
SCDM | 10.88 | 4.34 | 43.30 | 18.40
2D-TAN | 12.50 | 5.50 | 44.88 | 20.73
The invention | 13.11 | 7.70 | 44.98 | 21.32
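For reference, the R@K, θ metric used in the two tables above can be computed as in the following sketch; the function names and the toy data are illustrative.

```python
# R@K, theta sketch: a query counts as a hit if any of its top-K predicted windows
# has temporal IoU of at least theta with the ground-truth segment.
def temporal_iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions, ground_truths, k, theta):
    """predictions: one ranked list of (start, end) windows per query."""
    hits = sum(
        any(temporal_iou(p, gt) >= theta for p in preds[:k])
        for preds, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)

preds = [[(2.0, 6.0), (0.0, 3.0)], [(1.0, 2.0)]]      # toy ranked predictions for 2 queries
gts = [(2.5, 6.5), (5.0, 8.0)]
print(recall_at_k(preds, gts, k=1, theta=0.5))         # 0.5
```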
The invention adjusts the weights of different training samples in the loss function according to the cosine similarity s. In fact, the weight adjustment in the loss function is not limited to the manner described above: the training-data distribution can be balanced as long as training samples with low similarity receive a higher weight in the loss function and samples with high similarity receive a lower weight.
Based on the same inventive concept, another embodiment of the present invention provides a cross-scene video semantic locating device based on sample weight adjustment using the above method, which includes:
A video encoder for extracting visual feature representations of video candidate windows from an input video;
A language encoder for encoding the sentence to obtain its feature representation;
A visual locator for predicting a localization result from the visual feature representations of the video candidate windows only, and for learning the preference information of video clips from the training samples;
A visual-semantic locator for fusing the visual feature representation of each video candidate window with the feature representation of the sentence to obtain the visual-semantic feature representation of the window, and then predicting a localization result from it;
A sample weight adjustment module for adjusting the weights of the training samples according to the preference information learned by the visual locator, and for training the visual-semantic locator with the reweighted training samples to obtain the de-biased visual-semantic locator.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.
Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, implements the steps of the inventive method.
The above-disclosed embodiments of the present invention are intended to aid in understanding the contents of the present invention and to enable the same to be carried into practice, and it will be understood by those of ordinary skill in the art that various alternatives, variations and modifications are possible without departing from the spirit and scope of the invention. The invention should not be limited to what has been disclosed in the examples of the specification, but rather by the scope of the invention as defined in the claims.

Claims (10)

Translated from Chinese
1. A cross-scene video semantic localization method based on sample weight adjustment, characterized by comprising the following steps:
extracting visual feature representations of video candidate windows from an input video with a video encoder;
encoding the sentence with a language encoder to obtain the feature representation of the sentence;
fusing the visual feature representation of each video candidate window with the feature representation of the sentence to obtain the visual-semantic feature representation of the video candidate window;
using a visual locator to predict a localization result from the visual feature representations of the video candidate windows only, learning the preference information of video clips from the training samples, and adjusting the weights of the training samples according to the learned preference information;
using a visual-semantic locator to predict a localization result from the visual-semantic feature representations of the video candidate windows, and training the visual-semantic locator with the reweighted training samples to obtain a de-biased visual-semantic locator;
for the video and sentence to be localized, performing video semantic localization with the trained visual-semantic locator.

2. The method according to claim 1, characterized in that extracting the visual feature representations of the video candidate windows from the input video with the video encoder comprises:
dividing the input video into a number of short video clips, sampling the clips at fixed intervals to obtain N video base segments, and extracting a series of I3D basic features from each base segment with a pre-trained I3D model;
applying a boundary matching operator to all the I3D basic features contained in a video candidate window to obtain the visual feature representation of the window.

3. The method according to claim 2, characterized in that applying the boundary matching operator to obtain the visual feature representation of a video candidate window comprises:
performing bilinear interpolation over all the I3D basic features covered by the video candidate window (a, b) with start time a and end time b, and sampling K basic feature vectors, where K is a preset hyper-parameter;
passing the K basic feature vectors through a convolution layer with kernel size K and a nonlinear ReLU layer to obtain one feature vector as the visual feature representation of the video candidate window;
repeating the above process for all 1 ≤ a ≤ b ≤ N to obtain the feature vectors of all video candidate windows.

4. The method according to claim 1, characterized in that encoding the sentence with the language encoder to obtain the feature representation of the sentence comprises:
feeding a sentence sequence composed of word features into a long short-term memory network (LSTM) as input and extracting the feature representation of the sentence.

5. The method according to claim 1, characterized in that the training process of the visual locator and the visual-semantic locator comprises:
in the visual locator, passing the visual feature representations of the video candidate windows directly to a fully connected layer and a sigmoid layer to generate the predicted values p'_ab of a visual score map; in the visual-semantic locator, inputting the visual-semantic feature representations of the video candidate windows to a fully connected layer and a sigmoid layer to generate the predicted values p_ab of a visual-semantic score map; obtaining the localization results of the visual locator and the visual-semantic locator from the visual score map and the visual-semantic score map, respectively;
computing the loss functions of the visual locator and the visual-semantic locator separately, adjusting the weights of the training samples according to the localization result of the visual locator, and training the visual locator and the visual-semantic locator end to end to obtain the de-biased visual-semantic locator.

6. The method according to claim 5, characterized in that computing the loss functions of the visual locator and the visual-semantic locator separately, adjusting the weights of the training samples according to the localization result of the visual locator, and training the visual locator and the visual-semantic locator end to end comprises:
for each candidate window in the visual score map and the visual-semantic score map, computing the IoU score IoU_ab between its time boundary (a, b) and the annotation T, and then assigning a soft label gt_ab to IoU_ab according to the preset hyper-parameters μ_min and μ_max;
training the visual locator and the visual-semantic locator with the obtained soft labels gt_ab, and adjusting the weights of the training samples in the visual-semantic locator according to the localization result of the visual locator, which comprises:
training the visual locator using a cross-entropy function as its loss function;
computing the loss function of the visual-semantic locator before weight adjustment using the cross-entropy function;
computing the cosine similarity s between the predicted values p' = {p'_ab} of the visual locator and the ground truth gt = {gt_ab}, estimating the importance of the training sample as 1 - s^α from the cosine similarity s, and adjusting the loss function of the visual-semantic locator accordingly, where α is a hyper-parameter controlling the weight variation;
composing the final loss function from the two parts, the loss of the visual locator and the adjusted loss of the visual-semantic locator, and training the visual locator and the visual-semantic locator simultaneously, in an end-to-end fashion, according to the final loss, to mitigate the video single-modality preference in the model.

7. The method according to claim 6, characterized in that the assignment of the soft label gt_ab comprises:
(i) when IoU_ab ≤ μ_min, gt_ab = 0;
(ii) when μ_min ≤ IoU_ab ≤ μ_max, gt_ab is assigned an intermediate soft value between 0 and 1, where μ_min and μ_max are preset hyper-parameters;
(iii) when μ_max ≤ IoU_ab, gt_ab = 1.

8. A cross-scene video semantic localization device based on sample weight adjustment using the method according to any one of claims 1 to 7, characterized by comprising:
a video encoder for extracting visual feature representations of video candidate windows from an input video;
a language encoder for encoding the sentence to obtain its feature representation;
a visual locator for predicting a localization result from the visual feature representations of the video candidate windows only, and for learning the preference information of video clips from the training samples;
a visual-semantic locator for fusing the visual feature representation of each video candidate window with the feature representation of the sentence to obtain the visual-semantic feature representation of the window, and then predicting a localization result from it;
a sample weight adjustment module for adjusting the weights of the training samples according to the preference information learned by the visual locator, and for training the visual-semantic locator with the reweighted training samples to obtain the de-biased visual-semantic locator.

9. An electronic device, characterized by comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method according to any one of claims 1 to 7.
CN202111026168.6A | 2021-09-02 | 2021-09-02 | A cross-scene video semantic localization method and device based on sample weight adjustment | Active | CN115761560B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111026168.6A | 2021-09-02 | 2021-09-02 | CN115761560B (en): A cross-scene video semantic localization method and device based on sample weight adjustment

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111026168.6A | 2021-09-02 | 2021-09-02 | CN115761560B (en): A cross-scene video semantic localization method and device based on sample weight adjustment

Publications (2)

Publication NumberPublication Date
CN115761560A CN115761560A (en)2023-03-07
CN115761560Btrue CN115761560B (en)2025-07-15

Family

ID=85332130

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202111026168.6A | A cross-scene video semantic localization method and device based on sample weight adjustment (CN115761560B (en), Active) | 2021-09-02 | 2021-09-02

Country Status (1)

Country | Link
CN (1) | CN115761560B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN107403430A (en)* | 2017-06-15 | 2017-11-28 | Sun Yat-sen University | An RGB-D image semantic segmentation method
CN110188239A (en)* | 2018-12-26 | 2019-08-30 | Peking University | A two-stream video classification method and device based on a cross-modal attention mechanism

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111491187B (en)* | 2020-04-15 | 2023-10-31 | Tencent Technology (Shenzhen) Co., Ltd. | Video recommendation method, apparatus, device and storage medium
CN113111836B (en)* | 2021-04-25 | 2022-08-19 | Shandong Institute of Artificial Intelligence | Video analysis method based on cross-modal hash learning


Also Published As

Publication number | Publication date
CN115761560A (en) | 2023-03-07


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
