Disclosure of Invention
Aiming at the problems of single-mode preference, low cross-scene generalization capability, and low locating precision in video semantic locating, the invention aims to provide a training framework that prevents a model from overfitting to preference information in video clips, so that the model truly understands both modalities, the video and the sentence, simultaneously and performs semantic locating in the video according to the semantic information of both modalities. The core of the invention is to use two twin models at the same time: one model learns the preference information of video clips from the training data and is used to adaptively eliminate the video unimodal preference problem of the other model, thereby improving the cross-scene generalization capability and the locating capability for video clips.
Different from existing semantic locating models, which are trained with only a single model, the invention simultaneously uses two twin models sharing the same backbone network structure and uses them to eliminate the unimodal preference information (e.g., about visual content and time intervals) learned during training. The two twin models differ in that the first model reads only the video input and not the sentence, while the other model normally reads both the complete video input and the queried sentence. The first model learns the preference information and predicts a locating result from the video modality alone; the weights of the training samples are then adjusted according to the learned preference information, so that the training samples received by the second model no longer carry data preference information and cannot be guessed from the video modality alone, forcing the second model to understand the semantic information shared by the video and language modalities.
Specifically, the technical scheme adopted by the invention is as follows:
A cross-scene video semantic locating method based on sample weight adjustment comprises the following steps:
Extracting, with a video encoder, visual feature representations of video candidate windows from an input video;
Encoding the sentence by using a language encoder to obtain the feature representation of the sentence;
Fusing the visual feature representation of the video candidate window and the feature representation of the sentence to obtain the visual-semantic feature representation of the video candidate window;
Utilizing a visual locator to predict a locating result only according to the visual features of the video candidate window, learning the preference information of video clips from the training samples, and adjusting the weights of the training samples according to the learned preference information;
Utilizing a visual-semantic locator to predict a locating result according to the visual-semantic feature representation of the video candidate window, and training the visual-semantic locator with the weight-adjusted training samples to obtain a de-biased visual-semantic locator;
And for a video and a sentence to be located, performing video semantic locating by using the trained visual-semantic locator.
Further, the extracting, with the video encoder, visual feature representations of video candidate windows from the input video, includes:
The video encoder divides an input video into a plurality of short video clips, samples the clips at fixed intervals to obtain N video base segments, and extracts an I3D base feature from each video base segment by using a pre-trained I3D model;
and applying a boundary matching operator to all the I3D base features contained in the video candidate window to obtain the visual feature representation of the video candidate window.
Further, the applying the boundary matching operator to obtain the visual feature representation of the video candidate window includes:
performing bilinear interpolation on all the I3D base features covered by the video candidate window (a, b) with start time a and end time b, and sampling to obtain K base feature vectors, wherein K is a preset hyper-parameter;
passing the K base feature vectors through a convolution layer with kernel size K and a nonlinear ReLU layer to obtain one feature vector f^V_ab as the visual feature representation of the video candidate window;
repeating the above process for all (a, b) with 1 ≤ a ≤ b ≤ N to obtain the feature vectors {f^V_ab} of all video candidate windows.
Further, the encoding the sentence by the language encoder to obtain the feature representation of the sentence includes:
and taking a sentence sequence formed by a plurality of word features as input, sending it to a long short-term memory network (LSTM), and extracting the feature representation of the sentence.
Further, the training process of the visual localizer and the visual-semantic localizer includes:
In the visual locator, the visual feature representations of the video candidate windows are passed directly to a fully connected layer and a sigmoid layer to generate the predicted values of a visual score map; in the visual-semantic locator, the visual-semantic feature representations of the video candidate windows are input into a fully connected layer and a sigmoid layer to generate the predicted values of a visual-semantic score map of the candidate windows;
And calculating the loss functions of the visual locator and the visual-semantic locator respectively, adjusting the weights of the training samples according to the locating result of the visual locator, and training the visual locator and the visual-semantic locator end to end to obtain the de-biased visual-semantic locator.
Further, the calculating the loss functions of the visual locator and the visual-semantic locator respectively, adjusting the weights of the training samples according to the locating result of the visual locator, and training the visual locator and the visual-semantic locator end to end comprises:
for each candidate window in the visual and visual-semantic score maps, calculating the IoU score IoU_ab between its time boundary (a, b) and the annotation T, and then assigning a soft label gt_ab to IoU_ab according to the preset hyper-parameters μmin and μmax;
training the visual locator and the visual-semantic locator according to the obtained soft labels gt_ab, and adjusting the weights of the training samples in the visual-semantic locator according to the locating result of the visual locator, which comprises the following steps:
training the visual locator using the cross-entropy loss L_v, defined as L_v = −Σ_(a,b) [gt_ab·log p′_ab + (1 − gt_ab)·log(1 − p′_ab)];
calculating the un-weighted loss of the visual-semantic locator with the same cross-entropy form: L_vs = −Σ_(a,b) [gt_ab·log p_ab + (1 − gt_ab)·log(1 − p_ab)];
calculating the cosine similarity s between the predicted values p′ = {p′_ab} of the visual locator and the true values gt = {gt_ab}, estimating the importance 1 − s^α of the training sample from s, and adjusting the loss function of the visual-semantic locator to L′_vs = (1 − s^α)·L_vs, where α is a hyper-parameter controlling the weight variation;
composing the final loss function L = L_v + L′_vs from the two parts L_v and L′_vs, and training the visual locator and the visual-semantic locator simultaneously in an end-to-end fashion according to L, so as to mitigate the video unimodal preference in the model.
A sample weight adjustment-based cross-scene video semantic locating device employing the above method, comprising:
A video encoder for extracting visual feature representations of video candidate windows from an input video;
a language encoder for encoding the sentence to obtain the feature representation of the sentence;
a visual locator for predicting a locating result only according to the visual feature representation of the video candidate window and learning the preference information of video clips from the training samples;
a visual-semantic locator for fusing the visual feature representation of the video candidate window with the feature representation of the sentence to obtain the visual-semantic feature representation of the video candidate window, and then predicting the locating result according to this visual-semantic feature representation;
and a sample weight adjustment module for adjusting the weights of the training samples according to the preference information learned by the visual locator, and training the visual-semantic locator with the weight-adjusted training samples to obtain the de-biased visual-semantic locator.
Compared with the prior art, the invention has the advantages that:
(1) Enhanced generalization capability under cross-scene conditions. Conventional video semantic locating models suffer from the unimodal preference problem: because the data preference of the training set does not exist in a cross-scene data set, their generalization capability under the cross-scene setting is seriously damaged. After training with sample weight adjustment, the unimodal preference of the model is corrected, so its generalization advantage under cross-scene conditions is obvious.
(2) Annotating additional labels to balance the training data can help alleviate the unimodal preference of a video semantic locating model and improve its cross-scene generalization; however, such annotation consumes a large amount of extra human resources, and balanced video semantic locating data are difficult to obtain. The present method adaptively adjusts the weights and distribution of the training samples according to the distribution characteristics of the training data, correcting the unimodal preference problem of the video semantic locating model at the algorithm level rather than the data level.
(3) The training procedure is flexible and simple. The dual-model training framework eliminates the video unimodal preference problem and can be carried out flexibly end to end: the importance of the training samples is learned during training to adaptively rebalance the training data, and the influence of video clip preference on the video semantic locating model is removed automatically.
(4) Compared with existing semantic locating techniques, the test stage requires no extra computation or storage resources. Although the invention trains two semantic locating models simultaneously in the training stage, the visual locator is deleted in the test stage and only the single visual-semantic locator is kept; therefore, although the unimodal preference of the model has been eliminated, the computation in the test stage is the same as that of existing semantic locating techniques, and no extra computation or storage resources are needed.
Detailed Description
The invention will now be described in further detail by means of specific examples and the accompanying drawings.
The specific flow of the cross-scene video semantic locating method based on sample weight adjustment is shown in fig. 1, and the method comprises the following steps:
Step 1: the video encoder divides the input video into a plurality of short video clips and samples them at fixed intervals to obtain N video base segments. For each sampled video base segment, an I3D base feature is extracted using a pre-trained I3D model.
Step 2: the language encoder encodes each word in the sentence and sends the resulting sequence of word features as input to a long short-term memory network (LSTM) to extract the sentence feature.
Step 3: the visual feature representation of each video candidate window is constructed from the I3D base features, i.e., a boundary matching operator is applied to all the I3D base features contained in the candidate window to obtain the visual feature representation of the video candidate window.
Step 4: the visual feature representation of the video candidate window and the feature representation of the sentence are fused, and the visual-semantic feature of the video candidate window is constructed from their interaction.
Step 5: in the visual locator, the visual features of each video candidate window are passed directly to a fully connected layer and a sigmoid layer to generate the predicted values of the visual score map. Meanwhile, the visual-semantic features of the video candidate windows are input into another fully connected layer and sigmoid layer to generate the predicted values of the visual-semantic score map of the candidate windows. The locating results of the two locators are obtained from the visual score map and the visual-semantic score map, respectively.
Step 6: the loss functions of the visual locator and the visual-semantic locator are calculated respectively, and the weights of the training samples are adjusted according to the locating result of the visual locator. The two locators are trained end to end with the Adam algorithm, and the de-biased visual-semantic locator is obtained using the weight-adjusted training samples.
Step 7: the visual locator is discarded in the test stage, and semantic locating is performed using only the visual-semantic locator that was de-biased during training.
As shown in FIG. 1, the method learns the importance of the training samples based on 5 basic modules, thereby rebalancing the training data and eliminating the influence of video clip preference on the video semantic locating model. The names and functions of these 5 basic modules are as follows:
1. Language encoder: encodes the sentence to be located and obtains a feature representation of the sentence, which is used to effectively retrieve the video segments of interest described by the sentence in the video.
2. Video encoder: extracts the temporal and spatial visual features of the portions covered by the video candidate windows from the original video frames, obtaining feature representations that characterize the visual structure of the different video candidate windows.
3. Visual locator: predicts a locating result based only on the candidate-window visual features extracted by the video encoder, without the sentence to be queried. The visual locator learns the video clip preference information from the training data and is used to further adjust the loss function of the visual-semantic locator.
4. Visual-semantic locator: takes both modalities, the video and the sentence to be queried, as complete input and predicts the locating result according to the visual-semantic feature representation of the video candidate window.
5. Sample weight adjustment module: predicts the importance of each training sample from the prediction output of the visual locator and adjusts the weight of that training sample in the loss function of the visual-semantic locator accordingly, so that the data preference in the training samples received by the visual-semantic locator is alleviated, i.e., the de-biasing is realized.
The implementation of the steps of the present invention is specifically described below.
1. Video preprocessing
This step preprocesses the long, untrimmed video to be located: it cuts the long video into a plurality of short video clips, samples them, and extracts their visual features for the subsequent semantic understanding and locating. An I3D model pre-trained on the Kinetics dataset is used to extract the visual features of the video. The preprocessing process is shown in fig. 2 and comprises the following steps:
(Step 1) dividing the input video into a plurality of video clips, each containing T frames.
(Step 2) sampling the video clips at fixed intervals to obtain N video base segments.
(Step 3) extracting an I3D base feature vector from each sampled video base segment using the pre-trained I3D model. In total, N I3D base features V = {v_i} (i = 1..N) are obtained, representing the visual content of video base segments 1 to N.
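As an illustration of the preprocessing above, the following is a minimal sketch in PyTorch; it assumes the pre-trained I3D network is available as a callable i3d_model that maps a clip tensor of shape (1, 3, T, H, W) to a single feature vector, and the function name and dimensions are hypothetical rather than part of the invention.

```python
import torch

def extract_base_segment_features(frames: torch.Tensor, i3d_model, T: int, N: int) -> torch.Tensor:
    """frames: (num_frames, 3, H, W) decoded video frames. Returns V = {v_i}, i = 1..N, shape (N, D)."""
    num_frames = frames.shape[0]
    # Step 1: divide the video into consecutive clips of T frames each.
    clips = [frames[s:s + T] for s in range(0, num_frames - T + 1, T)]
    # Step 2: sample the clips at a fixed interval so that (at most) N base segments remain.
    stride = max(len(clips) // N, 1)
    sampled = clips[::stride][:N]
    # Step 3: run the pre-trained I3D model on each sampled base segment.
    feats = []
    with torch.no_grad():
        for clip in sampled:
            inp = clip.permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, T, H, W), as I3D expects
            feats.append(i3d_model(inp).squeeze(0))
    return torch.stack(feats)  # base features V, shape (N, D)
```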
2. Sentence feature extraction
To locate the semantic content described by a sentence in a video, the sentence to be queried must first have its semantic features extracted and vectorized. The method adopts a word vector model pre-trained on large-scale text data and a long short-term memory (LSTM) network to extract the semantic features of the sentence. Let the number of words in the sentence be Vs; the feature extraction process is shown in fig. 3 and comprises the following steps:
(Step 1) a word vector (word embedding) model pre-trained on large-scale text data encodes each word in the sentence, yielding one word feature vector w_i per word. In total, Vs word features are extracted from the sentence, which can be expressed as the sentence sequence {w_i} (i = 1..Vs).
(Step 2) the sentence sequence {w_i} (i = 1..Vs) is fed sequentially into the LSTM network, and the last hidden state of the LSTM network is taken as output.
(Step 3) the last hidden state of the LSTM network is input into a fully connected layer to extract the final sentence feature f_S.
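A minimal sketch of the language encoder above, assuming the sentence has already been tokenized into word indices and the pre-trained word vectors are loaded into a tensor pretrained_vectors (vocab_size × embed_dim); the hidden and output dimensions are illustrative, not those used by the invention.

```python
import torch
import torch.nn as nn

class LanguageEncoder(nn.Module):
    def __init__(self, pretrained_vectors: torch.Tensor, hidden_dim: int = 512, out_dim: int = 512):
        super().__init__()
        # Step 1: pre-trained word embedding, one vector w_i per word.
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        # Step 2: LSTM over the word sequence {w_i}, i = 1..Vs.
        self.lstm = nn.LSTM(pretrained_vectors.shape[1], hidden_dim, batch_first=True)
        # Step 3: fully connected layer producing the sentence feature f_S.
        self.fc = nn.Linear(hidden_dim, out_dim)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        w = self.embedding(word_ids)       # (batch, Vs, embed_dim)
        _, (h_last, _) = self.lstm(w)      # last hidden state of the LSTM
        return self.fc(h_last[-1])         # f_S: (batch, out_dim)
```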
3. Generating visual features for candidate windows
Starting from the I3D features of the video base segments obtained by preprocessing, the visual features of the candidate windows can be generated; these features describe and represent the video visual content covered by each candidate window. The method applies a boundary matching operator (Boundary Matching operation, BM) to all the I3D base features contained within a candidate window to obtain the visual feature representation of the video candidate window.
The BM operator efficiently generates candidate-window features from the video base segments through a series of bilinear sampling and convolution operations. For a video candidate window (a, b) with start time a and end time b, the specific steps of the BM operator are as follows:
(Step 1) performing bilinear interpolation on all the I3D base features covered by (a, b) and sampling K base feature vectors, where K is a preset hyper-parameter.
(Step 2) passing the K base feature vectors through a convolution layer with kernel size K and a ReLU nonlinearity to obtain one feature vector f^V_ab as the visual feature representation of this video candidate window.
(Step 3) repeating the above process for all (a, b) with 1 ≤ a ≤ b ≤ N to obtain the feature vectors {f^V_ab} of all video candidate windows, where N is the number of video base segments.
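A possible sketch of the BM operator under the above description, assuming the base-segment features V have shape (N, D); the output dimension, the 1-based window indices, and the use of 1-D linear interpolation along the temporal axis to realize the bilinear sampling are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryMatching(nn.Module):
    def __init__(self, feat_dim: int, K: int, out_dim: int = 512):
        super().__init__()
        self.K = K
        # Step 2: a convolution with kernel size K plus ReLU collapses the K sampled vectors into one.
        self.conv = nn.Conv1d(feat_dim, out_dim, kernel_size=K)

    def window_feature(self, V: torch.Tensor, a: int, b: int) -> torch.Tensor:
        # Step 1: interpolate the base features covered by window (a, b) to exactly K vectors.
        span = V[a - 1:b].t().unsqueeze(0)                               # (1, D, b-a+1)
        sampled = F.interpolate(span, size=self.K, mode='linear', align_corners=True)
        return F.relu(self.conv(sampled)).squeeze(-1).squeeze(0)         # f^V_ab: (out_dim,)

    def forward(self, V: torch.Tensor) -> torch.Tensor:
        # Step 3: repeat for all windows with 1 <= a <= b <= N.
        N = V.shape[0]
        feats = torch.zeros(N, N, self.conv.out_channels)
        for a in range(1, N + 1):
            for b in range(a, N + 1):
                feats[a - 1, b - 1] = self.window_feature(V, a, b)
        return feats  # the map {f^V_ab}
```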
4. Generating visual-semantic features for candidate windows
The visual-semantic feature of each candidate window is generated by fusing the features of the visual and language modalities, as follows:
(Step 1) for candidate window (a, b), the visual feature f^V_ab of the candidate window interacts with the language feature f_S, i.e., f_S is multiplied point-wise with f^V_ab to obtain the visual-semantic feature M′_ab.
(Step 2) the visual-semantic feature M′_ab is normalized by its L2 norm, M_ab = M′_ab / ||M′_ab||_2, to obtain the visual-semantic feature M_ab of the candidate window.
(Step 3) the above process is repeated for all candidate windows (a, b) to obtain their visual-semantic features {M_ab}.
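The fusion of Steps 1-3 can be written compactly as below; this is a sketch that assumes the window feature map has shape (N, N, D) and the sentence feature f_S has shape (D,).

```python
import torch
import torch.nn.functional as F

def fuse_visual_semantic(window_feats: torch.Tensor, f_S: torch.Tensor) -> torch.Tensor:
    # Step 1: point-wise multiplication of f_S with every window feature f^V_ab.
    fused = window_feats * f_S               # broadcasts over all (a, b)
    # Steps 2-3: L2-normalize each fused vector to obtain the full map {M_ab}.
    return F.normalize(fused, p=2, dim=-1)   # (N, N, D)
```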
5. Visual locator and visual-semantic locator for locating
In the method, two twin locator models, a visual locator and a visual-semantic locator, are used simultaneously to locate the video clip. The visual locator guesses the most likely segment directly from the visual features {f^V_ab} of all candidate windows according to the unimodal preference information of the video, without the sentence to be queried, and thus learns the annotation's unimodal preference from the training data. The visual-semantic locator takes the visual-semantic features of the candidate windows as input and locates the segment described by the sentence by understanding the content of both modalities, video and sentence, simultaneously.
The process of locating with a visual locator and a visual-semantic locator, respectively, is described as follows:
For the visual locator, the visual feature f^V_ab of each video candidate window is input to a fully connected layer and a sigmoid layer, directly generating the predicted value p′_ab of the visual score map for each candidate window (a, b). The candidate window argmax_(a,b) p′_ab corresponding to the maximum value is used as the final visual locating result. The visual-semantic locator locates the video clip in a similar manner and differs from the visual locator only in its input features: it takes the visual-semantic feature M_ab of each video candidate window, which is passed to another fully connected layer and sigmoid layer, generating the predicted value p_ab of the visual-semantic score map for each candidate window (a, b). The candidate window argmax_(a,b) p_ab corresponding to the maximum value is used as the final visual-semantic locating result.
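A minimal sketch of the two twin locator heads and the arg-max selection described above; both heads are a fully connected layer followed by a sigmoid and differ only in their input features. The feature dimension and the 1-based (a, b) convention are assumptions.

```python
import torch
import torch.nn as nn

class LocatorHead(nn.Module):
    """Shared head structure used by both the visual and the visual-semantic locator."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, N, D) -> score map (N, N), one score per candidate window (a, b).
        return torch.sigmoid(self.fc(feats)).squeeze(-1)

def best_window(score_map: torch.Tensor):
    # argmax_(a,b) of the score map, returned as 1-based (a, b) indices.
    idx = torch.argmax(score_map).item()
    a, b = divmod(idx, score_map.shape[1])
    return a + 1, b + 1
```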
6. Sample weight adjustment in the training process
6.1 Determining the labels of training samples
In the training phase, each training sample contains an input video V, a sentence S, and the time segment annotation T corresponding to the sentence. During training, it is necessary to determine which candidate windows in the visual-semantic score map correspond to the true annotation and to train the model accordingly. First, the IoU score between each candidate window and the annotated time segment is calculated, and a soft label designates, according to this IoU score, the extent to which the candidate window belongs to the true result. For each candidate window in the visual and visual-semantic score maps, the IoU score IoU_ab is calculated between its time boundary (a, b) and the annotation T. Then, according to the preset hyper-parameters μmin and μmax, a soft label gt_ab is assigned to IoU_ab; the assignment process is shown in fig. 4 and is as follows:
(i) when IoU_ab ≤ μmin, gt_ab = 0;
(ii) when μmin ≤ IoU_ab ≤ μmax, gt_ab = (IoU_ab − μmin) / (μmax − μmin);
(iii) when μmax ≤ IoU_ab, gt_ab = 1.
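A sketch of the soft-label assignment (i)-(iii) above, assuming the IoU map IoU_ab is precomputed as a tensor and that the transition between μmin and μmax is the linear rescaling implied by the boundary conditions.

```python
import torch

def soft_labels(iou: torch.Tensor, mu_min: float, mu_max: float) -> torch.Tensor:
    # Rescale IoU_ab linearly between mu_min and mu_max, then clamp so that
    # gt_ab = 0 below mu_min and gt_ab = 1 above mu_max.
    gt = (iou - mu_min) / (mu_max - mu_min)
    return gt.clamp(min=0.0, max=1.0)
```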
6.2 Calculating the loss function and adjusting the sample weights
In the sample weight adjustment module, the visual locator and the visual-semantic locator are trained according to the obtained soft labels gt_ab, while the weight of the training sample in the visual-semantic locator is adjusted according to the locating result of the visual locator. The process is shown in fig. 5 and includes the following steps:
(Step 1) training the visual locator with a conventional cross-entropy loss L_v, defined for each training sample as L_v = −Σ_(a,b) [gt_ab·log p′_ab + (1 − gt_ab)·log(1 − p′_ab)].
(Step 2) similarly to Step 1, calculating the un-weighted loss of the visual-semantic locator with the conventional cross-entropy function: L_vs = −Σ_(a,b) [gt_ab·log p_ab + (1 − gt_ab)·log(1 − p_ab)].
(Step 3) calculating the cosine similarity between the visual locator prediction p′ = {p′_ab} and the true value gt = {gt_ab}: s = ⟨p′, gt⟩ / (||p′||·||gt||).
(Step 4) estimating the importance 1 − s^α of the training sample from the cosine similarity, and adjusting the loss function of the visual-semantic locator to L′_vs = (1 − s^α)·L_vs, where α is a hyper-parameter that controls the weight change.
(Step 5) composing the final loss function from the two balanced parts, the visual locator loss L_v and the adjusted visual-semantic locator loss L′_vs: L = L_v + L′_vs.
(Step 6) using the Adam algorithm as the optimizer, training the visual locator and the visual-semantic locator simultaneously in an end-to-end fashion according to the loss function L, so as to mitigate the video unimodal preference in the model.
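The loss computation and sample weight adjustment of Steps 1-6 may be sketched as follows for one training sample; the (N, N) score maps p_vis and p_vs, the value of α, the Adam settings, and the choice to detach the weight from the gradient are illustrative assumptions rather than prescriptions of the invention.

```python
import torch
import torch.nn.functional as F

def dual_locator_loss(p_vis: torch.Tensor, p_vs: torch.Tensor,
                      gt: torch.Tensor, alpha: float) -> torch.Tensor:
    # Steps 1-2: cross-entropy losses of the two locators against the soft labels gt_ab.
    loss_v = F.binary_cross_entropy(p_vis, gt)
    loss_vs = F.binary_cross_entropy(p_vs, gt)
    # Step 3: cosine similarity s between the visual locator prediction and gt.
    s = F.cosine_similarity(p_vis.flatten(), gt.flatten(), dim=0)
    # Step 4: sample importance 1 - s^alpha re-weights the visual-semantic loss
    # (detaching s here is an implementation choice, not specified in the text).
    weight = 1.0 - s.detach().clamp(min=0.0) ** alpha
    # Step 5: final loss L = L_v + L'_vs.
    return loss_v + weight * loss_vs

# Step 6 (usage sketch): end-to-end optimization with Adam.
# optimizer = torch.optim.Adam(list(visual_locator.parameters()) +
#                              list(vs_locator.parameters()), lr=1e-4)
# loss = dual_locator_loss(p_vis, p_vs, gt, alpha=2.0)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```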
7. Test phase
The training phase described above involves two locators, the visual locator and the visual-semantic locator, whereas the test phase omits the visual locator and involves only the visual-semantic locator. After training with the method, the visual-semantic locator has been adjusted through the training-sample weights, so its video unimodal preference is reduced. Therefore, the biased visual locator can be deleted in the test stage, and only the de-biased visual-semantic locator is used for video semantic locating. The visual-semantic locator outputs the prediction scores p_ab for all candidate windows; the maximum max_(a,b) p_ab is taken, and the corresponding time segment argmax_(a,b) p_ab is used as the final locating result.
The invention evaluates the locating precision of the model under the cross-scene setting; the most common evaluation index is R@K, θ, which is the fraction of queries for which at least one of the top-K locating results has a region overlap (IoU) with the true value exceeding θ. The commonly used video semantic locating datasets are ActivityNet Captions, Charades-STA, and DiDeMo. The invention trains and evaluates models on datasets of two different scenarios. Given the large scale of ActivityNet Captions and the variety of scenes and activities it covers, ActivityNet Captions is used as the training set, after which models are tested on Charades-STA and DiDeMo (referred to as AcNet2Charades and AcNet2DiDeMo, respectively). Tables 1 and 2 compare the locating accuracy with other existing methods under the cross-scene settings AcNet2Charades and AcNet2DiDeMo.
Table 1. Comparison of model performance under the cross-scene setting AcNet2Charades
| Method | R@1,IoU=0.5 | R@1,IoU=0.7 | R@5,IoU=0.5 | R@5,IoU=0.7 |
| PFGA | 5.75 | 1.53 | - | - |
| SCDM | 15.91 | 6.19 | 54.04 | 30.39 |
| 2D-TAN | 15.81 | 6.30 | 59.06 | 31.53 |
| The invention | 21.45 | 10.38 | 62.34 | 32.90 |
Table 2. Comparison of model performance under the cross-scene setting AcNet2DiDeMo
| Method | R@1,IoU=0.5 | R@1,IoU=0.7 | R@5,IoU=0.5 | R@5,IoU=0.7 |
| PFGA | 6.24 | 2.01 | - | - |
| SCDM | 10.88 | 4.34 | 43.30 | 18.40 |
| 2D-TAN | 12.50 | 5.50 | 44.88 | 20.73 |
| The invention | 13.11 | 7.70 | 44.98 | 21.32 |
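For reference, a sketch of how the R@K, θ metric reported in Tables 1 and 2 may be computed; the input format (per query, a list of predicted (start, end) windows ranked by score, plus the ground-truth segment) is an assumption.

```python
from typing import List, Tuple

def temporal_iou(p: Tuple[float, float], g: Tuple[float, float]) -> float:
    inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
    union = (p[1] - p[0]) + (g[1] - g[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(preds: List[List[Tuple[float, float]]],
                gts: List[Tuple[float, float]], k: int, theta: float) -> float:
    # Fraction of queries whose top-K predictions contain at least one window
    # with IoU >= theta against the ground-truth segment.
    hits = sum(1 for ranked, gt in zip(preds, gts)
               if any(temporal_iou(p, gt) >= theta for p in ranked[:k]))
    return hits / len(gts)
```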
The invention adjusts the weights of different training samples in the loss function according to the cosine similarity s, here to 1 − s^α. In fact, the weight adjustment in the loss function is not limited to this form: the training data distribution can be balanced as long as training samples with low similarity are given a higher weight in the loss function and samples with high similarity are given a lower weight.
Based on the same inventive concept, another embodiment of the present invention provides a cross-scene video semantic locating device based on sample weight adjustment using the above method, which includes:
A video encoder for extracting visual feature representations of video candidate windows from an input video;
a language encoder for encoding the sentence to obtain the feature representation of the sentence;
a visual locator for predicting a locating result only according to the visual feature representation of the video candidate window and learning the preference information of video clips from the training samples;
a visual-semantic locator for fusing the visual feature representation of the video candidate window with the feature representation of the sentence to obtain the visual-semantic feature representation of the video candidate window, and then predicting the locating result according to this visual-semantic feature representation;
and a sample weight adjustment module for adjusting the weights of the training samples according to the preference information learned by the visual locator, and training the visual-semantic locator with the weight-adjusted training samples to obtain the de-biased visual-semantic locator.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (a computer, server, smart phone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the inventive method.
Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, implements the steps of the inventive method.
The above-disclosed embodiments of the present invention are intended to aid the understanding of the invention and to enable it to be carried into practice, and it will be understood by those of ordinary skill in the art that various alternatives, variations and modifications are possible without departing from the spirit and scope of the invention. The invention should not be limited to what is disclosed in the embodiments of the specification; its scope of protection is defined by the claims.