CN119918545A - A multimodal dialogue summarization method based on multi-level visual guidance - Google Patents

A multimodal dialogue summarization method based on multi-level visual guidance
Download PDF

Info

Publication number
CN119918545A
Authority
CN
China
Prior art keywords
visual
features
text
local
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411732284.3A
Other languages
Chinese (zh)
Inventor
张瑞
毕严先
鲍帆
常安
李嘉辰
罗敏
李思
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Science And Technology Group Co ltd Of Cetc
Electronic Science Research Institute Of China Electronics Technology Group Co ltd
Original Assignee
Science And Technology Group Co ltd Of Cetc
Electronic Science Research Institute Of China Electronics Technology Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Science And Technology Group Co ltd Of Cetc, Electronic Science Research Institute Of China Electronics Technology Group Co ltd
Priority to CN202411732284.3A
Publication of CN119918545A
Status: Pending

Abstract


The present application discloses a multi-level, visually guided multimodal dialogue summarization method in the fields of Internet and artificial intelligence technology. The method uses a pre-trained CLIP model to extract global and local features of the visual information contained in a dialogue, and a pre-trained T5 model to extract text features of the dialogue text, obtaining visual and text features rich in deep semantic information. A local multimodal attention-cross module and a global multimodal attention-cross module fuse and align the global and local visual information with the text features, and a modal fusion module fuses and concatenates the globally visually guided text features with the semantically guided local visual features, so that the multimodal dialogue information complements itself and the dialogue context is attended to, thereby improving the quality and accuracy of the generated summary.

Description

Multimodal dialogue summarization method based on multi-level visual guidance
Technical Field
The present application relates to the technical fields of the Internet and artificial intelligence, and in particular to a multimodal dialogue summarization method based on multi-level visual guidance.
Background
In recent years, with the rapid development of information technology and human society, massive volumes of information have posed unprecedented challenges. Conversation, an indispensable medium of human communication, is central to exchanging information, expressing views and conveying emotion, yet the explosive growth of information makes dialogues increasingly difficult for users to process. Faced with massive, multimodal dialogue information, users need to obtain the core content of a dialogue efficiently and accurately in order to support decisions or understand the relevant situation, and this has become a general requirement. At the same time, the deepening application of machine learning, deep learning and large language models to natural language processing in the field of artificial intelligence has improved the capacity to process dialogue information. Against this background, dialogue summarization techniques have developed. They are of great significance for the problem that key information is difficult to capture from large amounts of dialogue data, play a key role in improving the efficiency with which users acquire information and the accuracy with which they capture key content, show great potential in meeting records, customer-service systems, social media and other fields, and have broad application prospects.
The dialogue summarization task aims to extract key content from dialogue data and generate a concise summary that preserves the core information of the original text, so that a dialogue containing a great deal of information can be condensed into a short passage. With the rapid development of artificial intelligence, the generation and understanding of dialogue summaries have received wide attention and been applied in many scenarios, and the task has important roles and application value. It requires not only identifying and extracting the core information and capturing the semantic information hidden in the dialogue, but also a certain language-generation capability to ensure that the generated summary is easy to understand and accurately conveys the core information of the dialogue. Dialogue summarization is considered a major challenge among summarization tasks because speakers change, conversation topics shift constantly, the underlying logic of the dialogue must be captured, and the generated summary must be understandable to the user. In customer-service systems, dialogue summarization can automatically produce summaries of the conversations between agents and users, providing agents with quick references that help shorten response times and improve service quality, so that services can be improved and strategies optimized. For meeting records, it helps generate meeting overviews that present discussion points, decisions and action items concisely. In social media, where user interactions are rich and complex, dialogue summaries can automatically condense the core viewpoints of a discussion and user feedback, helping platforms better understand user needs and improving the efficiency of community management. Dialogue summarization is thus not only a key technology for information retrieval and information extraction but also provides important support for other text-summarization tasks, and is one of the most important tasks in summarization.
In a dialogue summarization task, features are extracted from the dialogue data to be processed, and a designed algorithm or a trained model enables a computer to summarize the data automatically. Traditionally there are two approaches: extractive summarization, which directly selects important sentences or fragments from the dialogue text and combines them into a concise summary, and abstractive summarization, which generates new sentences describing the core content of the dialogue by understanding and recombining it. However, conventional approaches may produce redundant information and find it difficult to accurately capture dialogue topics and contextual associations in long or complex dialogues. To address these problems, some researchers have incorporated deep-learning methods, such as sequence-to-sequence models and architectures based on RNNs and LSTMs, improving both the quality and the breadth of application of text summaries.
With the development of multimedia technology and the popularity of social-media platforms, the information generated by user dialogues has gradually become multimodal; a user dialogue on a social-media platform, for example, is usually a text conversation combined with video or images. In such scenarios, key information in a conversation can be captured more effectively only by combining the multimodal information. However, some existing dialogue-summarization methods are text-only and do not consider the contribution of visual information, so in multimodal scenarios the resulting summaries are somewhat one-sided and incomplete. In recent years, therefore, some researchers have proposed multimodal dialogue summarization methods, which aim to reflect the core content of a dialogue more comprehensively and accurately by jointly exploiting information from multiple modalities such as text, images and video. Unlike traditional dialogue summarization, multimodal dialogue summarization must combine the text with data from other modalities: after the features are obtained, the text features are fused with the features of the other modalities to obtain the final representation, which is then decoded to generate a more comprehensive and accurate summary. Compared with plain-text dialogue summarization, multimodal dialogue summarization faces two challenges: first, how to capture the implicit semantics of the different modalities so that dialogue information is not overlooked because of the complexity of multimodal input; and second, how to maintain the memory and contextual logic of the conversation and of the other modalities across multiple dialogue turns, to ensure that they contribute to summary generation.
To address the challenges of multimodal dialogue summarization described above, one class of methods fuses image and text features by building complex network structures, while another improves the quality of the generated text summary by using a pre-trained model.
As shown in fig. 1, a prior-art article on an aspect-oriented multimodal summarization model combines visual information with text information for text summarization. It proposes a multimodal pointer-generator network and aspect-oriented reward-enhanced maximum-likelihood training and, combined with an aspect-coverage mechanism, obtains strong summarization results:
First, for text embedding, the text data is converted into an embedded representation using a pre-trained word-vector model; the text embedding is fed into the model's encoder and encoded together with the visual features into a contextual representation. Second, visual features are extracted by ResNet and used to initialize the hidden states of the encoder and decoder, while local visual features extracted by Fast R-CNN are combined with the text embedding to generate a contextual representation for multimodal fusion with hierarchical attention. Third, an aspect-oriented reward-enhanced maximum-likelihood mechanism is introduced in training to ensure that the summary covers the important aspects of the text data; the model combines the text embedding and the visual information during training to generate summary content, and an aspect-coverage mechanism prevents the same aspect from being described repeatedly. Finally, an aspect-consistency strategy and constrained decoding prevent the same aspect from appearing repeatedly. The text embedding and the visual features cooperate during decoding to ensure that the generated content is consistent with the text data, improving the coherence and accuracy of the summary.
As shown in fig. 2, a prior-art article on a video-based multimodal summary generation task uses a dual interaction module with a conditional self-attention mechanism and a global attention mechanism to capture the global and local semantics of text and video and obtain a high-quality text summary:
First, the text data and the video data are encoded separately: the text is encoded with a bidirectional RNN to capture semantic information, while the video is split into several segments, frame features are extracted with ResNet, and the temporal dependencies between segments are encoded with a bidirectional RNN. Second, a dual interaction module realizes deep interaction; it contains a conditional self-attention mechanism and a global attention mechanism, where the conditional self-attention captures local semantic information within a video segment and highlights key content under the guidance of the text, and the global attention processes the high-level semantic relation between text and video to achieve their deep fusion. Third, a multimodal generator produces the text summary and selects a video cover frame: the summary is generated by using an editing-gate mechanism to fuse the video-aware text representation with the original text representation, and a pointer network avoids missing generated vocabulary; the cover-frame selection is based on a hierarchical video representation, candidate frames are scored with hierarchical attention, and the most representative frame is chosen as the cover. Finally, the whole model is optimized with a combined loss function comprising the negative log-likelihood loss of the text summary and a pairwise hinge loss for cover selection, so that multimodal generation and frame selection promote each other and the generation quality improves.
However, the prior-art aspect-oriented multimodal summarization model and video-based multimodal summary generation task have the following limitations:
1. In the multimodal summarization task, the visual-feature extraction method plays a key role in the quality of the generated summary, but the extraction methods used do not fully understand the internal semantics related to the text.
2. Previous work has focused mainly on feature filtering and modality fusion on the encoder side, while neglecting fusion on the decoder side.
This leads to the following disadvantages:
1. Because the visual-feature extraction is unrelated to the internal semantics of the text data, the extracted visual features are difficult to complement the text features semantically; adding them is not necessarily beneficial to summary generation and may introduce redundant information.
2. The lack of contextual understanding of the dialogue and of the informative content of the multimodal dialogue may result in loss of summary information, losing the emotional intent underlying the dialogue.
Disclosure of Invention
The embodiments of the present application provide a multimodal dialogue summarization method based on multi-level visual guidance. It uses a pre-trained CLIP model to extract global and local features of the visual information contained in a dialogue, uses a pre-trained T5 model to extract text features of the dialogue, and fuses and concatenates the globally visually guided text features with the semantically guided local visual features, so that the multimodal dialogue information complements itself and the dialogue context is attended to, thereby improving the quality and accuracy of the generated summary.
An embodiment of the present application provides a multimodal dialogue summarization method based on multi-level visual guidance, comprising the following steps:
Extracting text features and summary-text features of the multimodal dialogue with the encoder of a T5 model, obtaining global and local visual features from the images of the input multimodal dialogue data with the visual encoder of a CLIP model, and constructing indexes over the text and visual features;
Inputting the text features and the local visual features into a local multimodal attention-cross module, and inputting the text features into a long short-term memory (LSTM) network to obtain a hidden state containing global semantic information;
In the local multimodal attention-cross module, computing self-attention weights with a conditional self-attention mechanism, computing conditional weights for the local visual features from the hidden state containing global semantic information and the local visual features, weighting the local visual features by the conditional weights and the self-attention weights, and passing the result through a Transformer encoder to obtain semantically guided local visual features;
Inputting the text features and the global visual features into a global multimodal attention-cross module, and inputting the text features into an LSTM network to obtain text features that capture context information;
In the global multimodal attention-cross module, using a cross-modal multi-head attention mechanism and a cross-modal bilinear attention fusion mechanism, computing a weighted sum of the visually informed text features produced by the two mechanisms, and obtaining globally visually guided text features through a Transformer encoder;
Inputting the semantically guided local visual features and the globally visually guided text features into a modal fusion module, and using the semantically guided local visual features to guide the globally visually guided text features through a cross-modal attention mechanism, obtaining visually guided text features;
Inputting the bidirectionally enhanced visual-text features and the summary-text features into a Transformer decoder, converting the decoded high-dimensional vectors into vectors of the vocabulary dimension through a linear layer, feeding the result into a Softmax layer to obtain a probability distribution, and decoding with beam search;
During training, computing the cross-entropy loss between the probability distribution predicted by the model and the target summary text.
Optionally, extracting the text features and summary-text features of the multimodal dialogue with the encoder of the T5 model satisfies:
T_i = T5(t_i)
S_i = T5(s_i)
where t_i is the i-th dialogue input, s_i is the target summary of the i-th dialogue, T5 is the T5 encoder, T_i is the embedding vector of the dialogue input t_i, and S_i is the embedding vector of the summary text s_i.
Optionally, obtaining the global and local visual features with the visual encoder of the CLIP model satisfies:
Vf_i = CLIP_vis(V_i)
Vf_i = {V_a, V_1, …, V_P}
where Vf_i is the visual feature obtained with the CLIP visual encoder CLIP_vis, V_i is the input image, D_v is the visual-feature dimension, V_a is the global visual feature of the image, and V_1, …, V_P are the local visual features of the image.
Optionally, in the local multimodal attention-cross module, computing the self-attention weights with the conditional self-attention mechanism and computing the conditional weights of the local visual features from the hidden state containing global semantic information and the local visual features comprises:
obtaining the query Q, key K and value V of the local visual features through mapping matrices in the conditional self-attention mechanism of the local multimodal attention-cross module;
computing the self-attention weights from the query Q, key K and value V;
and computing the conditional weights of the local visual features from the hidden state containing global semantic information and the local visual features.
Optionally, inputting the text features into the long short-term memory (LSTM) network satisfies:
where h_t is the hidden state at time step t and c_t is the cell state at time step t; the hidden state of the last step is h_n, and the output is T_s.
Optionally, the computation of the self-attention weights and the conditional weights in the local multimodal attention-cross module satisfies:
where W ∈ R^{d×d} is a weight matrix, α_{i,j} are the self-attention weights, β_i are the conditional weights, and the weighted V_i passes through a Transformer encoder to obtain the semantically guided local visual features X_v.
Optionally, inputting the global visual features into the LSTM network to obtain visual features that capture global information satisfies:
where h_t is the hidden state at time step t and c_t is the cell state at time step t, and the output is V_s.
Optionally, in the global multimodal attention-cross module, using the cross-modal multi-head attention mechanism and the cross-modal bilinear attention fusion mechanism, computing a weighted sum of the visually informed text features produced by the two mechanisms, and obtaining the globally visually guided text features through a Transformer encoder comprises:
A = V_s B T_s
A = softmax(A)
T_2 = A T_s
T = αT_1 + (1 − α)T_2
where CMA denotes cross-modal multi-head attention, W is a weight matrix, B is the bilinear weight matrix, α is a trainable weight, and the output obtained through the Transformer encoder layer is X_t.
Optionally, using the semantically guided local visual features to guide the globally visually guided text features through a cross-modal attention mechanism to obtain visually guided text features, using the globally visually guided text features to guide the semantically guided local visual features to obtain textually guided visual features, and concatenating the visually guided text features with the textually guided visual features to obtain the bidirectionally enhanced visual-text features comprises:
O_enc = concat(T, V)
where T is the text feature guided by visual information, V is the visual feature guided by text information, concat is the concatenation operation, and O_enc is the bidirectionally enhanced visual-text feature.
Optionally, during training, computing the cross-entropy loss between the probability distribution predicted by the model and the target summary text comprises:
computing, during training, the cross-entropy loss between the probability distribution predicted by the model and the target summary text;
averaging the cross-entropy losses of all time steps as the overall average loss, back-propagating the computed average loss, and updating the model parameters with gradient descent to minimize the loss.
The cross-entropy loss between the predicted probability distribution and the target summary text satisfies:
L(θ) = −Σ_i y_i log p_i
where y_i is the true distribution of the target summary text, p_i is the probability distribution predicted by the model, and θ denotes the model parameters.
According to the embodiments of the present application, the pre-trained CLIP model is used to extract the global and local features of the visual information contained in the dialogue, the pre-trained T5 model is used to extract the text features of the dialogue, and the globally visually guided text features are fused and concatenated with the semantically guided local visual features, so that the multimodal dialogue information complements itself and the dialogue context is attended to, thereby improving the quality and accuracy of the generated summary.
The foregoing is only an overview of the present application, provided so that the technical means of the application can be more clearly understood and implemented in accordance with its teachings, and so that the above and other objects, features and advantages of the present application become more readily apparent.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a technical flow of an aspect-oriented multimodal summary model of the prior art;
FIG. 2 is a prior art technical flow of a video-based multimodal summary generation task;
FIG. 3 is a basic flow diagram of a multi-modal dialog summarization method based on multi-level visual guidance according to an embodiment of the present application;
fig. 4 is an application flow diagram of a multimodal dialogue summarization method based on multi-level visual guidance according to an embodiment of the present application on a GPLM.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Abbreviations and key terms appearing in this embodiment are defined as follows:
BP: Back Propagation;
CLIP: Contrastive Language-Image Pre-training model;
T5: Text-To-Text Transfer Transformer;
CMA: Cross-Modal multi-head Attention mechanism;
LSTM: Long Short-Term Memory network;
ViT: Vision Transformer;
GPLM: Generative Pre-trained Language Model.
The embodiments of the present application provide a multimodal dialogue summarization method based on multi-level visual guidance, which, in the model training stage shown in figs. 3 and 4, is implemented by executing the following steps.
Text features and summary-text features of the multimodal dialogue are extracted with the encoder of a T5 model. In some embodiments, the extracted text features and summary-text features satisfy:
T_i = T5(t_i)
S_i = T5(s_i)
where t_i is the i-th dialogue input, s_i is the target summary of the i-th dialogue, T5 is the T5 encoder, T_i is the embedding vector of the dialogue input t_i, and S_i is the embedding vector of the summary text s_i. In this embodiment, the text features are likewise encoded as 768-dimensional vectors.
The images of the input multimodal dialogue data are passed through the visual encoder of the CLIP model to obtain global and local visual features, and indexes over the text features and visual features are constructed.
In some embodiments, obtaining the global and local visual features with the visual encoder of the CLIP model satisfies:
Vf_i = CLIP_vis(V_i)
Vf_i = {V_a, V_1, …, V_P}
where Vf_i is the visual feature obtained with the CLIP visual encoder CLIP_vis, V_i is the input image, D_v is the visual-feature dimension, V_a is the global visual feature of the image, and V_1, …, V_P are the local visual features of the image.
In this embodiment, the pre-trained multimodal language model CLIP was pre-trained on a dataset of 400 million image-text pairs constructed by crawling public images and their accompanying descriptive text from the Internet, and can capture deep semantic information of the data. The output of the visual encoder is set to 7 × 7, and the selected architecture is the CLIP-ViT-base-32 model. Each image is resized to 224 × 224, and each of the 7 × 7 patches is encoded as a 768-dimensional vector.
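As a concrete illustration of this extraction step, the sketch below uses the Hugging Face transformers library with the t5-base and openai/clip-vit-base-patch32 checkpoints; the library and exact checkpoint names are assumptions, since the embodiment only names T5 and CLIP-ViT-base-32. The CLIP vision encoder returns one global token plus 7 × 7 = 49 patch tokens, each 768-dimensional, matching the dimensions described above.

```python
import torch
from PIL import Image
from transformers import (T5Tokenizer, T5EncoderModel,
                          CLIPImageProcessor, CLIPVisionModel)

# Text features via the T5 encoder (768-dim hidden states for t5-base).
tok = T5Tokenizer.from_pretrained("t5-base")
t5_enc = T5EncoderModel.from_pretrained("t5-base")
dialogue = "agent: How can I help? user: My order arrived damaged ..."
text_inputs = tok(dialogue, return_tensors="pt")
with torch.no_grad():
    T_i = t5_enc(**text_inputs).last_hidden_state        # [1, seq_len, 768]

# Visual features via the CLIP ViT-B/32 vision encoder: the image is resized
# to 224x224 and split into a 7x7 grid of 32x32 patches.
proc = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_vis = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("dialogue_image.jpg")                  # illustrative file name
pixels = proc(images=image, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    V_f = clip_vis(pixel_values=pixels).last_hidden_state  # [1, 1 + 49, 768]

V_a = V_f[:, :1, :]    # global visual feature (class token)
V_p = V_f[:, 1:, :]    # 49 local (patch) visual features
```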
The text features and the local visual features are input into the local multimodal attention-cross module, and the text features are input into a long short-term memory (LSTM) network to obtain the last-step hidden state containing global semantic information.
In some embodiments, inputting the text features into the LSTM network satisfies:
where h_t is the hidden state at time step t and c_t is the cell state at time step t; the hidden state of the last step is h_n, and the output is T_s.
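A minimal sketch of this step, assuming a single-layer, unidirectional nn.LSTM whose hidden size matches the 768-dimensional T5 features (layer count and directionality are not stated in the embodiment); T_i is the text-feature tensor from the previous sketch.

```python
import torch.nn as nn

# LSTM over the T5 text features; hidden size kept equal to the 768-dim input.
text_lstm = nn.LSTM(input_size=768, hidden_size=768, batch_first=True)

# T_i: [batch, seq_len, 768] text features from the T5 encoder.
T_s, (h_n, c_n) = text_lstm(T_i)   # T_s: hidden state at every time step
h_global = h_n[-1]                 # last-step hidden state with global semantics
```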
In the local multimodal attention-cross module, self-attention weights are computed with a conditional self-attention mechanism, conditional weights for the local visual features are computed from the hidden state containing global semantic information and the local visual features, the local visual features are weighted by the conditional weights and the self-attention weights, and the result passes through a Transformer encoder layer to obtain the semantically guided local visual features.
In some embodiments, in the local multimodal attention-cross module, computing the self-attention weights with the conditional self-attention mechanism and computing the conditional weights of the local visual features from the hidden state containing global semantic information and the local visual features comprises:
obtaining the query Q, key K and value V of the local visual features through mapping matrices in the conditional self-attention mechanism of the local multimodal attention-cross module;
computing the self-attention weights from the query Q, key K and value V;
and finally weighting the self-attention output of the local visual features by the conditional weights, and passing the result through a Transformer encoder layer to obtain the semantically guided local visual features.
In some embodiments, these computations satisfy:
where W ∈ R^{d×d} is a weight matrix, α_{i,j} are the self-attention weights, β_i are the conditional weights, and the weighted V_i passes through a Transformer encoder to obtain the semantically guided local visual features X_v.
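Because the conditional-weighting formula itself is not reproduced above, the sketch below shows only one plausible reading: self-attention over the patch features, with each patch additionally gated by a sigmoid score between the global text hidden state and that patch. The gating form, the head count and the layer count of the Transformer encoder are all assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalSelfAttention(nn.Module):
    """One reading of the local multimodal attention-cross module: self-attention
    over local visual (patch) features, gated by a condition weight derived from
    the global text hidden state, followed by a Transformer encoder."""
    def __init__(self, d=768):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)
        self.W = nn.Linear(d, d, bias=False)   # condition weight matrix W
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, V_p, h_global):
        # V_p: [B, P, d] local patch features; h_global: [B, d] text hidden state.
        Q, K, V = self.q(V_p), self.k(V_p), self.v(V_p)
        alpha = F.softmax(Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1)), dim=-1)
        attended = alpha @ V                                     # self-attention output
        beta = torch.sigmoid((self.W(V_p) * h_global.unsqueeze(1)).sum(-1))  # [B, P]
        weighted = beta.unsqueeze(-1) * attended                 # condition-weighted patches
        return self.encoder(weighted)                            # semantic-guided X_v
```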
The text features and the global visual features are input into the global multimodal attention-cross module; the text features are input into an LSTM network to obtain text features that capture context information, and the global visual features are input into an LSTM network to obtain visual features that capture global information.
In some embodiments, inputting the global visual features into the LSTM network to obtain visual features that capture global information satisfies:
where h_t is the hidden state at time step t and c_t is the cell state at time step t, and the output is V_s.
In the global multimodal attention-cross module, a cross-modal multi-head attention mechanism and a cross-modal bilinear attention fusion mechanism are used; the visually informed text features produced by the two mechanisms are combined in a weighted sum, and the globally visually guided text features are obtained through a Transformer encoder.
In some embodiments, this computation satisfies:
A = V_s B T_s
A = softmax(A)
T_2 = A T_s
T = αT_1 + (1 − α)T_2
where CMA denotes cross-modal multi-head attention, W is a weight matrix, B is the bilinear weight matrix, α is a trainable weight, and the output obtained through the Transformer encoder layer is X_t.
In a specific example, the number of attention heads in the global multimodal attention-cross module is set to 8.
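The definition of T_1 and the orientation of the bilinear product are not spelled out in the equations above, so the sketch below adopts one dimensionally consistent reading: T_1 is cross-modal multi-head attention with the text as query and the visual sequence as key/value, and the bilinear scores let each text token attend over the visual sequence. Both choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAttentionCross(nn.Module):
    """One reading of the global multimodal attention-cross module: cross-modal
    multi-head attention (CMA) and bilinear attention, fused with a trainable
    weight alpha and passed through a Transformer encoder layer."""
    def __init__(self, d=768, heads=8):
        super().__init__()
        self.cma = nn.MultiheadAttention(embed_dim=d, num_heads=heads, batch_first=True)
        self.B = nn.Parameter(torch.randn(d, d) * 0.02)   # bilinear weight matrix B
        self.alpha = nn.Parameter(torch.tensor(0.5))      # trainable fusion weight
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, T_s, V_s):
        # T_s: [B, L, d] context-aware text features; V_s: [B, M, d] global visual features.
        T1, _ = self.cma(query=T_s, key=V_s, value=V_s)   # cross-modal multi-head attention
        scores = T_s @ self.B @ V_s.transpose(-2, -1)     # bilinear scores, [B, L, M]
        A = F.softmax(scores, dim=-1)                     # each text token attends over vision
        T2 = A @ V_s                                      # bilinear read-out, [B, L, d]
        T = self.alpha * T1 + (1 - self.alpha) * T2       # T = alpha*T1 + (1-alpha)*T2
        return self.encoder(T)                            # globally guided X_t
```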
The semantically guided local visual features and the globally visually guided text features are input into the modal fusion module. Through a cross-modal attention mechanism, the semantically guided local visual features guide the globally visually guided text features to obtain visually guided text features, the globally visually guided text features guide the semantically guided local visual features to obtain textually guided visual features, and the visually guided text features are concatenated with the textually guided visual features to obtain the bidirectionally enhanced visual-text features.
In some embodiments, this fusion satisfies:
O_enc = concat(T, V)
where T is the text feature guided by visual information, V is the visual feature guided by text information, concat is the concatenation operation, and O_enc is the bidirectionally enhanced visual-text feature.
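A sketch of the modal fusion module, assuming both guiding directions are realized with standard multi-head cross-attention; the head count (8, mirroring the encoder-side setting) is an assumption.

```python
import torch
import torch.nn as nn

class ModalFusion(nn.Module):
    """Bidirectional cross-modal guidance followed by concatenation:
    O_enc = concat(T, V) along the sequence dimension."""
    def __init__(self, d=768, heads=8):
        super().__init__()
        self.v2t = nn.MultiheadAttention(d, heads, batch_first=True)  # vision guides text
        self.t2v = nn.MultiheadAttention(d, heads, batch_first=True)  # text guides vision

    def forward(self, X_t, X_v):
        # X_t: globally guided text features [B, L, d]
        # X_v: semantic-guided local visual features [B, P, d]
        T_guided, _ = self.v2t(query=X_t, key=X_v, value=X_v)  # text guided by visual info
        V_guided, _ = self.t2v(query=X_v, key=X_t, value=X_t)  # vision guided by text info
        return torch.cat([T_guided, V_guided], dim=1)          # bidirectionally enhanced O_enc
```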
The bidirectionally enhanced visual-text features and the summary-text features are input into a Transformer decoder. The decoded text features are mapped by a linear layer from the high-dimensional vector space to the vocabulary dimension, the resulting vectors are fed into a Softmax layer and converted into a probability distribution, and the final output is obtained with beam search. In this embodiment, the beam width is set to 5.
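A sketch of the decoding head, assuming a standard nn.TransformerDecoder over the fused memory and a linear projection onto the T5 vocabulary (32,128 tokens for t5-base); the causal mask and the beam-search loop (beam width 5 in the embodiment) are omitted, and only the per-step probability computation is shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 768, 32128        # 32128 is the t5-base vocabulary size

dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=6)   # layer count assumed
to_vocab = nn.Linear(d_model, vocab_size)                  # hidden state -> vocabulary logits

def next_token_probs(summary_embeds, O_enc):
    """Cross-attend the partial summary to the fused visual-text memory,
    project to the vocabulary and return the next-token distribution."""
    h = decoder(tgt=summary_embeds, memory=O_enc)
    logits = to_vocab(h[:, -1, :])          # logits for the next position
    return F.softmax(logits, dim=-1)        # probability distribution over the vocabulary
```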
During training, the cross-entropy loss between the probability distribution predicted by the model and the target summary text is computed; after training, the trained model is deployed and applied. The cross-entropy losses of all time steps are averaged as the overall average loss, the computed average loss is back-propagated, and the model parameters are updated with gradient descent to minimize the loss.
In this embodiment, the learning rate is set to 1e-5 and the batch size is set to 4.
In some embodiments, during training, computing the cross-entropy loss between the probability distribution predicted by the model and the target summary text comprises:
computing, during training, the cross-entropy loss between the predicted probability distribution and the target summary text, which satisfies:
L(θ) = −Σ_i y_i log p_i
where y_i is the true distribution of the target summary text, p_i is the probability distribution predicted by the model, and θ denotes the model parameters.
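A sketch of the training loop under the stated hyper-parameters (learning rate 1e-5, batch size 4); `model` and `loader` are hypothetical stand-ins for the complete summarization model and its data loader, and the choice of Adam as the gradient-descent optimizer is an assumption.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()   # averages the loss over all time steps
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

for dialogues, images, target_ids in loader:       # batch size 4 in the embodiment
    logits = model(dialogues, images, target_ids)  # [B, T, vocab] per-step scores
    loss = criterion(logits.reshape(-1, logits.size(-1)),   # cross entropy against
                     target_ids.reshape(-1))                # the target summary tokens
    optimizer.zero_grad()
    loss.backward()        # back-propagate the average loss
    optimizer.step()       # gradient-descent parameter update
```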
As shown in FIG. 4, the application example first uses the T5 and CLIP models to extract, respectively, the text features of the dialogue and the global and local visual features of the images. The model then feeds the text features and the local visual features into the local multimodal attention module, where a conditional self-attention mechanism computes weights to generate semantically guided local visual features; at the same time, the text features and the global visual features are input into the global multimodal attention module and fused through multi-head and bilinear attention mechanisms to generate globally visually guided text features. These features then guide and enhance each other in the bidirectional modal fusion module to obtain fused visual-text features. Finally, the model inputs the fused features into a Transformer decoder to obtain the probability distribution for generating the summary text. During training, the difference between the generated summary and the target summary is measured with the cross-entropy loss and the generated result is optimized toward the target summary with gradient descent; outside training, the dialogue summary text with the highest probability is selected as the output.
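Putting the sketches above together, one possible end-to-end forward pass is shown below; all class and function names refer to the illustrative code given earlier and are assumptions rather than the reference implementation.

```python
import torch.nn as nn

# Modules from the earlier sketches.
local_cross  = ConditionalSelfAttention(d=768)
global_cross = GlobalAttentionCross(d=768, heads=8)
fusion       = ModalFusion(d=768, heads=8)
text_lstm    = nn.LSTM(768, 768, batch_first=True)
vis_lstm     = nn.LSTM(768, 768, batch_first=True)

def forward(T_i, S_i, V_a, V_p):
    # Context-aware text features and the global semantic hidden state.
    T_s, (h_n, _) = text_lstm(T_i)
    h_global = h_n[-1]
    # Sequence view of the global visual features.
    V_s, _ = vis_lstm(V_a)
    # Multi-level visual guidance and fusion.
    X_v = local_cross(V_p, h_global)     # semantic-guided local visual features
    X_t = global_cross(T_s, V_s)         # globally visually guided text features
    O_enc = fusion(X_t, X_v)             # bidirectionally enhanced visual-text memory
    return next_token_probs(S_i, O_enc)  # next-token probability distribution
```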
For dialogue summarization tasks that contain visual information, the method uses the pre-trained CLIP model to extract the global and local features of the visual information contained in the dialogue and the pre-trained T5 model to extract the text features of the dialogue, obtaining comprehensive, deeper visual feature representations and semantically rich text feature representations. The local multimodal attention-cross module uses an LSTM network and a conditional self-attention mechanism to capture the deep semantic association between the local visual features and the text features, yielding semantically guided local visual features and strengthening the model's deep semantic expression of the visual information. The global multimodal attention-cross module uses an LSTM network and a fusion mechanism in which cross-modal multi-head attention runs in parallel with cross-modal bilinear attention to capture globally visually guided text features, ensuring the integrity and effectiveness of the information and improving summary quality. In the modal fusion module, the features guide each other through cross-modal attention, and the multi-head encoder-decoder attention layers of the Transformer are used for decoding. Together these components improve the quality of the generated dialogue summaries and provide a new strategy for multimodal dialogue summarization.
According to the embodiments of the present application, the pre-trained CLIP model is used to extract the global and local features of the visual information contained in the dialogue, the pre-trained T5 model is used to extract the text features of the dialogue, and the globally visually guided text features are fused and concatenated with the semantically guided local visual features, so that the multimodal dialogue information complements itself and the dialogue context is attended to, thereby improving the quality and accuracy of the generated summary.
It should be noted that, in the various embodiments of the present application, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the method of the above-described embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the preferred embodiment. Based on such understanding, the technical solution of the present application, or the part contributing to the prior art, may be embodied in the form of a software product stored on a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to perform the method according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Those of ordinary skill in the art may make many variations without departing from the spirit of the present application and the scope of the claims, all of which fall within the protection of the present application.

Claims (10)

CN202411732284.3A — 2024-11-29 — 2024-11-29 — A multimodal dialogue summarization method based on multi-level visual guidance — Pending — CN119918545A (en)

Priority Applications (1)

Application Number — Priority Date — Filing Date — Title
CN202411732284.3A — 2024-11-29 — 2024-11-29 — A multimodal dialogue summarization method based on multi-level visual guidance — CN119918545A (en)

Applications Claiming Priority (1)

Application Number — Priority Date — Filing Date — Title
CN202411732284.3A — 2024-11-29 — 2024-11-29 — A multimodal dialogue summarization method based on multi-level visual guidance — CN119918545A (en)

Publications (1)

Publication Number — Publication Date
CN119918545A — 2025-05-02

Family

ID=95501325

Family Applications (1)

Application Number — Title — Priority Date — Filing Date
CN202411732284.3A — Pending — CN119918545A (en) — 2024-11-29 — 2024-11-29 — A multimodal dialogue summarization method based on multi-level visual guidance

Country Status (1)

Country — Link
CN (1) — CN119918545A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication Number — Priority Date — Publication Date — Assignee — Title
CN120235250A (en)* — 2025-05-30 — 2025-07-01 — 浙江有鹿机器人科技有限公司 — A method and device for processing images and texts of a marked compression frame
CN120336493A (en)* — 2025-06-18 — 2025-07-18 — 西湖心辰(杭州)科技有限公司 — AI multimodal dialogue system based on multimodal recognition



Legal Events

Date — Code — Title — Description
PB01 — Publication
SE01 — Entry into force of request for substantive examination
