CN119918545A - A multimodal dialogue summarization method based on multi-level visual guidance - Google Patents

A multimodal dialogue summarization method based on multi-level visual guidance
Download PDF

Info

Publication number
CN119918545A
Authority
CN
China
Prior art keywords
visual
features
text
local
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411732284.3A
Other languages
Chinese (zh)
Inventor
张瑞
毕严先
鲍帆
常安
李嘉辰
罗敏
李思
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Science And Technology Group Co ltd Of Cetc
Electronic Science Research Institute Of China Electronics Technology Group Co ltd
Original Assignee
Science And Technology Group Co ltd Of Cetc
Electronic Science Research Institute Of China Electronics Technology Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Science And Technology Group Co ltd Of Cetc, Electronic Science Research Institute Of China Electronics Technology Group Co ltd
Priority to CN202411732284.3A
Publication of CN119918545A
Status: Pending

Abstract


The present application discloses a multi-level, visually guided multimodal dialogue summarization method in the fields of Internet and artificial intelligence technology. The method uses a pre-trained CLIP model to extract global and local features of the visual information contained in a dialogue, and a pre-trained T5 model to extract text features of the dialogue text, obtaining visual and text features rich in deep semantic information. A local multimodal attention-cross module and a global multimodal attention-cross module fuse and align the global and local visual information with the text features, and a modal fusion module fuses and concatenates the globally visually guided text features with the semantically guided local visual features, so that the multimodal dialogue information complements itself and the dialogue context is attended to, thereby improving the quality and accuracy of the generated summary.

Description

Multimodal dialogue summarization method based on multi-level visual guidance
Technical Field
The present application relates to the technical fields of the Internet and artificial intelligence, and in particular to a multimodal dialogue summarization method based on multi-level visual guidance.
Background
In recent years, with the rapid development of information technology and human society, massive volumes of information have posed unprecedented challenges. Conversation, an indispensable medium of human communication, is central to exchanging information, expressing views and conveying emotion, yet the explosive growth of information makes dialogues increasingly difficult for users to process. Faced with massive, multimodal dialogue information, users need to obtain the core content of a dialogue efficiently and accurately in order to support decisions or understand the relevant situation, and this has become a general requirement. At the same time, the deepening application of machine learning, deep learning and large language models to natural language processing in the field of artificial intelligence has improved the capacity to process dialogue information. Against this background, dialogue summarization techniques have developed. They are of great significance for the problem that key information is difficult to capture from large amounts of dialogue data, play a key role in improving the efficiency with which users acquire information and the accuracy with which they capture key content, show great potential in meeting records, customer-service systems, social media and other fields, and have broad application prospects.
The dialogue summarization task aims to extract key content from dialogue data and generate a concise summary that preserves the core information of the original text, so that a dialogue containing a great deal of information can be condensed into a short passage. With the rapid development of artificial intelligence, the generation and understanding of dialogue summaries have received wide attention and been applied in many scenarios, and the task has important roles and application value. It requires not only identifying and extracting the core information and capturing the semantic information hidden in the dialogue, but also a certain language-generation capability to ensure that the generated summary is easy to understand and accurately conveys the core information of the dialogue. Dialogue summarization is considered a major challenge among summarization tasks because speakers change, conversation topics shift constantly, the underlying logic of the dialogue must be captured, and the generated summary must be understandable to the user. In customer-service systems, dialogue summarization can automatically produce summaries of the conversations between agents and users, providing agents with quick references that help shorten response times and improve service quality, so that services can be improved and strategies optimized. For meeting records, it helps generate meeting overviews that present discussion points, decisions and action items concisely. In social media, where user interactions are rich and complex, dialogue summaries can automatically condense the core viewpoints of a discussion and user feedback, helping platforms better understand user needs and improving the efficiency of community management. Dialogue summarization is thus not only a key technology for information retrieval and information extraction but also provides important support for other text-summarization tasks, and is one of the most important tasks in summarization.
In a dialogue summarization task, features are extracted from the dialogue data to be processed, and a designed algorithm or a trained model enables a computer to summarize the data automatically. Traditionally there are two approaches: extractive summarization, which directly selects important sentences or fragments from the dialogue text and combines them into a concise summary, and abstractive summarization, which generates new sentences describing the core content of the dialogue by understanding and recombining it. However, conventional approaches may produce redundant information and find it difficult to accurately capture dialogue topics and contextual associations in long or complex dialogues. To address these problems, some researchers have incorporated deep-learning methods, such as sequence-to-sequence models and architectures based on RNNs and LSTMs, improving both the quality and the breadth of application of text summaries.
With the development of multimedia technology and the popularity of social-media platforms, the information generated by user dialogues has gradually become multimodal; a user dialogue on a social-media platform, for example, is usually a text conversation combined with video or images. In such scenarios, key information in a conversation can be captured more effectively only by combining the multimodal information. However, some existing dialogue-summarization methods are text-only and do not consider the contribution of visual information, so in multimodal scenarios the resulting summaries are somewhat one-sided and incomplete. In recent years, therefore, some researchers have proposed multimodal dialogue summarization methods, which aim to reflect the core content of a dialogue more comprehensively and accurately by jointly exploiting information from multiple modalities such as text, images and video. Unlike traditional dialogue summarization, multimodal dialogue summarization must combine the text with data from other modalities: after the features are obtained, the text features are fused with the features of the other modalities to obtain the final representation, which is then decoded to generate a more comprehensive and accurate summary. Compared with plain-text dialogue summarization, multimodal dialogue summarization faces two challenges: first, how to capture the implicit semantics of the different modalities so that dialogue information is not overlooked because of the complexity of multimodal input; and second, how to maintain the memory and contextual logic of the conversation and of the other modalities across multiple dialogue turns, to ensure that they contribute to summary generation.
To address the challenges of multimodal dialogue summarization described above, one class of methods fuses image and text features by building complex network structures, while another improves the quality of the generated text summary by using a pre-trained model.
As shown in fig. 1, a prior-art article on an aspect-oriented multimodal summarization model combines visual information with text information for text summarization. It proposes a multimodal pointer-generator network and aspect-oriented reward-enhanced maximum-likelihood training and, combined with an aspect-coverage mechanism, obtains strong summarization results:
First, for text embedding, the text data is converted into an embedded representation using a pre-trained word-vector model; the text embedding is fed into the model's encoder and encoded together with the visual features into a contextual representation. Second, visual features are extracted by ResNet and used to initialize the hidden states of the encoder and decoder, while local visual features extracted by Fast R-CNN are combined with the text embedding to generate a contextual representation for multimodal fusion with hierarchical attention. Third, an aspect-oriented reward-enhanced maximum-likelihood mechanism is introduced in training to ensure that the summary covers the important aspects of the text data; the model combines the text embedding and the visual information during training to generate summary content, and an aspect-coverage mechanism prevents the same aspect from being described repeatedly. Finally, an aspect-consistency strategy and constrained decoding prevent the same aspect from appearing repeatedly. The text embedding and the visual features cooperate during decoding to ensure that the generated content is consistent with the text data, improving the coherence and accuracy of the summary.
As shown in fig. 2, a prior-art article on a video-based multimodal summary generation task uses a dual interaction module with a conditional self-attention mechanism and a global attention mechanism to capture the global and local semantics of text and video and obtain a high-quality text summary:
First, the text data and the video data are encoded separately: the text is encoded with a bidirectional RNN to capture semantic information, while the video is split into several segments, frame features are extracted with ResNet, and the temporal dependencies between segments are encoded with a bidirectional RNN. Second, a dual interaction module realizes deep interaction; it contains a conditional self-attention mechanism and a global attention mechanism, where the conditional self-attention captures local semantic information within a video segment and highlights key content under the guidance of the text, and the global attention processes the high-level semantic relation between text and video to achieve their deep fusion. Third, a multimodal generator produces the text summary and selects a video cover frame: the summary is generated by using an editing-gate mechanism to fuse the video-aware text representation with the original text representation, and a pointer network avoids missing generated vocabulary; the cover-frame selection is based on a hierarchical video representation, candidate frames are scored with hierarchical attention, and the most representative frame is chosen as the cover. Finally, the whole model is optimized with a combined loss function comprising the negative log-likelihood loss of the text summary and a pairwise hinge loss for cover selection, so that multimodal generation and frame selection promote each other and the generation quality improves.
However, the prior-art aspect-oriented multimodal summarization model and video-based multimodal summary generation task have the following limitations:
1. In the multimodal summarization task, the visual-feature extraction method plays a key role in the quality of the generated summary, but the extraction methods used do not fully understand the internal semantics related to the text.
2. Previous work has focused mainly on feature filtering and modality fusion on the encoder side, while neglecting fusion on the decoder side.
This leads to the following disadvantages:
1. Because the visual-feature extraction is unrelated to the internal semantics of the text data, the extracted visual features are difficult to complement the text features semantically; adding them is not necessarily beneficial to summary generation and may introduce redundant information.
2. The lack of contextual understanding of the dialogue and of the informative content of the multimodal dialogue may result in loss of summary information, losing the emotional intent underlying the dialogue.
Disclosure of Invention
The embodiments of the present application provide a multimodal dialogue summarization method based on multi-level visual guidance. It uses a pre-trained CLIP model to extract global and local features of the visual information contained in a dialogue, uses a pre-trained T5 model to extract text features of the dialogue, and fuses and concatenates the globally visually guided text features with the semantically guided local visual features, so that the multimodal dialogue information complements itself and the dialogue context is attended to, thereby improving the quality and accuracy of the generated summary.
An embodiment of the present application provides a multimodal dialogue summarization method based on multi-level visual guidance, comprising the following steps:
Extracting text features and summary-text features of the multimodal dialogue with the encoder of a T5 model, obtaining global and local visual features from the images of the input multimodal dialogue data with the visual encoder of a CLIP model, and constructing indexes over the text and visual features;
Inputting the text features and the local visual features into a local multimodal attention-cross module, and inputting the text features into a long short-term memory (LSTM) network to obtain a hidden state containing global semantic information;
In the local multimodal attention-cross module, computing self-attention weights with a conditional self-attention mechanism, computing conditional weights for the local visual features from the hidden state containing global semantic information and the local visual features, weighting the local visual features by the conditional weights and the self-attention weights, and passing the result through a Transformer encoder to obtain semantically guided local visual features;
Inputting the text features and the global visual features into a global multimodal attention-cross module, and inputting the text features into an LSTM network to obtain text features that capture context information;
In the global multimodal attention-cross module, using a cross-modal multi-head attention mechanism and a cross-modal bilinear attention fusion mechanism, computing a weighted sum of the visually informed text features produced by the two mechanisms, and obtaining globally visually guided text features through a Transformer encoder;
Inputting the semantically guided local visual features and the globally visually guided text features into a modal fusion module, and using the semantically guided local visual features to guide the globally visually guided text features through a cross-modal attention mechanism, obtaining visually guided text features;
Inputting the bidirectionally enhanced visual-text features and the summary-text features into a Transformer decoder, converting the decoded high-dimensional vectors into vectors of the vocabulary dimension through a linear layer, feeding the result into a Softmax layer to obtain a probability distribution, and decoding with beam search;
During training, computing the cross-entropy loss between the probability distribution predicted by the model and the target summary text.
Optionally, extracting the text features and summary-text features of the multimodal dialogue with the encoder of the T5 model satisfies:
T_i = T5(t_i)
S_i = T5(s_i)
where t_i is the i-th dialogue input, s_i is the target summary of the i-th dialogue, T5 is the T5 encoder, T_i is the embedding vector of the dialogue input t_i, and S_i is the embedding vector of the summary text s_i.
Optionally, obtaining the global and local visual features with the visual encoder of the CLIP model satisfies:
Vf_i = CLIP_vis(V_i)
Vf_i = {V_a, V_1, …, V_P}
where Vf_i is the visual feature obtained with the CLIP visual encoder CLIP_vis, V_i is the input image, D_v is the visual-feature dimension, V_a is the global visual feature of the image, and V_1, …, V_P are the local visual features of the image.
Optionally, in the local multimodal attention-cross module, computing the self-attention weights with the conditional self-attention mechanism and computing the conditional weights of the local visual features from the hidden state containing global semantic information and the local visual features comprises:
obtaining the query Q, key K and value V of the local visual features through mapping matrices in the conditional self-attention mechanism of the local multimodal attention-cross module;
computing the self-attention weights from the query Q, key K and value V;
and computing the conditional weights of the local visual features from the hidden state containing global semantic information and the local visual features.
Optionally, inputting the text features into the long short-term memory (LSTM) network satisfies:
where h_t is the hidden state at time step t and c_t is the cell state at time step t; the hidden state of the last step is h_n, and the output is T_s.
Optionally, the computation of the self-attention weights and the conditional weights in the local multimodal attention-cross module satisfies:
where W ∈ R^{d×d} is a weight matrix, α_{i,j} are the self-attention weights, β_i are the conditional weights, and the weighted V_i passes through a Transformer encoder to obtain the semantically guided local visual features X_v.
Optionally, inputting the global visual features into the LSTM network to obtain visual features that capture global information satisfies:
where h_t is the hidden state at time step t and c_t is the cell state at time step t, and the output is V_s.
Optionally, in the global multimodal attention-cross module, using the cross-modal multi-head attention mechanism and the cross-modal bilinear attention fusion mechanism, computing a weighted sum of the visually informed text features produced by the two mechanisms, and obtaining the globally visually guided text features through a Transformer encoder comprises:
A = V_s B T_s
A = softmax(A)
T_2 = A T_s
T = αT_1 + (1 − α)T_2
where CMA denotes cross-modal multi-head attention, W is a weight matrix, B is the bilinear weight matrix, α is a trainable weight, and the output obtained through the Transformer encoder layer is X_t.
Optionally, using the semantically guided local visual features to guide the globally visually guided text features through a cross-modal attention mechanism to obtain visually guided text features, using the globally visually guided text features to guide the semantically guided local visual features to obtain textually guided visual features, and concatenating the visually guided text features with the textually guided visual features to obtain the bidirectionally enhanced visual-text features comprises:
O_enc = concat(T, V)
where T is the text feature guided by visual information, V is the visual feature guided by text information, concat is the concatenation operation, and O_enc is the bidirectionally enhanced visual-text feature.
Optionally, during training, computing the cross-entropy loss between the probability distribution predicted by the model and the target summary text comprises:
computing, during training, the cross-entropy loss between the probability distribution predicted by the model and the target summary text;
averaging the cross-entropy losses of all time steps as the overall average loss, back-propagating the computed average loss, and updating the model parameters with gradient descent to minimize the loss.
The cross-entropy loss between the predicted probability distribution and the target summary text satisfies:
L(θ) = −Σ_i y_i log p_i
where y_i is the true distribution of the target summary text, p_i is the probability distribution predicted by the model, and θ denotes the model parameters.
According to the embodiments of the present application, the pre-trained CLIP model is used to extract the global and local features of the visual information contained in the dialogue, the pre-trained T5 model is used to extract the text features of the dialogue, and the globally visually guided text features are fused and concatenated with the semantically guided local visual features, so that the multimodal dialogue information complements itself and the dialogue context is attended to, thereby improving the quality and accuracy of the generated summary.
The foregoing is only an overview of the present application, provided so that the technical means of the application can be more clearly understood and implemented in accordance with its teachings, and so that the above and other objects, features and advantages of the present application become more readily apparent.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a technical flow of an aspect-oriented multimodal summary model of the prior art;
FIG. 2 is a prior art technical flow of a video-based multimodal summary generation task;
FIG. 3 is a basic flow diagram of a multi-modal dialog summarization method based on multi-level visual guidance according to an embodiment of the present application;
fig. 4 is an application flow diagram of a multimodal dialogue summarization method based on multi-level visual guidance according to an embodiment of the present application on a GPLM.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Abbreviations and key terms appearing in this embodiment are defined as follows:
BP: Back Propagation;
CLIP: Contrastive Language-Image Pre-training model;
T5: Text-To-Text Transfer Transformer;
CMA: Cross-Modal multi-head Attention mechanism;
LSTM: Long Short-Term Memory network;
ViT: Vision Transformer;
GPLM: Generative Pre-trained Language Model.
The embodiments of the present application provide a multimodal dialogue summarization method based on multi-level visual guidance, which, in the model training stage shown in figs. 3 and 4, is implemented by executing the following steps.
Text features and summary-text features of the multimodal dialogue are extracted with the encoder of a T5 model. In some embodiments, the extracted text features and summary-text features satisfy:
T_i = T5(t_i)
S_i = T5(s_i)
where t_i is the i-th dialogue input, s_i is the target summary of the i-th dialogue, T5 is the T5 encoder, T_i is the embedding vector of the dialogue input t_i, and S_i is the embedding vector of the summary text s_i. In this embodiment, the text features are likewise encoded as 768-dimensional vectors.
The images of the input multimodal dialogue data are passed through the visual encoder of the CLIP model to obtain global and local visual features, and indexes over the text features and visual features are constructed.
In some embodiments, obtaining the global and local visual features with the visual encoder of the CLIP model satisfies:
Vf_i = CLIP_vis(V_i)
Vf_i = {V_a, V_1, …, V_P}
where Vf_i is the visual feature obtained with the CLIP visual encoder CLIP_vis, V_i is the input image, D_v is the visual-feature dimension, V_a is the global visual feature of the image, and V_1, …, V_P are the local visual features of the image.
In this embodiment, the pre-trained multimodal language model CLIP was pre-trained on a dataset of 400 million image-text pairs constructed by crawling public images and their accompanying descriptive text from the Internet, and can capture deep semantic information of the data. The output of the visual encoder is set to 7 × 7, and the selected architecture is the CLIP-ViT-base-32 model. Each image is resized to 224 × 224, and each of the 7 × 7 patches is encoded as a 768-dimensional vector.
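As a concrete illustration of this extraction step, the sketch below uses the Hugging Face transformers library with the t5-base and openai/clip-vit-base-patch32 checkpoints; the library and exact checkpoint names are assumptions, since the embodiment only names T5 and CLIP-ViT-base-32. The CLIP vision encoder returns one global token plus 7 × 7 = 49 patch tokens, each 768-dimensional, matching the dimensions described above.

```python
import torch
from PIL import Image
from transformers import (T5Tokenizer, T5EncoderModel,
                          CLIPImageProcessor, CLIPVisionModel)

# Text features via the T5 encoder (768-dim hidden states for t5-base).
tok = T5Tokenizer.from_pretrained("t5-base")
t5_enc = T5EncoderModel.from_pretrained("t5-base")
dialogue = "agent: How can I help? user: My order arrived damaged ..."
text_inputs = tok(dialogue, return_tensors="pt")
with torch.no_grad():
    T_i = t5_enc(**text_inputs).last_hidden_state        # [1, seq_len, 768]

# Visual features via the CLIP ViT-B/32 vision encoder: the image is resized
# to 224x224 and split into a 7x7 grid of 32x32 patches.
proc = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_vis = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("dialogue_image.jpg")                  # illustrative file name
pixels = proc(images=image, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    V_f = clip_vis(pixel_values=pixels).last_hidden_state  # [1, 1 + 49, 768]

V_a = V_f[:, :1, :]    # global visual feature (class token)
V_p = V_f[:, 1:, :]    # 49 local (patch) visual features
```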
The text features and the local visual features are input into the local multimodal attention-cross module, and the text features are input into a long short-term memory (LSTM) network to obtain the last-step hidden state containing global semantic information.
In some embodiments, inputting the text features into the LSTM network satisfies:
where h_t is the hidden state at time step t and c_t is the cell state at time step t; the hidden state of the last step is h_n, and the output is T_s.
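A minimal sketch of this step, assuming a single-layer, unidirectional nn.LSTM whose hidden size matches the 768-dimensional T5 features (layer count and directionality are not stated in the embodiment); T_i is the text-feature tensor from the previous sketch.

```python
import torch.nn as nn

# LSTM over the T5 text features; hidden size kept equal to the 768-dim input.
text_lstm = nn.LSTM(input_size=768, hidden_size=768, batch_first=True)

# T_i: [batch, seq_len, 768] text features from the T5 encoder.
T_s, (h_n, c_n) = text_lstm(T_i)   # T_s: hidden state at every time step
h_global = h_n[-1]                 # last-step hidden state with global semantics
```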
In the local multimodal attention-cross module, self-attention weights are computed with a conditional self-attention mechanism, conditional weights for the local visual features are computed from the hidden state containing global semantic information and the local visual features, the local visual features are weighted by the conditional weights and the self-attention weights, and the result passes through a Transformer encoder layer to obtain the semantically guided local visual features.
In some embodiments, in the local multimodal attention-cross module, computing the self-attention weights with the conditional self-attention mechanism and computing the conditional weights of the local visual features from the hidden state containing global semantic information and the local visual features comprises:
obtaining the query Q, key K and value V of the local visual features through mapping matrices in the conditional self-attention mechanism of the local multimodal attention-cross module;
computing the self-attention weights from the query Q, key K and value V;
and finally weighting the self-attention output of the local visual features by the conditional weights, and passing the result through a Transformer encoder layer to obtain the semantically guided local visual features.
In some embodiments, these computations satisfy:
where W ∈ R^{d×d} is a weight matrix, α_{i,j} are the self-attention weights, β_i are the conditional weights, and the weighted V_i passes through a Transformer encoder to obtain the semantically guided local visual features X_v.
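Because the conditional-weighting formula itself is not reproduced above, the sketch below shows only one plausible reading: self-attention over the patch features, with each patch additionally gated by a sigmoid score between the global text hidden state and that patch. The gating form, the head count and the layer count of the Transformer encoder are all assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalSelfAttention(nn.Module):
    """One reading of the local multimodal attention-cross module: self-attention
    over local visual (patch) features, gated by a condition weight derived from
    the global text hidden state, followed by a Transformer encoder."""
    def __init__(self, d=768):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)
        self.W = nn.Linear(d, d, bias=False)   # condition weight matrix W
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, V_p, h_global):
        # V_p: [B, P, d] local patch features; h_global: [B, d] text hidden state.
        Q, K, V = self.q(V_p), self.k(V_p), self.v(V_p)
        alpha = F.softmax(Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1)), dim=-1)
        attended = alpha @ V                                     # self-attention output
        beta = torch.sigmoid((self.W(V_p) * h_global.unsqueeze(1)).sum(-1))  # [B, P]
        weighted = beta.unsqueeze(-1) * attended                 # condition-weighted patches
        return self.encoder(weighted)                            # semantic-guided X_v
```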
The text features and the global visual features are input into the global multimodal attention-cross module; the text features are input into an LSTM network to obtain text features that capture context information, and the global visual features are input into an LSTM network to obtain visual features that capture global information.
In some embodiments, inputting the global visual features into the LSTM network to obtain visual features that capture global information satisfies:
where h_t is the hidden state at time step t and c_t is the cell state at time step t, and the output is V_s.
In the global multimodal attention-cross module, a cross-modal multi-head attention mechanism and a cross-modal bilinear attention fusion mechanism are used; the visually informed text features produced by the two mechanisms are combined in a weighted sum, and the globally visually guided text features are obtained through a Transformer encoder.
In some embodiments, this computation satisfies:
A = V_s B T_s
A = softmax(A)
T_2 = A T_s
T = αT_1 + (1 − α)T_2
where CMA denotes cross-modal multi-head attention, W is a weight matrix, B is the bilinear weight matrix, α is a trainable weight, and the output obtained through the Transformer encoder layer is X_t.
In a specific example, the number of attention heads in the global multimodal attention-cross module is set to 8.
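The definition of T_1 and the orientation of the bilinear product are not spelled out in the equations above, so the sketch below adopts one dimensionally consistent reading: T_1 is cross-modal multi-head attention with the text as query and the visual sequence as key/value, and the bilinear scores let each text token attend over the visual sequence. Both choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAttentionCross(nn.Module):
    """One reading of the global multimodal attention-cross module: cross-modal
    multi-head attention (CMA) and bilinear attention, fused with a trainable
    weight alpha and passed through a Transformer encoder layer."""
    def __init__(self, d=768, heads=8):
        super().__init__()
        self.cma = nn.MultiheadAttention(embed_dim=d, num_heads=heads, batch_first=True)
        self.B = nn.Parameter(torch.randn(d, d) * 0.02)   # bilinear weight matrix B
        self.alpha = nn.Parameter(torch.tensor(0.5))      # trainable fusion weight
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, T_s, V_s):
        # T_s: [B, L, d] context-aware text features; V_s: [B, M, d] global visual features.
        T1, _ = self.cma(query=T_s, key=V_s, value=V_s)   # cross-modal multi-head attention
        scores = T_s @ self.B @ V_s.transpose(-2, -1)     # bilinear scores, [B, L, M]
        A = F.softmax(scores, dim=-1)                     # each text token attends over vision
        T2 = A @ V_s                                      # bilinear read-out, [B, L, d]
        T = self.alpha * T1 + (1 - self.alpha) * T2       # T = alpha*T1 + (1-alpha)*T2
        return self.encoder(T)                            # globally guided X_t
```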
The semantically guided local visual features and the globally visually guided text features are input into the modal fusion module. Through a cross-modal attention mechanism, the semantically guided local visual features guide the globally visually guided text features to obtain visually guided text features, the globally visually guided text features guide the semantically guided local visual features to obtain textually guided visual features, and the visually guided text features are concatenated with the textually guided visual features to obtain the bidirectionally enhanced visual-text features.
In some embodiments, this fusion satisfies:
O_enc = concat(T, V)
where T is the text feature guided by visual information, V is the visual feature guided by text information, concat is the concatenation operation, and O_enc is the bidirectionally enhanced visual-text feature.
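A sketch of the modal fusion module, assuming both guiding directions are realized with standard multi-head cross-attention; the head count (8, mirroring the encoder-side setting) is an assumption.

```python
import torch
import torch.nn as nn

class ModalFusion(nn.Module):
    """Bidirectional cross-modal guidance followed by concatenation:
    O_enc = concat(T, V) along the sequence dimension."""
    def __init__(self, d=768, heads=8):
        super().__init__()
        self.v2t = nn.MultiheadAttention(d, heads, batch_first=True)  # vision guides text
        self.t2v = nn.MultiheadAttention(d, heads, batch_first=True)  # text guides vision

    def forward(self, X_t, X_v):
        # X_t: globally guided text features [B, L, d]
        # X_v: semantic-guided local visual features [B, P, d]
        T_guided, _ = self.v2t(query=X_t, key=X_v, value=X_v)  # text guided by visual info
        V_guided, _ = self.t2v(query=X_v, key=X_t, value=X_t)  # vision guided by text info
        return torch.cat([T_guided, V_guided], dim=1)          # bidirectionally enhanced O_enc
```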
The bidirectionally enhanced visual-text features and the summary-text features are input into a Transformer decoder. The decoded text features are mapped by a linear layer from the high-dimensional vector space to the vocabulary dimension, the resulting vectors are fed into a Softmax layer and converted into a probability distribution, and the final output is obtained with beam search. In this embodiment, the beam width is set to 5.
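A sketch of the decoding head, assuming a standard nn.TransformerDecoder over the fused memory and a linear projection onto the T5 vocabulary (32,128 tokens for t5-base); the causal mask and the beam-search loop (beam width 5 in the embodiment) are omitted, and only the per-step probability computation is shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 768, 32128        # 32128 is the t5-base vocabulary size

dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=6)   # layer count assumed
to_vocab = nn.Linear(d_model, vocab_size)                  # hidden state -> vocabulary logits

def next_token_probs(summary_embeds, O_enc):
    """Cross-attend the partial summary to the fused visual-text memory,
    project to the vocabulary and return the next-token distribution."""
    h = decoder(tgt=summary_embeds, memory=O_enc)
    logits = to_vocab(h[:, -1, :])          # logits for the next position
    return F.softmax(logits, dim=-1)        # probability distribution over the vocabulary
```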
During training, the cross-entropy loss between the probability distribution predicted by the model and the target summary text is computed; after training, the trained model is deployed and applied. The cross-entropy losses of all time steps are averaged as the overall average loss, the computed average loss is back-propagated, and the model parameters are updated with gradient descent to minimize the loss.
In this embodiment, the learning rate is set to 1e-5 and the batch size is set to 4.
In some embodiments, during training, computing the cross-entropy loss between the probability distribution predicted by the model and the target summary text comprises:
computing, during training, the cross-entropy loss between the predicted probability distribution and the target summary text, which satisfies:
L(θ) = −Σ_i y_i log p_i
where y_i is the true distribution of the target summary text, p_i is the probability distribution predicted by the model, and θ denotes the model parameters.
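A sketch of the training loop under the stated hyper-parameters (learning rate 1e-5, batch size 4); `model` and `loader` are hypothetical stand-ins for the complete summarization model and its data loader, and the choice of Adam as the gradient-descent optimizer is an assumption.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()   # averages the loss over all time steps
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

for dialogues, images, target_ids in loader:       # batch size 4 in the embodiment
    logits = model(dialogues, images, target_ids)  # [B, T, vocab] per-step scores
    loss = criterion(logits.reshape(-1, logits.size(-1)),   # cross entropy against
                     target_ids.reshape(-1))                # the target summary tokens
    optimizer.zero_grad()
    loss.backward()        # back-propagate the average loss
    optimizer.step()       # gradient-descent parameter update
```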
As shown in FIG. 4, the application example first uses the T5 and CLIP models to extract, respectively, the text features of the dialogue and the global and local visual features of the images. The model then feeds the text features and the local visual features into the local multimodal attention module, where a conditional self-attention mechanism computes weights to generate semantically guided local visual features; at the same time, the text features and the global visual features are input into the global multimodal attention module and fused through multi-head and bilinear attention mechanisms to generate globally visually guided text features. These features then guide and enhance each other in the bidirectional modal fusion module to obtain fused visual-text features. Finally, the model inputs the fused features into a Transformer decoder to obtain the probability distribution for generating the summary text. During training, the difference between the generated summary and the target summary is measured with the cross-entropy loss and the generated result is optimized toward the target summary with gradient descent; outside training, the dialogue summary text with the highest probability is selected as the output.
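Putting the sketches above together, one possible end-to-end forward pass is shown below; all class and function names refer to the illustrative code given earlier and are assumptions rather than the reference implementation.

```python
import torch.nn as nn

# Modules from the earlier sketches.
local_cross  = ConditionalSelfAttention(d=768)
global_cross = GlobalAttentionCross(d=768, heads=8)
fusion       = ModalFusion(d=768, heads=8)
text_lstm    = nn.LSTM(768, 768, batch_first=True)
vis_lstm     = nn.LSTM(768, 768, batch_first=True)

def forward(T_i, S_i, V_a, V_p):
    # Context-aware text features and the global semantic hidden state.
    T_s, (h_n, _) = text_lstm(T_i)
    h_global = h_n[-1]
    # Sequence view of the global visual features.
    V_s, _ = vis_lstm(V_a)
    # Multi-level visual guidance and fusion.
    X_v = local_cross(V_p, h_global)     # semantic-guided local visual features
    X_t = global_cross(T_s, V_s)         # globally visually guided text features
    O_enc = fusion(X_t, X_v)             # bidirectionally enhanced visual-text memory
    return next_token_probs(S_i, O_enc)  # next-token probability distribution
```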
For dialogue summarization tasks that contain visual information, the method uses the pre-trained CLIP model to extract the global and local features of the visual information contained in the dialogue and the pre-trained T5 model to extract the text features of the dialogue, obtaining comprehensive, deeper visual feature representations and semantically rich text feature representations. The local multimodal attention-cross module uses an LSTM network and a conditional self-attention mechanism to capture the deep semantic association between the local visual features and the text features, yielding semantically guided local visual features and strengthening the model's deep semantic expression of the visual information. The global multimodal attention-cross module uses an LSTM network and a fusion mechanism in which cross-modal multi-head attention runs in parallel with cross-modal bilinear attention to capture globally visually guided text features, ensuring the integrity and effectiveness of the information and improving summary quality. In the modal fusion module, the features guide each other through cross-modal attention, and the multi-head encoder-decoder attention layers of the Transformer are used for decoding. Together these components improve the quality of the generated dialogue summaries and provide a new strategy for multimodal dialogue summarization.
According to the embodiments of the present application, the pre-trained CLIP model is used to extract the global and local features of the visual information contained in the dialogue, the pre-trained T5 model is used to extract the text features of the dialogue, and the globally visually guided text features are fused and concatenated with the semantically guided local visual features, so that the multimodal dialogue information complements itself and the dialogue context is attended to, thereby improving the quality and accuracy of the generated summary.
It should be noted that, in the various embodiments of the present application, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the method of the above-described embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the preferred embodiment. Based on such understanding, the technical solution of the present application, or the part contributing to the prior art, may be embodied in the form of a software product stored on a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to perform the method according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Those of ordinary skill in the art may make many variations without departing from the spirit of the present application and the scope of the claims, all of which fall within the protection of the present application.

Claims (10)

CN202411732284.3A — 2024-11-29 — 2024-11-29 — A multimodal dialogue summarization method based on multi-level visual guidance — Pending — CN119918545A (en)

Priority Applications (1)

Application Number — Priority Date — Filing Date — Title
CN202411732284.3A — 2024-11-29 — 2024-11-29 — A multimodal dialogue summarization method based on multi-level visual guidance — CN119918545A (en)

Applications Claiming Priority (1)

Application Number — Priority Date — Filing Date — Title
CN202411732284.3A — 2024-11-29 — 2024-11-29 — A multimodal dialogue summarization method based on multi-level visual guidance — CN119918545A (en)

Publications (1)

Publication Number — Publication Date
CN119918545A — 2025-05-02

Family

ID=95501325

Family Applications (1)

Application Number — Title — Priority Date — Filing Date
CN202411732284.3A — Pending — CN119918545A (en) — 2024-11-29 — 2024-11-29 — A multimodal dialogue summarization method based on multi-level visual guidance

Country Status (1)

Country — Link
CN (1) — CN119918545A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication Number — Priority Date — Publication Date — Assignee — Title
CN120235250A (en)* — 2025-05-30 — 2025-07-01 — 浙江有鹿机器人科技有限公司 — A method and device for processing images and texts of a marked compression frame
CN120336493A (en)* — 2025-06-18 — 2025-07-18 — 西湖心辰(杭州)科技有限公司 — AI multimodal dialogue system based on multimodal recognition



Legal Events

Date — Code — Title — Description
PB01 — Publication
SE01 — Entry into force of request for substantive examination
