CN119068502A - A method for authenticity identification of multimodal information in social media - Google Patents

A method for authenticity identification of multimodal information in social media

Info

Publication number
CN119068502A
Authority
CN
China
Prior art keywords
modal
text
features
cross
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411165269.5A
Other languages
Chinese (zh)
Inventor
Li Jiangfeng (李江峰)
Wang Bowen (王博文)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN202411165269.5A
Publication of CN119068502A
Legal status: Pending (current)

Abstract

The invention relates to an authenticity identification method oriented to social media multi-modal information, which comprises the following steps: S1, constructing a multi-modal feature extraction module; S2, iteratively training a mixed pooling expert framework; S3, outputting cross-modal correlation fusion features through a cross-modal semantic fusion module; S4, inputting the cross-modal correlation fusion features into an authenticity identification classifier, iteratively training the cross-modal semantic fusion module and the authenticity identification classification network based on the classifier output to obtain the trained cross-modal semantic fusion module and authenticity identification classifier, and then carrying out actual authenticity identification. Compared with the prior art, the method improves the identification accuracy of the social media multi-modal information authenticity identification model.

Description

Authenticity identification method for social media multi-modal information
Technical Field
The invention relates to the field of authenticity identification of multi-modal information, and in particular to an authenticity identification method for social media multi-modal information.
Background
Today is the age of self-media. The ease of participating in social networks has made information on social media grow explosively, and false, useless information spreads within this tide alongside true, useful information, inevitably causing large-scale negative effects. With the rapid development of artificial intelligence technologies such as deep learning, research on the authenticity identification of social media information based on these technologies has been emerging continuously and has become a mainstream direction. Early research approaches often focused on text content, which is also the primary descriptive form of network media information. As social media content has evolved from single text to a combination of modalities (typically text and pictures), false information detection studies that analyze only the text content no longer meet current needs.
Current research methods for authenticity identification based on multi-modal information mostly build feature extraction on pre-trained models. Singhal et al. [1] proposed the SpotFake framework, in which BERT [2] is used to extract text features, image features are obtained by encoding with VGG-19 [4] pre-trained on the ImageNet dataset [3], and authenticity detection is performed by combining the two modality features. On this basis, the same team [5] then proposed the SpotFake+ framework, replacing the text feature extractor with a pre-trained XLNet [6] so that complete articles can be detected. Singh et al. [7] selected a combination of BERT [8] and ELECTRA [9] to extract text modality features. Some studies consider that analyzing only the visual and text modality features is insufficient for the authenticity identification task and instead combine three kinds of features to obtain the classification result: Liu et al. [10] extract the text content embedded in the image based on DenseNet [11], Zhang et al. [12] additionally model context scene information, and Xue et al. [13] additionally extract image tampering features through the ELA algorithm [14] for detection. For multi-modal feature fusion, most current research combines different features through simple concatenation [15][16][17] or addition operations [18] to obtain the final classification result. Considering that such fusion methods are too simple to fully utilize the multi-modal information, one study [19] fuses the visual and text modality features at the decision level through average probabilities, greatly reducing the training parameters, improving performance and accelerating detection.
On the other hand, cross-modal semantic correlation has received more and more attention in research on the authenticity identification of multi-modal information [20][21][22]. Zhou et al. [23] convert visual information into textual information based on a generative model and define the cross-modal correlation between the two modalities as the cosine similarity. Khattar et al. [24] designed a special variational autoencoder to reconstruct visual and textual information and thereby quantify the cross-modal correlation between text and pictures; while effective, it is computationally expensive. Wang et al. [25] designed a multi-task learning framework that splices the multi-modal information and inputs it into an event discriminator to filter out event-specific information, retaining only general features and thereby achieving accurate detection. However, current research lacks in-depth modeling of cross-modal semantic correlation, so the cross-modal semantic relations cannot be fully learned, and it is difficult for the effectiveness of authenticity identification models to achieve a further breakthrough.
Disclosure of Invention
The invention aims to provide an authenticity identification method for social media multi-modal information, in order to fully learn the semantic correlation among different modal features and improve the identification accuracy of the social media multi-modal information authenticity identification model.
The aim of the invention can be achieved by the following technical scheme:
An authenticity identification method oriented to social media multi-modal information includes the following steps:
S1, constructing a multi-modal feature extraction module;
S2, constructing an image-text pair dataset based on an original dataset, inputting the image-text pair dataset into the multi-modal feature extraction module to obtain visual features and text features of the embedded space, inputting the visual features and the text features into a mixed pooling expert framework, outputting a loss function, and iteratively training the mixed pooling expert framework based on the loss function;
S3, acquiring given semantically aligned visual modality features and text modality features and inputting them into a cross-modal semantic fusion module, which, combining an activation function and an adjustable factor, outputs cross-modal correlation fusion features;
S4, inputting the cross-modal correlation fusion features into an authenticity identification classification network, iteratively training the cross-modal semantic fusion module and the authenticity identification classification network based on the output of the authenticity identification classification network to obtain the trained cross-modal semantic fusion module and authenticity identification classifier, then acquiring the actual social media multi-modal data to be identified, inputting the actual social media multi-modal data to be identified into the trained mixed pooling expert framework to output actual aligned data, inputting the actual aligned data into the trained cross-modal semantic fusion module and authenticity identification classifier, and outputting the social media multi-modal information identification result.
Further, inputting the image-text pair dataset into the multi-modal feature extraction module to obtain the visual features and text features of the embedded space specifically comprises:
inputting the image-text pair dataset into the multi-modal feature extraction module, which comprises a visual encoder constructed based on ViT and a text encoder; after the image-text pair dataset is input, the multi-modal feature extraction module adjusts the image data in the dataset into a flattened two-dimensional patch sequence, converts the two-dimensional patch sequence into linear embeddings, and combines the linear embeddings with the position embeddings of the two-dimensional patch sequence as the input of the visual encoder, which outputs the visual features of the embedded space;
the multi-modal feature extraction module inputs the text data in the dataset and the position embeddings of the text data into the text encoder, which outputs the text features of the embedded space.
Further, the image-text pair dataset comprises m positive image-text pairs and m(m-1) negative image-text pairs.
Further, the text encoder is constructed based on pre-trained BERT and processes the token embeddings using a linear mapping.
Further, inputting the semantically aligned visual modality features and text modality features into the cross-modal semantic fusion module and, combining an activation function and an adjustable factor, outputting the cross-modal correlation fusion features by the cross-modal semantic fusion module specifically comprises the following steps:
for the given semantically aligned visual modality features and text modality features, the cross-modal semantic fusion module calculates the attention weights between the modalities from the visual modality features and the text modality features; after the adjustable factor is added to the attention weights, they pass through a ReLU activation function to obtain a text attention correlation score CorreT→G and a visual attention correlation score CorreG→T; the text attention correlation score CorreT→G is multiplied by the text modality features, and the visual attention correlation score CorreG→T is multiplied by the visual modality features, to obtain the correlation features Tcorre and Gcorre corresponding respectively to the text modality features and the visual modality features; the cross-modal correlation fusion feature F is obtained by splicing the two correlation features.
Further, inputting the cross-modal correlation fusion features into the authenticity identification classification network and iteratively training the cross-modal semantic fusion module and the authenticity identification classification network based on the output of the authenticity identification classification network specifically comprises:
acquiring the given semantically aligned visual modality features and text modality features as well as the cross-modal correlation fusion features and inputting them together into the authenticity identification classification network; an attention module in the authenticity identification classification network assigns different weights to these inputs, a loss function for the weights is calculated from the feature distributions generated by a variational autoencoder, and the cross-modal semantic fusion module and the authenticity identification classification network are iteratively trained based on this loss function.
Further, calculating the loss function for the weights from the feature distributions generated by the variational autoencoder specifically comprises:
inputting the given semantically aligned visual modality features and text modality features respectively into the variational autoencoder, calculating the KL divergence of the two distributions based on the feature distributions generated by the variational autoencoder, and calculating the loss function for the weights based on the KL divergence and the weights assigned by the attention module.
Further, the given semantically aligned visual modality features and text modality features are output by a trained hybrid pooling expert framework.
Further, the attention module is an SE-ResNet attention module.
Further, the mixed pooling expert framework is composed of a routing gate module, an aggregation expert module and a loss function module.
Compared with the prior art, the invention has the following beneficial effects:
The invention integrates a multi-modal feature extraction module, a cross-modal semantic alignment module, a cross-modal semantic fusion module and an authenticity identification classifier module, fully learns the semantic correlation among different modal features, and makes a decision on the authenticity identification result by jointly utilizing the aligned single-modal features and the cross-modal semantic fusion features, thereby improving the detection effect.
Drawings
FIG. 1 is a cross-modal semantic fusion module framework diagram of the present invention;
FIG. 2 is a schematic diagram of the structure of the authenticity identification classifier module according to the present invention;
fig. 3 is a flow chart of the present invention.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples. The present embodiment is implemented on the premise of the technical scheme of the present invention, and a detailed implementation manner and a specific operation process are given, but the protection scope of the present invention is not limited to the following examples.
The invention aims to provide a social media multi-modal information-oriented authenticity identification method and, based on the field of deep learning, proposes a multi-modal information authenticity identification method based on cross-modal semantic correlation. The invention integrates a multi-modal feature extraction module, a cross-modal semantic alignment module, a cross-modal semantic fusion module and an authenticity identification classifier module, fully learns the semantic correlation among different modal features, and makes a decision on the authenticity identification result by jointly utilizing the aligned single-modal features and the cross-modal semantic fusion features, thereby improving the detection effect.
The flow chart of the present invention is shown in fig. 3. The method of the invention comprises the following steps:
S1, constructing a multi-modal feature extraction module;
S2, constructing an image-text pair dataset based on an original dataset, inputting the image-text pair dataset into the multi-modal feature extraction module to obtain visual features and text features of the embedded space, inputting the visual features and the text features into a mixed pooling expert framework, outputting a loss function, and iteratively training the mixed pooling expert framework based on the loss function;
S3, acquiring given semantically aligned visual modality features and text modality features and inputting them into a cross-modal semantic fusion module, which, combining an activation function and an adjustable factor, outputs cross-modal correlation fusion features;
S4, inputting the cross-modal correlation fusion features into an authenticity identification classification network, iteratively training the cross-modal semantic fusion module and the authenticity identification classification network based on the output of the authenticity identification classification network to obtain the trained cross-modal semantic fusion module and authenticity identification classifier, then acquiring the actual social media multi-modal data to be identified, inputting the actual social media multi-modal data to be identified into the trained mixed pooling expert framework to output actual aligned data, inputting the actual aligned data into the trained cross-modal semantic fusion module and authenticity identification classifier, and outputting the social media multi-modal information identification result.
In S1, each sample in a given batch of inputs is an image-text pair. For the i-th sample (ti, gi), ti represents a sentence and gi represents a single picture, where H and W denote the resolution of picture gi (with pixel values in the range [0, 255]), C = 3 denotes the number of channels of gi, and L denotes the number of tokens in sentence ti.
To extract visual features, this module designs a specialized visual encoder constructed based on the Vision Transformer (ViT). Since ViT receives a one-dimensional embedding sequence as input while the original image is three-dimensional, the image is readjusted into a flattened two-dimensional patch sequence, where P refers to the height and width of each patch and the number of patches is N = HW/P^2. After the two-dimensional patch sequence is converted into linear embeddings, the linear embeddings are combined with the two-dimensional position embeddings as the patch-wise input of the visual encoder, and the visual features of the embedded space are computed.
To extract context-enhanced text features, this module designs a specialized encoder for the text modality, constructed based on pre-trained BERT and processing the token embeddings using a linear mapping; in addition, the input of BERT also includes position embeddings, from which the text features of the embedded space are finally computed.
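A minimal PyTorch sketch of this feature extraction design is given below. The patch size, embedding dimension, encoder depth and the bert-base-uncased checkpoint are illustrative assumptions rather than values fixed by the description; the sketch only shows the flattened-patch visual encoder and the BERT-plus-linear-mapping text encoder described above.

```python
import torch
import torch.nn as nn
from transformers import BertModel


class VisualEncoder(nn.Module):
    """ViT-style encoder: flattened 2-D patch sequence + linear embedding + position embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768, depth=6, heads=12):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2                     # N = HW / P^2
        self.to_embedding = nn.Linear(in_chans * patch_size ** 2, dim)  # linear patch embedding
        self.pos_embedding = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images):                                          # images: (B, C, H, W)
        B, C, H, W = images.shape
        P = self.patch_size
        patches = images.unfold(2, P, P).unfold(3, P, P)                # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        tokens = self.to_embedding(patches) + self.pos_embedding        # linear + position embedding
        return self.encoder(tokens)                                     # visual features of the embedded space


class TextEncoder(nn.Module):
    """Pre-trained BERT backbone followed by a linear mapping of the token embeddings."""
    def __init__(self, dim=768, bert_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.proj = nn.Linear(self.bert.config.hidden_size, dim)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.proj(out.last_hidden_state)                         # text features of the embedded space
```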
In S2, an image-text pair dataset for the cross-modal semantic alignment module is first constructed based on the original dataset; it contains m positive image-text pairs and, correspondingly, m(m-1) negative pairs. The data are then input into the multi-modal feature extraction module for processing to obtain the visual features and text features in a D-dimensional embedded space. The patch-level visual features and the word-level text features are then separately aggregated within the visual-semantic shared embedding space by a specialized feature aggregation strategy, namely the mixed pooling expert framework. The framework comprises three parts: a routing gate module that routes different samples, an aggregation expert module that aggregates the fragment features of different samples into whole vectors, and a loss function module that optimizes semantic alignment and the load balance of the aggregation experts. A similarity score matrix between pictures and texts is computed through the framework, hard negative image-text pairs are then mined according to this matrix, new image-text pairs are constructed, and the corresponding image-text pairs are classified.
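The pair construction and hard-negative mining step can be pictured with the small sketch below; the diagonal-positive convention and the argmax-based mining rule are reasonable assumptions about how the similarity matrix is used, not an exact reproduction of the patented procedure.

```python
import torch


def build_pair_labels(m: int) -> torch.Tensor:
    # For a batch of m aligned samples, the m diagonal image-text pairs are positives
    # and the m * (m - 1) off-diagonal pairs are negatives.
    return torch.eye(m)


def mine_hard_negatives(sim: torch.Tensor):
    """sim: (m, m) similarity score matrix with sim[i, j] = score(image_i, text_j)."""
    m = sim.size(0)
    mask = torch.eye(m, dtype=torch.bool, device=sim.device)
    sim_neg = sim.masked_fill(mask, float("-inf"))        # exclude the positive pairs
    hard_text_for_image = sim_neg.argmax(dim=1)           # hardest negative text for each image
    hard_image_for_text = sim_neg.argmax(dim=0)           # hardest negative image for each text
    return hard_text_for_image, hard_image_for_text
```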
In S3, for the given semantically aligned visual modality features and text modality features, the module first calculates the attention weights between the modalities based on the relations among the single-modal embedded representations. In order to strengthen the learning of correlation, an activation function ReLU(x) = max{x, 0} and an adjustable factor ε are introduced to retain the weight scores with high correlation, so that features with low inter-modal correlation are discarded. The module framework diagram is shown in fig. 1.
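The sketch below illustrates one plausible reading of this fusion step in PyTorch: a scalar inter-modal attention score per sample, the adjustable factor ε added before a ReLU gate, and concatenation of the two correlation features into F. The linear projections and the per-sample scalar score are assumptions; the patent's exact attention formulation is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalFusion(nn.Module):
    def __init__(self, dim=768, epsilon=0.1):
        super().__init__()
        self.epsilon = epsilon                             # adjustable factor from the description
        self.q_text, self.k_img = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.q_img, self.k_text = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, text_feat, img_feat):                # both: (B, D), semantically aligned
        scale = text_feat.size(-1) ** 0.5
        attn_t2g = (self.q_text(text_feat) * self.k_img(img_feat)).sum(-1, keepdim=True) / scale
        attn_g2t = (self.q_img(img_feat) * self.k_text(text_feat)).sum(-1, keepdim=True) / scale
        corre_t2g = F.relu(attn_t2g + self.epsilon)        # CorreT->G: low-correlation scores are zeroed
        corre_g2t = F.relu(attn_g2t + self.epsilon)        # CorreG->T
        t_corre = corre_t2g * text_feat                    # Tcorre
        g_corre = corre_g2t * img_feat                     # Gcorre
        return torch.cat([t_corre, g_corre], dim=-1)       # cross-modal correlation fusion feature F
```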
In S4, the input of this module consists of three parts, namely the aligned visual modality features and text modality features learned by the cross-modal semantic alignment module, and the cross-modal correlation fusion features derived from the cross-modal semantic fusion module. The authenticity identification classifier module assigns different weights to the semantically aligned single-modal features and the fusion features through a dedicated SE-ResNet attention module, constrains these weights through the feature distributions generated by a variational autoencoder (VAE), and finally obtains the features that are input into the classifier to obtain the detection result. The schematic structure of the authenticity identification classifier module is shown in fig. 2.
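A hedged sketch of the classifier-side machinery follows: a squeeze-and-excitation style gate over the three feature branches, a small VAE head per modality, and a KL-divergence term that regularises the branch weights. The latent size, the mean-squeeze, and the exact form of the weight regulariser are assumptions about one plausible implementation, not the patented loss itself.

```python
import torch
import torch.nn as nn


class SEBranchAttention(nn.Module):
    """SE-style gate assigning a weight to each of the three feature branches."""
    def __init__(self, branches=3, reduction=2):
        super().__init__()
        hidden = max(branches // reduction, 1)
        self.fc = nn.Sequential(nn.Linear(branches, hidden), nn.ReLU(),
                                nn.Linear(hidden, branches), nn.Sigmoid())

    def forward(self, feats):                        # feats: list of (B, D) tensors
        squeezed = torch.stack([f.mean(dim=-1) for f in feats], dim=-1)  # (B, branches)
        return self.fc(squeezed)                     # per-branch weights in (0, 1)


class ModalVAE(nn.Module):
    """Tiny VAE head producing a diagonal Gaussian over a latent space for one modality."""
    def __init__(self, dim=768, latent=64):
        super().__init__()
        self.mu, self.logvar = nn.Linear(dim, latent), nn.Linear(dim, latent)

    def forward(self, x):
        return self.mu(x), self.logvar(x)


def gaussian_kl(mu1, logvar1, mu2, logvar2):
    """KL( N(mu1, var1) || N(mu2, var2) ) per sample, summed over latent dimensions."""
    var1, var2 = logvar1.exp(), logvar2.exp()
    return 0.5 * (logvar2 - logvar1 + (var1 + (mu1 - mu2) ** 2) / var2 - 1).sum(dim=-1)


def weight_regulariser(weights, kl):
    # One plausible reading: when the two modal distributions diverge (large KL), penalise
    # assigning near-equal weights to the visual and text branches (columns 0 and 1).
    return (kl.detach() * (1.0 - (weights[:, 0] - weights[:, 1]).abs())).mean()
```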
The mixed pooling expert framework of the invention specifically comprises:
For the input image data and text data, the method fully utilizes a self-attention mechanism based on intra-modal relations through the routing gate module and routes each sample to a suitable pooling expert for feature aggregation. Notably, only the detailed routing process of the visual branch is described here, as the routing principle of the text branch is the same. First, a set of regional features G = {g1, g2, …, gn} is used as the input of the routing gate policy. Let Q, K and V denote respectively the query, key and value vectors in the attention mechanism, where n represents the number of regions and d refers to the dimension of the vectors. Considering that the dot product between the query and key may contain noise, this module proposes a gate mechanism to filter unwanted information and retain useful information. The module first calculates the gate masks for the query and key vectors, then uses the gate masks to reduce the noise in the query and key vectors and calculates the attention-weighted value vector attn, and then converts the gated-attention feature attn into the routing gate representation vector Z using a fully connected layer. Finally, each picture G and each text T is routed, based on the normalized distribution over the m pooling experts (also referred to as aggregation operators), to the pooling expert of the corresponding modality with the highest probability.
In the routing gate module, considering that the dot product between the query and key may contain noise, a gate mechanism is proposed to filter unwanted information and retain useful information. The module first calculates the gate masks for the query and key vectors from the result of the element-wise multiplication of the query and key vectors: the gate masks MQ and MK for Q and K are generated by two fully connected layers followed by the sigmoid activation function, where σ represents the sigmoid operation.
The model then uses the gate masks to denoise the query and key vectors, calculates the attention-weighted value vector attn, and converts the gated-attention feature attn into the routing gate representation vector Z using a fully connected layer. The calculation process can be expressed as:
Z = attn·Wattn + battn
where a softmax function is applied to each row of Z and m represents the number of pooling experts.
Finally, each picture G and each text T is routed, based on the normalized distribution over the m pooling experts (also referred to as aggregation operators), to the pooling expert of the corresponding modality with the highest probability, where the probability pi(G) or pi(T) of G or T being routed to expert i is obtained from the softmax-normalized routing representation Z. The routing probability of the optimal pooling expert for G is then calculated as p(G) = max{pi(G)}. For the word-level text features T = {t1, t2, …, ts}, the routing probability of the best pooling expert for sentence T is likewise p(T) = max{pi(T)}.
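The routing gate logic sketched above (element-wise Q⊙K gate masks, gated attention, a fully connected layer producing Z, and a softmax over the m experts) might look as follows in PyTorch; the mean-pooling of attn before the fully connected layer and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RoutingGate(nn.Module):
    def __init__(self, dim=768, num_experts=4):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.gate_q = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())   # produces M_Q
        self.gate_k = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())   # produces M_K
        self.to_route = nn.Linear(dim, num_experts)                      # attn -> Z

    def forward(self, feats):                          # feats: (B, n, d) region or word features
        Q, K, V = self.q_proj(feats), self.k_proj(feats), self.v_proj(feats)
        QK = Q * K                                     # element-wise product of query and key
        MQ, MK = self.gate_q(QK), self.gate_k(QK)      # gate masks via FC + sigmoid
        Qg, Kg = Q * MQ, K * MK                        # denoised query and key
        attn = F.softmax(Qg @ Kg.transpose(1, 2) / Q.size(-1) ** 0.5, dim=-1) @ V
        Z = self.to_route(attn.mean(dim=1))            # routing gate representation vector Z
        probs = F.softmax(Z, dim=-1)                   # normalized distribution over the m experts
        p, expert = probs.max(dim=-1)                  # p(G) = max_i p_i(G): chosen expert and its probability
        return probs, p, expert
```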
In order to aggregate the visual and textual fragment features in the visual-semantic shared space into whole embedding vectors, and thereby measure the similarity between the visual and text modalities, the method designs a dedicated aggregation expert module. The module consists of m pooling experts, each of which pools the pictures and texts routed to it based on the same model structure. Specifically, the module first sorts the single-modal fragment features along the feature dimensions to be aggregated, then assigns weights to the sorted features based on a certain mechanism, computes the weighted sum of the features to obtain the corresponding whole vector, and finally multiplies the whole vector by the probability value computed by the routing gate module to obtain the final aggregated vector. For a set of inputs with n regional features, the selected pooling expert aggregates them into a fixed-length vector. Let maxk(·) denote the function that extracts the first k values from the sorted feature list; the coefficient θk is the weight corresponding to the k-th highest value after sorting by feature value along the dimension to be aggregated, i.e., the region dimension of the features, with 1 ≤ k ≤ n; the weight values form a normalized distribution over the n regions; and p(G) denotes the routing probability value of picture G computed by the routing gate module. The pooling expert needs to derive the value of θk through a certain computational mechanism, and an encoder-decoder architecture is used here to achieve this goal. Specifically, the structure comprises two parts: a position encoder, a position encoding function based on trigonometric functions that encodes the region positions of the features to obtain a corresponding vectorized representation, and a position decoder that processes the vectors produced by the encoder based on a BiGRU sequence model to generate the pooling coefficients.
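A condensed sketch of one pooling expert is given below: sorting the fragment features along each dimension, producing the pooling coefficients θk from a trigonometric position encoding decoded by a BiGRU, and scaling the weighted sum by the routing probability p(G). The position-encoding dimension, hidden size and the softmax normalisation of θk are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def sinusoidal_positions(n, dim, device):
    """Trigonometric position encoding for the n rank positions (dim must be even)."""
    pos = torch.arange(n, device=device, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, dim, 2, device=device, dtype=torch.float32)
    angle = pos / torch.pow(torch.tensor(10000.0, device=device), i / dim)
    pe = torch.zeros(n, dim, device=device)
    pe[:, 0::2], pe[:, 1::2] = torch.sin(angle), torch.cos(angle)
    return pe


class PoolingExpert(nn.Module):
    def __init__(self, pos_dim=32, hidden=32):
        super().__init__()
        self.pos_dim = pos_dim
        self.decoder = nn.GRU(pos_dim, hidden, batch_first=True, bidirectional=True)  # BiGRU position decoder
        self.to_coef = nn.Linear(2 * hidden, 1)

    def forward(self, feats, route_prob):              # feats: (B, n, D), route_prob: (B,)
        B, n, _ = feats.shape
        sorted_feats, _ = feats.sort(dim=1, descending=True)       # rank values per feature dimension
        pe = sinusoidal_positions(n, self.pos_dim, feats.device)   # encode the rank positions
        dec, _ = self.decoder(pe.unsqueeze(0).expand(B, -1, -1).contiguous())
        theta = F.softmax(self.to_coef(dec), dim=1)                # pooling coefficients theta_k
        pooled = (theta * sorted_feats).sum(dim=1)                 # weighted sum -> whole vector
        return route_prob.unsqueeze(-1) * pooled                   # scale by the routing probability p(G)
```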
In order to constrain the process of cross-modal semantic alignment, i.e., to keep a higher similarity score between semantically similar samples and a lower score between semantically unrelated samples, a dedicated loss function module containing two loss functions is designed. The first is a bidirectional triplet ranking loss function that optimizes the distance between cross-modal semantic features; the second is a load balancing loss function that balances the load of samples routed to each pooling expert.
The bidirectional triplet ranking loss is the semantic alignment objective function, while the load balancing loss balances the load of the routing experts. Given m experts and a batch of samples, the load balancing loss function is expressed in terms of Si and Pi, where Si represents the proportion of samples assigned to expert i and Pi represents the proportion of routing probabilities assigned to expert i.
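Since the patent text describes Si and Pi verbally but the formula itself is not reproduced here, the sketch below uses the common Switch-Transformer-style auxiliary term m · Σi Si·Pi as an assumed concrete form, together with a standard hinge-based bidirectional triplet ranking loss over hardest negatives.

```python
import torch


def bidirectional_triplet_loss(sim, margin=0.2):
    """sim: (m, m) image-text similarity matrix with positives on the diagonal."""
    m = sim.size(0)
    pos = sim.diag().unsqueeze(1)                                        # (m, 1)
    mask = torch.eye(m, dtype=torch.bool, device=sim.device)
    # image -> text: hardest negative text for each image
    cost_i2t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0).max(dim=1).values
    # text -> image: hardest negative image for each text
    cost_t2i = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0).max(dim=0).values
    return (cost_i2t + cost_t2i).mean()


def load_balancing_loss(route_probs, expert_assign, num_experts):
    """route_probs: (B, m) softmax routing probabilities; expert_assign: (B,) chosen expert indices."""
    S = torch.bincount(expert_assign, minlength=num_experts).float() / expert_assign.numel()
    P = route_probs.mean(dim=0)                  # average routing probability per expert
    return num_experts * (S * P).sum()           # assumed form: m * sum_i S_i * P_i
```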
The routing gate module can adaptively assign each sample to an appropriate pooling operator based on differences in intra-modal relations, while the aggregation expert module combines multiple pooling experts to pool and aggregate the fragment features of different samples, and the additional auxiliary load balancing objective function facilitates load balancing among the pooling experts. The framework is realized based on the visual-semantic shared space paradigm, is suitable for various picture encoders and text encoders, and can promote cross-modal semantic alignment in different multi-modal related tasks in a plug-and-play manner.
The documents cited in the present invention are as follows:
[1] Singhal S, Shah R R, Chakraborty T, et al. SpotFake: A multi-modal framework for fake news detection[C]//2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM). IEEE, 2019: 39-47.
[2] Kenton J D M W C, Toutanova L K. BERT: Pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of NAACL-HLT. 2019: 4171-4186.
[3] Deng J, Dong W, Socher R, et al. ImageNet: A large-scale hierarchical image database[C]//2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009: 248-255.
[4] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[C]//Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
[5] Singhal S, Kabra A, Sharma M, et al. SpotFake+: A multimodal framework for fake news detection via transfer learning (student abstract)[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2020, 34(10): 13915-13916.
[6] Yang Z, Dai Z, Yang Y, et al. XLNet: Generalized autoregressive pretraining for language understanding[J]. Advances in Neural Information Processing Systems, 2019, 32.
[7] Singh P, Srivastava R, Rana K P S, et al. SEMI-FND: Stacked ensemble based multimodal inferencing framework for faster fake news detection[J]. Expert Systems with Applications, 2023, 215: 119302.
[8] Clark K, Luong M T, Le Q V, Manning C D. ELECTRA: Pre-training text encoders as discriminators rather than generators[C]//Proceedings of the 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 2020.
[9] Liu Jinshuo, Feng Kuo, Pan J Z. MSRD: Multimodal network rumor detection method[J]. Journal of Computer Research and Development, 2020, 57(11): 2328-2336.
[10] Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 4700-4708.
[11] Zhang G, Giachanou A, Rosso P. SceneFND: Multimodal fake news detection by modelling scene context information[J]. Journal of Information Science, 2024, 50(2): 355-367.
[12] Xue J, Wang Y, Tian Y, et al. Detecting fake news by exploring the consistency of multimodal data[J]. Information Processing & Management, 2021, 58(5): 102610.
[13] Huang N E, Shen Z, Long S R, et al. The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis[J]. Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 1998, 454(1971): 903-995.
[14] Segura-Bedmar I, Alonso-Bartolome S. Multimodal fake news detection[J]. Information, 2022, 13(6): 284.
[15] Xiong S, Zhang G, Batra V, et al. TRIMOON: Two-round inconsistency-based multi-modal fusion network for fake news detection[J]. Information Fusion, 2023, 93: 150-158.
[16] Jin Z, Cao J, Guo H, et al. Multimodal fusion with recurrent neural networks for rumor detection on microblogs[C]//Proceedings of the 25th ACM International Conference on Multimedia. 2017: 795-816.
[17] Wu Y, Zhan P, Zhang Y, et al. Multimodal fusion with co-attention networks for fake news detection[C]//Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021: 2560-2569.
[18] Müller-Budack E, Theiner J, Diering S, et al. Multimodal analytics for real-world news using measures of cross-modal entity consistency[C]//Proceedings of the 2020 International Conference on Multimedia Retrieval. 2020: 16-25.
[19] Zhang W, Gui L, He Y. Supervised contrastive learning for multimodal unreliable news detection in COVID-19 pandemic[C]//Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2021: 3637-3641.
[20] Shang L, Kou Z, Zhang Y, et al. A duo-generative approach to explainable multimodal COVID-19 misinformation detection[C]//Proceedings of the ACM Web Conference 2022. 2022: 3623-3631.
[21] Zhou X, Wu J, Zafarani R. SAFE: Similarity-aware multi-modal fake news detection[C]//Pacific-Asia Conference on Knowledge Discovery and Data Mining. Cham: Springer International Publishing, 2020: 354-367.
[22] Khattar D, Goud J S, Gupta M, et al. MVAE: Multimodal variational autoencoder for fake news detection[C]//The World Wide Web Conference. 2019: 2915-2921.
[23] Wang Y, Ma F, Jin Z, et al. EANN: Event adversarial neural networks for multi-modal fake news detection[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018: 849-857.
[24] Wei Z, Pan H, Qiao L, et al. Cross-modal knowledge distillation in multi-modal fake news detection[C]//ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022: 4733-4737.
[25] Singhal S, Dhawan M, Shah R R, et al. Inter-modality discordance for multimodal fake news detection[C]//Proceedings of the 3rd ACM International Conference on Multimedia in Asia. 2021: 1-7.
The foregoing describes in detail preferred embodiments of the present invention. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the invention by one of ordinary skill in the art without undue burden. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by the person skilled in the art according to the inventive concept shall be within the scope of protection defined by the claims.

Claims (10)

S4, inputting the cross-modal correlation fusion features into an authenticity identification classification network, iteratively training the cross-modal semantic fusion module and the authenticity identification classification network based on the output of the authenticity identification classification network to obtain the trained cross-modal semantic fusion module and authenticity identification classifier, then acquiring the actual social media multi-modal data to be identified, inputting the actual social media multi-modal data to be identified into the trained mixed pooling expert framework to output actual aligned data, inputting the actual aligned data into the trained cross-modal semantic fusion module and authenticity identification classifier, and outputting the social media multi-modal information identification result.
For the given semantically aligned visual modality features and text modality features, the cross-modal semantic fusion module calculates the attention weights between the modalities from the visual modality features and the text modality features; after the adjustable factor is added to the attention weights, they pass through a ReLU activation function to obtain a text attention correlation score CorreT→G and a visual attention correlation score CorreG→T; the text attention correlation score CorreT→G is multiplied by the text modality features, and the visual attention correlation score CorreG→T is multiplied by the visual modality features, to obtain the correlation features Tcorre and Gcorre corresponding respectively to the text modality features and the visual modality features; the cross-modal correlation fusion feature F is obtained by splicing the two correlation features.
CN202411165269.5A | 2024-08-23 | 2024-08-23 | A method for authenticity identification of multimodal information in social media | Pending | CN119068502A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411165269.5A (CN119068502A) | 2024-08-23 | 2024-08-23 | A method for authenticity identification of multimodal information in social media

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202411165269.5A (CN119068502A) | 2024-08-23 | 2024-08-23 | A method for authenticity identification of multimodal information in social media

Publications (1)

Publication Number | Publication Date
CN119068502A | 2024-12-03

Family

ID=93642269

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202411165269.5A | A method for authenticity identification of multimodal information in social media (CN119068502A, pending) | 2024-08-23 | 2024-08-23

Country Status (1)

Country | Link
CN | CN119068502A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113822224A (en)* | 2021-10-12 | 2021-12-21 | National University of Defense Technology | Rumor detection method and device integrating multi-modal learning and multi-granularity structure learning
CN114386534A (en)* | 2022-01-29 | 2022-04-22 | Anhui Agricultural University | An image augmentation model training method and image classification method based on variational autoencoder and adversarial generative network
CN118039056A (en)* | 2023-12-13 | 2024-05-14 | Suzhou Institute for Advanced Research, University of Science and Technology of China | Pre-training method, system and application of context-aware medical visual language model
CN118114188A (en)* | 2024-04-30 | 2024-05-31 | Jiangxi Normal University | False news detection method based on multi-view and layered fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIANGFENG LI et al.: "MoPE: Mixture of Pooling Experts Framework for Image-Text Retrieval", Multimedia Modeling, 2 February 2024 (2024-02-02), pages 396-409 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119691434A (en)* | 2025-02-25 | 2025-03-25 | 北京智慧易科技有限公司 | A data authenticity identification method, system, device and storage medium
CN120431529A (en)* | 2025-07-09 | 2025-08-05 | Northeastern University | Multi-agent deep forgery attack detection system for multi-modal data
CN120431529B (en)* | 2025-07-09 | 2025-09-12 | Northeastern University | Multi-agent deepfake attack detection system for multimodal data

Similar Documents

PublicationPublication DateTitle
CN119068502A (en) A method for authenticity identification of multimodal information in social media
CN112256866B (en)Text fine-grained emotion analysis algorithm based on deep learning
CN119862861B (en)Visual-text collaborative abstract generation method and system based on multi-modal learning
CN117312577B (en) Traffic incident knowledge graph construction method based on multi-layer semantic graph convolutional neural network
CN113705238A (en)Method and model for analyzing aspect level emotion based on BERT and aspect feature positioning model
Al-Tameemi et al.Multi-model fusion framework using deep learning for visual-textual sentiment classification
CN119380144B (en)Multi-mode large model training data acquisition method and system
CN118153016B (en)Authentication system based on artificial intelligence
Sun et al.Image steganalysis based on convolutional neural network and feature selection
CN114220145A (en)Face detection model generation method and device and fake face detection method and device
Lu et al.Self‐supervised domain adaptation for cross‐domain fault diagnosis
Chen et al.CNFRD: A Few‐Shot Rumor Detection Framework via Capsule Network for COVID‐19
CN119538005A (en) A sentiment classification algorithm combining dual attention mechanism and Bi-LSTM
CN118690273A (en) Enhanced Heterogeneous Graph Attention Network for Fake News Detection
ZhuA graph neural network-enhanced knowledge graph framework for intelligent analysis of policing cases
CN116881792A (en) Power quality signal identification method based on MEEMD-CNN-BiLSTM-ATT hybrid model
Ermatita et al.Sentiment Analysis of COVID-19 using Multimodal Fusion Neural Networks.
CN118940164B (en) A method for detecting public safety emergencies based on big data
Huang et al.Application of Fashion Element Trend Prediction Model Integrating AM and EfficientNet-b7 Models in Art Design
CN119067124A (en) A cross-modal semantic alignment method for authenticity identification of social media information
CN119129607B (en)Multi-mode aspect-level emotion analysis method and system
CN119962535B (en)Attention mechanism-based optimal feature selection multi-modal named entity recognition method
Liang et al.AMEMD-FSL: fuse attention mechanism and earth mover’s distance metric network to deep learning for few-shot image recognition
CN120045694A (en)Content management method and system based on big data
Sun et al.Multi-Modal Fake News Detection Aided by Multi-Viewpoint Representation from a Multi-Modal Large Language Model

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
