Disclosure of Invention
The invention aims to provide an authenticity identification method for social media multi-modal information that fully learns the semantic relevance among features of different modalities and improves the identification accuracy of the authenticity identification model for social media multi-modal information.
The aim of the invention can be achieved by the following technical scheme:
An authenticity identification method for social media multi-modal information includes the following steps:
S1, constructing a multi-modal feature extraction module;
S2, constructing an image-text pair dataset based on an original dataset, inputting the image-text pair dataset into the multi-modal feature extraction module to obtain visual features and text features of an embedded space, inputting the visual features and the text features into a mixed pooling expert framework, outputting a loss function, and iteratively training the mixed pooling expert framework based on the loss function;
S3, acquiring given semantically aligned visual modality features and text modality features, inputting them into a cross-modal semantic fusion module, and, combining an activation function and an adjustable factor, outputting cross-modal correlation fusion features from the cross-modal semantic fusion module;
S4, inputting the cross-modal correlation fusion features into an authenticity identification classification network, and iteratively training the cross-modal semantic fusion module and the authenticity identification classification network based on the output of the classification network to obtain a trained cross-modal semantic fusion module and an authenticity identification classifier; then acquiring the actual social media multi-modal data to be identified, inputting it into the trained mixed pooling expert framework to output actual alignment data, inputting the actual alignment data into the trained cross-modal semantic fusion module and authenticity identification classifier, and outputting the social media multi-modal information identification result.
Further, inputting the image-text pair dataset into the multi-modal feature extraction module to obtain the visual features and text features of the embedded space specifically comprises:
inputting the image-text pair dataset into the multi-modal feature extraction module, wherein the multi-modal feature extraction module comprises a visual encoder constructed based on ViT and a text encoder; after the image-text pair dataset is input, the multi-modal feature extraction module reshapes the image data in the dataset into a flattened two-dimensional patch sequence, converts the patch sequence into linear embeddings, combines the linear embeddings with the position embeddings of the patch sequence as the input of the visual encoder, and the visual encoder outputs the visual features of the embedded space;
the multi-modal feature extraction module inputs the text data in the dataset together with the position embeddings of the text data into the text encoder, and the text encoder outputs the text features of the embedded space.
Further, the image-text pair dataset comprises m positive image-text pairs and m(m-1) negative image-text pairs.
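By way of illustration only, the following sketch shows how such a pair set could be assembled from m matched image-text samples; the function name build_pair_dataset and its inputs are hypothetical and not part of the invention.

```python
from itertools import permutations

def build_pair_dataset(samples):
    """Build image-text pairs from m matched (text, image) samples.

    Returns m positive pairs (matched text/image) and m*(m-1) negative
    pairs (every text combined with every non-matching image).
    """
    positives = [(t, g, 1) for t, g in samples]                      # m positive pairs
    negatives = [(samples[i][0], samples[j][1], 0)                   # m*(m-1) negative pairs
                 for i, j in permutations(range(len(samples)), 2)]
    return positives + negatives

# Usage: with m = 3 samples this yields 3 positive and 6 negative pairs.
pairs = build_pair_dataset([("text_a", "img_a"), ("text_b", "img_b"), ("text_c", "img_c")])
```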
Further, the text encoder is constructed based on pre-trained BERT and processes token embeddings using a linear mapping.
Further, inputting the semantically aligned visual modality features and text modality features into the cross-modal semantic fusion module and, combining an activation function and an adjustable factor, outputting the cross-modal correlation fusion features from the cross-modal semantic fusion module specifically comprises:
for the given semantically aligned visual modality features G and text modality features T, the cross-modal semantic fusion module calculates inter-modal attention weights between G and T; the attention weights, combined with the adjustable factor, are passed through a ReLU activation function to obtain a text attention correlation score Corre_{T→G} and a visual attention correlation score Corre_{G→T}; the text attention correlation score Corre_{T→G} is multiplied by the text modality features T, and the visual attention correlation score Corre_{G→T} is multiplied by the visual modality features G, to obtain the correlation features Tcorre and Gcorre corresponding to the text and visual modality features respectively, and the two correlation features are concatenated to obtain the cross-modal correlation fusion feature F.
Further, inputting the cross-modal correlation fusion features into the authenticity identification classification network and iteratively training the cross-modal semantic fusion module and the authenticity identification classification network based on the output of the classification network specifically comprises:
acquiring the given semantically aligned visual modality features and text modality features together with the cross-modal correlation fusion features and inputting them jointly into the authenticity identification classification network; an attention module in the classification network assigns different weights to these inputs, a weight loss function is calculated through the feature distributions generated by a variational autoencoder, and the cross-modal semantic fusion module and the authenticity identification classification network are iteratively trained based on the loss function.
Further, calculating the weight loss function through the feature distributions generated by the variational autoencoder specifically comprises:
inputting the given semantically aligned visual modality features and text modality features into the variational autoencoder respectively, calculating the KL divergence between the two feature distributions generated by the variational autoencoder, and calculating the weight loss function based on the KL divergence and the weights assigned by the attention module.
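Assuming the variational autoencoder encodes each modality into a diagonal Gaussian latent distribution, one plausible closed form of the KL term described above is given below; how exactly it is combined with the attention weights is left open here and is not fixed by this sketch.

```latex
% Assumed diagonal-Gaussian latents: q_G = N(mu_G, diag(sigma_G^2)) for the visual
% features and q_T = N(mu_T, diag(sigma_T^2)) for the text features.
\mathrm{KL}\!\left(q_G \,\|\, q_T\right)
  = \frac{1}{2} \sum_{d=1}^{D} \left(
      \log \frac{\sigma_{T,d}^{2}}{\sigma_{G,d}^{2}}
      + \frac{\sigma_{G,d}^{2} + \left(\mu_{G,d} - \mu_{T,d}\right)^{2}}{\sigma_{T,d}^{2}}
      - 1 \right)
```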
Further, the given semantically aligned visual modality features and text modality features are output by the trained mixed pooling expert framework.
Further, the attention module is an SE-ResNet attention module.
Further, the mixed pooling expert framework is composed of a routing gate module, an aggregation expert module and a loss function module.
Compared with the prior art, the invention has the following beneficial effects:
The invention integrates a multi-modal feature extraction module, a cross-modal semantic alignment module, a cross-modal semantic fusion module and an authenticity identification classifier module, fully learns the semantic relevance among features of different modalities, and makes the authenticity identification decision jointly from the aligned single-modality features and the cross-modal semantic fusion features, thereby improving the detection effect.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific embodiments. The embodiments are implemented on the premise of the technical scheme of the invention, and a detailed implementation manner and specific operation process are given, but the protection scope of the invention is not limited to the following examples.
The invention aims to provide an authenticity identification method oriented to social media multi-modal information and, based on deep learning, proposes a multi-modal information authenticity identification method built on cross-modal semantic relations. The invention integrates a multi-modal feature extraction module, a cross-modal semantic alignment module, a cross-modal semantic fusion module and an authenticity identification classifier module, fully learns the semantic relevance among features of different modalities, and makes the authenticity identification decision jointly from the aligned single-modality features and the cross-modal semantic fusion features, thereby improving the detection effect.
The flow chart of the present invention is shown in fig. 3. The method of the invention comprises the following steps:
S1, constructing a multi-modal feature extraction module;
S2, constructing an image-text pair dataset based on an original dataset, inputting the image-text pair dataset into the multi-modal feature extraction module to obtain visual features and text features of an embedded space, inputting the visual features and the text features into a mixed pooling expert framework, outputting a loss function, and iteratively training the mixed pooling expert framework based on the loss function;
S3, acquiring given semantically aligned visual modality features and text modality features, inputting them into a cross-modal semantic fusion module, and, combining an activation function and an adjustable factor, outputting cross-modal correlation fusion features from the cross-modal semantic fusion module;
S4, inputting the cross-modal correlation fusion features into an authenticity identification classification network, and iteratively training the cross-modal semantic fusion module and the authenticity identification classification network based on the output of the classification network to obtain a trained cross-modal semantic fusion module and an authenticity identification classifier; then acquiring the actual social media multi-modal data to be identified, inputting it into the trained mixed pooling expert framework to output actual alignment data, inputting the actual alignment data into the trained cross-modal semantic fusion module and authenticity identification classifier, and outputting the social media multi-modal information identification result.
In S1, for a given batch of inputs, each of the N samples in the batch is an image-text pair. For the i-th sample (t_i, g_i), t_i represents a sentence and g_i represents a single picture, where H × W is the resolution of picture g_i, C = 3 is the number of channels of g_i, and L is the number of tokens in sentence t_i.
To extract visual features, this module designs a dedicated visual encoder constructed based on the Vision Transformer (ViT). ViT expects a one-dimensional sequence of embeddings as input, whereas the original image is three-dimensional. The image is therefore reshaped into a flattened two-dimensional patch sequence, where P denotes the height and width of each patch and the number of patches is N = HW/P². After the patch sequence is converted into linear embeddings, these are combined with the position embeddings of the patches as the input of the visual encoder, which computes the visual features of the embedded space.
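As a hedged illustration of the patch processing just described, the sketch below implements the standard ViT-style flattening and linear projection in PyTorch; the image size, patch size and embedding dimension are assumptions, not values fixed by the invention.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Reshape an image into N = HW/P^2 flattened patches and linearly embed them."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is the usual way to flatten and linearly embed patches.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, images):                      # images: (B, C, H, W)
        x = self.proj(images)                       # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)            # (B, N, D) flattened patch sequence
        return x + self.pos_embed                   # add position embeddings

# Example: a batch of 2 RGB images of resolution 224x224 -> (2, 196, 768) patch tokens.
tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
```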
To extract context-enhanced text features, this module designs a dedicated encoder for the text modality, constructed based on pre-trained BERT and processing token embeddings with a linear mapping; in addition, the input of BERT also includes position embeddings. The encoder finally computes the text features of the embedded space.
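A minimal sketch of such a text encoder is given below, assuming the publicly available bert-base-uncased checkpoint and an additional linear mapping over the token-level outputs; these choices are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class TextEncoder(nn.Module):
    def __init__(self, embed_dim=768):
        super().__init__()
        # Position embeddings are handled internally by the pre-trained BERT model.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Linear mapping applied to the token-level outputs, as described above.
        self.proj = nn.Linear(self.bert.config.hidden_size, embed_dim)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.proj(out.last_hidden_state)     # (B, L, D) word-level text features

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["a sample social media post"], return_tensors="pt", padding=True)
features = TextEncoder()(batch["input_ids"], batch["attention_mask"])
```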
In S2, an image-text pair dataset for the cross-modal semantic alignment module is first constructed based on the original dataset, containing m positive pairs and correspondingly m(m-1) negative pairs. The dataset is then input into the multi-modal feature extraction module to obtain the visual features and text features in a D-dimensional embedded space. The patch-level visual features and word-level text features are then separately aggregated within the visual-semantic shared embedding space by a dedicated feature aggregation strategy, i.e., the mixed pooling expert framework. The framework comprises three parts: a routing gate module that routes different samples, an aggregation expert module that aggregates the fragment features of different samples into an overall vector, and a loss function module that optimizes semantic alignment and the load balance of the aggregation experts. A similarity score matrix between images and texts is computed through this framework, hard negative image-text pairs are then mined from the matrix, new image-text pairs are constructed, and the corresponding image-text pairs are classified.
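To make the mining step concrete, the following sketch computes an image-text cosine similarity matrix over a batch of matched pairs and picks the hardest negative for each image and each text; this is an assumed implementation of the mining step, not the invention's exact procedure.

```python
import torch
import torch.nn.functional as F

def mine_hard_negatives(img_emb, txt_emb):
    """img_emb, txt_emb: (m, D) aggregated embeddings of m matched image-text pairs."""
    sim = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).T   # (m, m) cosine scores
    mask = torch.eye(sim.size(0), dtype=torch.bool)                       # positives on the diagonal
    neg_sim = sim.masked_fill(mask, float("-inf"))                        # exclude matched pairs
    hard_txt_for_img = neg_sim.argmax(dim=1)    # hardest negative text for each image
    hard_img_for_txt = neg_sim.argmax(dim=0)    # hardest negative image for each text
    return sim, hard_txt_for_img, hard_img_for_txt
```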
In S3, for the given semantically aligned visual modality features G and text modality features T, the module first calculates the inter-modal attention weights based on the relations among the single-modality embedded representations. To strengthen the learning of correlation, an activation function ReLU(x) = max{x, 0} and an adjustable factor ε are introduced to keep the weight scores with high correlation, so that inter-modal features with low correlation are discarded. The framework of this module is shown in fig. 1.
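A possible PyTorch sketch of this fusion step is shown below. The direction in which the adjustable factor ε enters the ReLU (added here, following the wording above) and the mean-pooling before concatenation are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Sketch: inter-modal attention scores pass through ReLU together with an
    adjustable factor eps so that low-correlation weights are zeroed out, then
    reweight the opposite modality before concatenation."""
    def __init__(self, dim, eps=0.1):
        super().__init__()
        self.eps = eps
        self.scale = dim ** -0.5

    def forward(self, G, T):                     # G: (B, Ng, D) visual, T: (B, Nt, D) text
        scores = torch.bmm(G, T.transpose(1, 2)) * self.scale          # (B, Ng, Nt) attention weights
        corre_T2G = F.relu(scores + self.eps)                          # text attention correlation score
        corre_G2T = F.relu(scores.transpose(1, 2) + self.eps)          # visual attention correlation score
        T_corre = torch.bmm(corre_T2G, T)        # (B, Ng, D): Corre_{T->G} applied to text features
        G_corre = torch.bmm(corre_G2T, G)        # (B, Nt, D): Corre_{G->T} applied to visual features
        # Pool and concatenate the two correlation features into the fused feature F.
        return torch.cat([T_corre.mean(dim=1), G_corre.mean(dim=1)], dim=-1)   # (B, 2D)
```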
In S4, the input of the current module consists of three parts: the aligned visual modality features and text modality features learned by the cross-modal semantic alignment module, and the cross-modal correlation fusion features derived from the cross-modal semantic fusion module. The authenticity identification classifier module assigns different weights to the semantically aligned single-modality features and the fusion features through a dedicated SE-ResNet attention module, constrains the weights through the feature distributions generated by a variational autoencoder (VAE), and finally obtains the features input into the classifier and the detection result. The structure of the authenticity identification classifier module is shown in fig. 2.
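The sketch below illustrates, under stated assumptions, the weighting part of this classifier head: an SE-style excitation that assigns a weight to each of the three feature groups, assuming all three have been projected to a common dimension. Layer sizes and names are hypothetical, and the VAE-based KL constraint sketched earlier is only referenced in a comment rather than reproduced.

```python
import torch
import torch.nn as nn

class SEWeighting(nn.Module):
    """SE-style excitation over the three input feature groups of the classifier."""
    def __init__(self, dim, groups=3, reduction=4):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(groups * dim, groups * dim // reduction), nn.ReLU(),
            nn.Linear(groups * dim // reduction, groups), nn.Sigmoid())

    def forward(self, feats):                      # feats: list of 3 tensors, each (B, D)
        stacked = torch.stack(feats, dim=1)        # (B, 3, D): visual, text, fusion features
        weights = self.excite(stacked.flatten(1))  # (B, 3): one weight per feature group
        weighted = stacked * weights.unsqueeze(-1)
        # The returned weights can additionally be constrained with the VAE-based
        # KL term sketched earlier; that coupling is not reproduced here.
        return weighted.flatten(1), weights        # classifier input, group weights
```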
The mixed pooling expert framework of the invention specifically comprises:
For the input image data and text data, the method makes full use of a self-attention mechanism based on intra-modal relations through the routing gate module and routes each sample to a suitable pooling expert for feature aggregation. Notably, only the detailed routing process of the visual branch is described here, since the routing principle of the text branch is the same. First, a set of region features G = {g_1, g_2, ..., g_n} is used as input to the routing gate strategy. Let Q, K and V denote the query, key and value vectors in the attention mechanism respectively, where n is the number of regions and d is the dimension of the vectors. Considering that the dot product between the query and the key may contain noise, this module proposes a gate mechanism to filter out unwanted information and retain useful information. The module first calculates gate masks for the query and key vectors; the model then uses the gate masks to denoise the query and key vectors, calculates the attention-weighted value vector attn, and converts the gate-attention feature attn into the routing gate representation vector Z using a fully connected layer. Finally, each picture G and each text T is routed, based on the normalized distribution over the m pooling experts (also referred to as aggregation operators), to the pooling expert of the corresponding modality with the highest probability.
In the routing gate module, considering that the dot product between the query and key may contain noise, a gate mechanism is proposed to filter out unwanted information and retain useful information. First, the module calculates the gate masks for the query and key vectors, and the calculation process can be expressed as:
where Q ⊙ K denotes the element-wise multiplication of the query and key vectors. The gate masks M_Q and M_K for Q and K are then generated by two fully connected layers followed by a sigmoid activation function, where σ denotes the sigmoid operation.
The model then uses the gate masks to denoise the query and key vectors, calculates the attention-weighted value vector attn, and converts the gate-attention feature attn into the routing gate representation vector Z using a fully connected layer. The calculation process can be expressed as:
Z = attn · W_attn + b_attn
where a softmax function is applied to each row to obtain the normalized routing distribution, and m denotes the number of pooling experts.
Finally, each picture G and each text T is routed, based on the normalized distribution over the m pooling experts (also referred to as aggregation operators), to the pooling expert of the corresponding modality with the highest probability, where the probability that G or T is routed to expert i can be expressed as:
The routing probability of the optimal pooling expert for G can be calculated as p(G) = max{p_i(G)}. For the word-level text features T = {t_1, t_2, ..., t_s}, the routing probability of the optimal pooling expert for sentence T is likewise p(T) = max{p_i(T)}.
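The routing gate computation described above can be sketched as follows; the mean pooling over regions before the final projection and the specific layer shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RoutingGate(nn.Module):
    """Route each sample to one of m pooling experts via gated self-attention."""
    def __init__(self, dim, num_experts):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.gate_q = nn.Linear(dim, dim)        # produces the gate mask M_Q
        self.gate_k = nn.Linear(dim, dim)        # produces the gate mask M_K
        self.to_z = nn.Linear(dim, num_experts)  # fully connected layer: attn feature -> Z

    def forward(self, feats):                               # feats: (B, n, d) region/word features
        Q, K, V = self.q(feats), self.k(feats), self.v(feats)
        qk = Q * K                                          # element-wise product of query and key
        M_Q, M_K = torch.sigmoid(self.gate_q(qk)), torch.sigmoid(self.gate_k(qk))
        Q, K = Q * M_Q, K * M_K                             # denoise query and key with the gate masks
        attn = torch.softmax(Q @ K.transpose(1, 2) / K.size(-1) ** 0.5, dim=-1) @ V
        Z = self.to_z(attn.mean(dim=1))                     # (B, m) routing gate representation
        probs = torch.softmax(Z, dim=-1)                    # normalized distribution over m experts
        return probs, probs.argmax(dim=-1)                  # p_i and the chosen expert index
```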
In order to aggregate the visual and textual fragment features in the visual-semantic shared space into overall embedding vectors, so that the similarity between the visual and text modalities can be measured, the method designs a dedicated aggregation expert module. The module consists of m pooling experts, where each expert pools the pictures and texts routed to it based on the same model structure. Specifically, the module first sorts the single-modality fragment features along the feature dimension to be aggregated, then assigns weights to the sorted features based on a certain mechanism, computes the weighted sum of the features to obtain the corresponding overall vector, and finally multiplies the overall vector by the probability value calculated by the routing gate module to obtain the final aggregated vector. For an input with n region features, the selected pooling expert aggregates them into a fixed-length vector. Let max_k(·) denote a function that extracts the first k values from the sorted feature list; the coefficient θ_k is the weight corresponding to the k-th highest value after sorting by feature value along the dimension to be aggregated (i.e., the region dimension of the feature), with 1 ≤ k ≤ n; the weight values form a normalized distribution over the n regions, and p(G) denotes the routing probability of picture G calculated by the routing gate module. The pooling expert derives the values of θ_k through a dedicated computational mechanism, and an encoder-decoder architecture is used here to achieve this goal. Specifically, the structure comprises two parts: a position encoder, a position encoding function based on trigonometric functions, which encodes the region positions of the features to obtain corresponding vectorized representations, and a position decoder, based on a BiGRU sequence model, which processes the vectors obtained by the encoder and generates the pooling coefficients.
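A simplified sketch of one such pooling expert is given below: it sorts the fragment features per dimension, generates coefficients θ_k with a sinusoidal position encoder and a BiGRU decoder, and scales the weighted sum by the routing probability. Using all n ranks instead of only the top k, and the specific layer sizes, are simplifying assumptions.

```python
import torch
import torch.nn as nn

class PoolingExpert(nn.Module):
    """Sketch of one aggregation expert with a position encoder-decoder for theta_k."""
    def __init__(self, num_regions, pos_dim=32):
        super().__init__()
        self.register_buffer("pos_enc", self._sinusoid(num_regions, pos_dim))
        self.decoder = nn.GRU(pos_dim, pos_dim, bidirectional=True, batch_first=True)
        self.to_theta = nn.Linear(2 * pos_dim, 1)

    @staticmethod
    def _sinusoid(n, d):
        pos = torch.arange(n).unsqueeze(1).float()
        i = torch.arange(d // 2).float()
        angles = pos / torch.pow(10000.0, 2 * i / d)
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)   # (n, d)

    def forward(self, feats, route_prob):          # feats: (B, n, D), route_prob: (B,)
        sorted_feats, _ = feats.sort(dim=1, descending=True)   # rank values per feature dimension
        pos = self.pos_enc.unsqueeze(0).expand(feats.size(0), -1, -1).contiguous()
        h, _ = self.decoder(pos)                               # BiGRU over position encodings
        theta = torch.softmax(self.to_theta(h), dim=1)         # (B, n, 1) pooling coefficients
        pooled = (theta * sorted_feats).sum(dim=1)             # (B, D) weighted sum over ranks
        return pooled * route_prob.unsqueeze(-1)               # scale by the routing probability
```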
To constrain the process of cross-modal semantic alignment, i.e., to keep a higher similarity score between semantically similar samples and a lower score between semantically unrelated samples, a dedicated loss function module containing two loss functions is designed here. The first is a bidirectional triplet ranking loss that optimizes the distance between cross-modal semantic features, and the second is a load balancing loss that balances the load of samples routed to each pooling expert.
The bidirectional triplet ranking loss is the semantic alignment objective function, while the load balancing loss realizes balanced loads across the routing experts. Given m experts and a batch of samples, the load balancing loss function can be expressed as:
where S_i denotes the proportion of samples assigned to expert i and P_i denotes the proportion of routing probability assigned to expert i.
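Assuming a Switch-Transformer-style auxiliary loss built from S_i and P_i as defined above, a minimal sketch is given below; whether the invention uses exactly this form is an assumption.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(route_probs, expert_index, num_experts):
    """route_probs: (B, m) softmax routing distribution; expert_index: (B,) chosen expert."""
    one_hot = F.one_hot(expert_index, num_experts).float()
    S = one_hot.mean(dim=0)       # S_i: fraction of samples actually routed to expert i
    P = route_probs.mean(dim=0)   # P_i: average routing probability mass on expert i
    return num_experts * torch.sum(S * P)
```

This form is minimized when both the sample fractions and the average routing probabilities are spread evenly across the m experts, which is why it encourages balanced expert loads.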
The routing gate module can adaptively assign each sample to an appropriate pooling operator based on differences in intra-modal relations, while the aggregation expert module combines multiple pooling experts to aggregate the fragment features of different samples, and the additional auxiliary load balancing objective function promotes load balance among the pooling experts. The framework is realized under the visual-semantic shared space paradigm, is applicable to various image and text encoders, and can promote cross-modal semantic alignment in different multi-modal tasks in a plug-and-play manner.
The documents cited in the present invention are as follows:
[1] Singhal S, Shah R R, Chakraborty T, et al. SpotFake: A multi-modal framework for fake news detection[C]//2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM). IEEE, 2019: 39-47.
[2] Kenton J D M W C, Toutanova L K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]//Proceedings of NAACL-HLT. 2019: 4171-4186.
[3] Deng J, Dong W, Socher R, et al. ImageNet: A large-scale hierarchical image database[C]//2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009: 248-255.
[4] Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition[C]//Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
[5] Singhal S, Kabra A, Sharma M, et al. SpotFake+: A multimodal framework for fake news detection via transfer learning (student abstract)[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2020, 34(10): 13915-13916.
[6] Yang Z, Dai Z, Yang Y, et al. XLNet: Generalized autoregressive pretraining for language understanding[J]. Advances in Neural Information Processing Systems, 2019, 32.
[7] Singh P, Srivastava R, Rana K P S, et al. SEMI-FND: Stacked ensemble based multimodal inferencing framework for faster fake news detection[J]. Expert Systems with Applications, 2023, 215: 119302.
[8] Clark K, Luong M T, Le Q V, Manning C D. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators[C]//Proceedings of the 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 2020.
[9] Liu Jinshuo, Feng Kuo, Pan J Z. MSRD: Multi-modal network rumor detection method[J]. Journal of Computer Research and Development, 2020, 57(11): 2328-2336.
[10] Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 4700-4708.
[11] Zhang G, Giachanou A, Rosso P. SceneFND: Multimodal fake news detection by modelling scene context information[J]. Journal of Information Science, 2024, 50(2): 355-367.
[12] Xue J, Wang Y, Tian Y, et al. Detecting fake news by exploring the consistency of multimodal data[J]. Information Processing & Management, 2021, 58(5): 102610.
[13] Huang N E, Shen Z, Long S R, et al. The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis[J]. Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 1998, 454(1971): 903-995.
[14] Segura-Bedmar I, Alonso-Bartolome S. Multimodal fake news detection[J]. Information, 2022, 13(6): 284.
[15] Xiong S, Zhang G, Batra V, et al. TRIMOON: Two-round inconsistency-based multi-modal fusion network for fake news detection[J]. Information Fusion, 2023, 93: 150-158.
[16] Jin Z, Cao J, Guo H, et al. Multimodal fusion with recurrent neural networks for rumor detection on microblogs[C]//Proceedings of the 25th ACM International Conference on Multimedia. 2017: 795-816.
[17] Wu Y, Zhan P, Zhang Y, et al. Multimodal fusion with co-attention networks for fake news detection[C]//Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021: 2560-2569.
[18] Müller-Budack E, Theiner J, Diering S, et al. Multimodal analytics for real-world news using measures of cross-modal entity consistency[C]//Proceedings of the 2020 International Conference on Multimedia Retrieval. 2020: 16-25.
[19] Zhang W, Gui L, He Y. Supervised contrastive learning for multimodal unreliable news detection in COVID-19 pandemic[C]//Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2021: 3637-3641.
[20] Shang L, Kou Z, Zhang Y, et al. A duo-generative approach to explainable multimodal COVID-19 misinformation detection[C]//Proceedings of the ACM Web Conference 2022. 2022: 3623-3631.
[21] Zhou X, Wu J, Zafarani R. Similarity-Aware Multi-modal Fake News Detection[C]//Pacific-Asia Conference on Knowledge Discovery and Data Mining. Cham: Springer International Publishing, 2020: 354-367.
[22] Khattar D, Goud J S, Gupta M, et al. MVAE: Multimodal variational autoencoder for fake news detection[C]//The World Wide Web Conference. 2019: 2915-2921.
[23] Wang Y, Ma F, Jin Z, et al. EANN: Event adversarial neural networks for multi-modal fake news detection[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018: 849-857.
[24] Wei Z, Pan H, Qiao L, et al. Cross-modal knowledge distillation in multi-modal fake news detection[C]//ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022: 4733-4737.
[25] Singhal S, Dhawan M, Shah R R, et al. Inter-modality discordance for multimodal fake news detection[C]//Proceedings of the 3rd ACM International Conference on Multimedia in Asia. 2021: 1-7.
The foregoing describes the preferred embodiments of the present invention in detail. It should be understood that numerous modifications and variations can be made in accordance with the concept of the invention by a person of ordinary skill in the art without undue burden. Therefore, all technical solutions that can be obtained by a person skilled in the art through logical analysis, reasoning or limited experimentation on the basis of the prior art and in accordance with the concept of the invention shall fall within the scope of protection defined by the claims.