CN118839293A - Social public opinion key feature extraction method based on multi-modal information fusion - Google Patents

Social public opinion key feature extraction method based on multi-modal information fusion
Download PDF

Info

Publication number
CN118839293A
CN118839293A
Authority
CN
China
Prior art keywords
information
text
image
audio
public opinion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410637659.1A
Other languages
Chinese (zh)
Inventor
张随雨
俞定国
王娇娇
汪哲
丁可
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Media and Communications
Original Assignee
Zhejiang University of Media and Communications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Media and Communications
Priority to CN202410637659.1A
Publication of CN118839293A
Pending legal status (Current)

Abstract

The invention discloses a social public opinion key feature extraction method based on multi-modal information fusion. The method uses web crawlers and database technology to collect and store social media data, applies general character recognition, general audio recognition and general information extraction technologies to extract features from multi-source information on the Internet, then fuses the multi-dimensional results with related algorithms to judge whether public opinion information is true or false, and finally uses general information extraction technology to extract the key features of the public opinion information, obtaining comprehensive and accurate social public opinion analysis results.

Description

Translated from Chinese
A method for extracting key features of social public opinion based on multimodal information fusion

Technical Field

The present invention relates to a method for extracting key features of social public opinion, and in particular to a method for extracting key features of social public opinion based on multimodal information fusion.

Background Art

In recent years, with the popularity of social media platforms such as Douyin, Xiaohongshu, and Bilibili, people increasingly rely on social media to obtain information and express opinions. Public opinion information on current social media platforms is huge in scale and diverse in form, covering multimodal data such as text, speech, images, and video. These data contain rich public opinion information, provide users with the freshest and most intuitive news reports, help people quickly grasp the overview and impact of news events, and also help governments, enterprises, and research institutions understand the public's views, emotions, and behaviors, thereby guiding decision-making, improving products or services, and supporting market analysis.

In modern society, the role and power of public opinion keep growing, and public opinion has become an important factor affecting people's thoughts and behaviors. However, the development of the Internet has made the measurement and analysis of public opinion increasingly complicated. Public opinion information lacks orderly knowledge organization, a large amount of valuable, decision-oriented information is often ignored, and some false public opinion information can alter the direction in which a public opinion event develops.

Moreover, the information on social media is extremely rich and diverse, and it is difficult to obtain useful information from the massive amount of text, images, audio, and video. An efficient method is therefore needed to automatically extract key features from social media for more accurate public opinion analysis and modeling. A public opinion key feature extraction method based on multimodal information fusion can integrate information from different sources, including multimodal data such as text, images, and video, and transform these data into useful high-level knowledge through a variety of deep learning techniques, thereby providing more reliable material support for public opinion network modeling, public opinion analysis, and related tasks.

Existing multimodal information extraction solutions fall mainly into traditional multimodal data fusion methods and neural-network-based multimodal data fusion methods.

Traditional multimodal fusion integrates multimodal information by manual means to construct new feature representations. Since the sampling rates and feature dimensions of different modalities may differ, preprocessing and synchronization are required; for example, the numbers of audio and video frames within the same time span may not match. Traditional fusion algorithms rely mainly on manual fusion and classical machine learning algorithms to process multimodal data.

With the development of deep learning, neural-network-based methods have gradually become the mainstream approach to multimodal data fusion. Deep neural networks can automatically extract high-level semantic representations from low-level features and are therefore an effective tool for multimodal data processing. A common practice in neural-network-based methods is to feed the data of different modalities into different network branches and then fuse the features of each branch to obtain a richer feature representation.

Although some current technologies have achieved results, they still have defects, such as:

1. Extracting information based on manual rules requires a large amount of manpower and constitutes a bottleneck for knowledge acquisition in certain domains;

2. Neural-network-based methods require a large number of training samples to extract the specified information well; when samples are scarce, model performance may not reach an ideal level;

3. Current information extraction methods target only single-source information such as images or text and lack the extraction and correlated fusion of the multimodal information in media data, which makes it impossible to quickly and comprehensively cover all the people, events, and objects in a public opinion event;

4. With the development of the Internet, more and more new types of data are pouring in, such as regional dialects and various forms of Internet slang; traditional single-modality information extraction cannot adapt to and solve these newly emerging problems;

5. Media resources on the Internet are diverse; existing feature extraction models may perform well in specific domains but have limited effect in others.

Summary of the Invention

To solve the above problems, the present invention provides a method for extracting key features of social public opinion based on multimodal information fusion. The method uses web crawlers and database technology to collect and store social media data, applies general text recognition, general audio recognition, and general information extraction technologies to extract features from multi-source information on the Internet, then uses related algorithms to fuse the multi-dimensional results and judge whether the public opinion information is true or false, and finally uses general information extraction technology to extract the key features of the public opinion information, obtaining comprehensive and accurate social public opinion analysis results.

The object of the present invention is achieved through the following technical solutions:

A method for extracting key features of social public opinion based on multimodal information fusion, comprising the following steps:

Step 1: Use web crawler technology to preliminarily collect public opinion information and, based on the preliminarily collected content, collect social public opinion data to obtain text, images, video, and audio;

Step 2: For the collected images, first perform similarity detection and delete redundant data, then use an image captioning model to extract the Chinese description of each image and perform text recognition on each image, obtaining the image description information and the text recognition information;

Step 3: Use general speech recognition technology to convert the collected audio into text information;

Step 4: For the collected video, extract the audio track, extract the text information of that audio, and capture frames from the video to obtain the text information of those images;

Step 5: Use a pre-trained BERT model as the feature extractor for the text modality and perform multimodal fusion on the text obtained in step 1, the image description and text recognition information obtained in step 2, the text information obtained in step 3, and the text information of the video obtained in step 4; after fusion, obtain the text semantic feature representations of the text, image, audio, and video, and perform global feature extraction on these semantic features to obtain the fused feature vector of text, image, audio, and video;

Step 6: Use softmax classification to map the fused feature vector of text, image, audio, and video into the two target spaces of true and false, judge whether the social public opinion is true or false, and obtain the probability and the prediction result;

Step 7: Based on the fused feature vector of text, image, audio, and video obtained in step 5, use the UIE model for information extraction and, combined with the prediction result of step 6, obtain the key information in the multimodal public opinion data.

In step 1, the preliminarily collected content includes the title, release time, number of likes, number of comments, number of reposts, and number of views. Specifically:

1.1) For several mainstream social media platforms such as Douyin, Xiaohongshu, Bilibili, and Weibo, web crawler technology is used to preliminarily collect public opinion information, including the title, release time (t2), number of likes (c1), number of comments (c2), number of reposts (c3), and number of views (c4).

1.2) The heat value H is calculated by the following formula, the heat values are sorted, and the public opinion content with high heat values is further collected, stored, and analyzed; the calculation is shown in formula (1), where t1 denotes the current time and w1-w4 are the weighting coefficients of the corresponding interaction counts.

H = (w1×c1 + w2×c2 + w3×c3 + w4×c4) / (t1 - t2)   (1)
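By way of illustration, the heat-value ranking of formula (1) could be sketched in Python as below; the concrete weights, field names, and top-k cutoff are assumptions of this example and are not prescribed by the invention.

```python
import time

# Assumed weights w1-w4 for likes, comments, reposts and views; the patent leaves them unspecified.
WEIGHTS = (0.3, 0.3, 0.2, 0.2)

def heat_value(post, now=None):
    """Formula (1): H = (w1*c1 + w2*c2 + w3*c3 + w4*c4) / (t1 - t2) for one crawled post.

    `post` is assumed to be a dict with keys 'likes', 'comments', 'reposts', 'views'
    and 'publish_ts' (a Unix timestamp, t2); t1 is the current time.
    """
    t1 = now if now is not None else time.time()
    counts = (post["likes"], post["comments"], post["reposts"], post["views"])
    weighted = sum(w * c for w, c in zip(WEIGHTS, counts))
    return weighted / max(t1 - post["publish_ts"], 1.0)  # guard against zero or negative age

def rank_by_heat(posts, top_k=100):
    """Sort crawled posts by heat value and keep the hottest ones for deeper collection."""
    return sorted(posts, key=heat_value, reverse=True)[:top_k]
```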

In step 2, similarity detection is first performed on the collected images and redundant data is deleted, specifically including:

2.1) Apply the DCT transform to each image;

In step 2.1), the DCT transform is given by formula (2):

F(u,v) = c(u)c(v) Σ(x=0..M-1) Σ(y=0..N-1) f(x,y) cos[(2x+1)uπ/(2M)] cos[(2y+1)vπ/(2N)]   (2)

where x and y are the spatial sampling positions; u and v are the frequency sampling positions; c(u) and c(v) are the transform coefficients; f(x,y) is the original two-dimensional signal; M and N are the numbers of rows and columns of the image matrix; and F(u,v) is the frequency-domain coefficient of the transformed image;

2.2) Take the matrix of the DCT-transformed image information, which represents the picture, compute the mean of all elements in the matrix, compare each element with the mean in turn, and assign 1 or 0 according to the comparison result, thereby generating a hash value for the image data;

2.3) Use the k-means algorithm to cluster the hash values of all the image data from step 2.2) into k clusters;

2.4) Use the Hamming distance to compute the distance between each pair of images within the same cluster, as shown in formula (3):

d(x,y) = Σ(n=1..N) (xn ⊕ yn)   (3)

where xn and yn denote the n-th bits of two N-bit strings and ⊕ denotes the XOR operation;

2.5) The Hamming distance computed in step 2.4) serves as the similarity score of the two images; a threshold is set according to the actual characteristics of the images, and for any pair of images whose similarity score exceeds the threshold, the duplicate image is deleted; each image finally retained is denoted Ii.
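A minimal Python sketch of the similarity-detection procedure of steps 2.1)-2.5) is given below; the 32×32 resize, the 8×8 low-frequency block, the number of clusters, and the Hamming threshold are illustrative assumptions, and the sketch treats a small Hamming distance as meaning the two images are duplicates.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def phash_bits(path, hash_size=8):
    """Steps 2.1)-2.2): DCT-based perceptual hash, returned as a 0/1 bit vector."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (32, 32)).astype(np.float32)
    dct = cv2.dct(img)                                      # formula (2)
    low = dct[:hash_size, :hash_size]                       # keep low-frequency coefficients
    return (low > low.mean()).astype(np.uint8).flatten()    # compare each element with the mean

def hamming(a, b):
    """Formula (3): number of differing bits between two hash vectors."""
    return int(np.count_nonzero(a != b))

def deduplicate(paths, k=4, threshold=5):
    """Steps 2.3)-2.5): cluster the hashes with k-means, then drop near-duplicates
    inside each cluster (here a Hamming distance <= threshold is treated as duplicate)."""
    hashes = np.array([phash_bits(p) for p in paths])
    labels = KMeans(n_clusters=min(k, len(paths)), n_init=10).fit_predict(hashes)
    kept = []
    for i in range(len(paths)):
        duplicates = [j for j in kept
                      if labels[j] == labels[i] and hamming(hashes[i], hashes[j]) <= threshold]
        if not duplicates:
            kept.append(i)
    return [paths[i] for i in kept]
```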

In step 2, an image captioning model is used to extract the Chinese description of each image and text recognition is performed on each image, obtaining the image description information and the text recognition information, specifically including:

2.5) A VIT-GPT2-based image captioning pre-trained model extracts the Chinese description of each image, and the descriptions are concatenated according to the formula:

Dgpt = concat(GPT(I1), ..., GPT(Ii))   (4)

where GPT(Ii) denotes the description information of the i-th image Ii and Dgpt denotes the concatenation of the description information of the i images, used as the image description information;

2.6) Image text recognition consists of two parts, text detection and text recognition. Text detection adopts the segmentation-based DB text detection algorithm; a perspective transformation corrects tilted candidate boxes into horizontal rectangles, and the image is segmented into text fragments according to the text positions. The cropped image regions are fed into the recognition network to recognize the text, and the text recognition results of the images are concatenated, denoted as

Docr = concat(OCR(I1), ..., OCR(Ii))   (5)

where OCR(Ii) denotes the text recognition information of the i-th image and Docr denotes the concatenation of the text recognition information of the i images, used as the text recognition information.
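A sketch of how formulas (4) and (5) might be realised with off-the-shelf components is shown below; the specific checkpoints (the 'nlpconnect/vit-gpt2-image-captioning' captioner, which outputs English captions, and PaddleOCR as a DB-based detector plus recognizer) are assumptions chosen for illustration, and a Chinese-capable captioning model would be substituted in practice.

```python
from transformers import pipeline
from paddleocr import PaddleOCR

# Assumed checkpoints; any VIT-GPT2 captioner / DB-based OCR pipeline could be used instead.
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
ocr_engine = PaddleOCR(use_angle_cls=True, lang="ch")  # DB text detection + recognition

def describe_images(image_paths):
    """Formula (4): concatenate the caption GPT(Ii) of every retained image Ii."""
    captions = [captioner(p)[0]["generated_text"] for p in image_paths]
    return " ".join(captions)                        # Dgpt

def recognize_text(image_paths):
    """Formula (5): concatenate the OCR result OCR(Ii) of every retained image Ii."""
    pieces = []
    for p in image_paths:
        result = ocr_engine.ocr(p, cls=True)
        for line in (result[0] or []):
            pieces.append(line[1][0])                # recognized text of one fragment
    return " ".join(pieces)                          # Docr
```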

In step 3, general speech recognition technology is used to convert the collected audio into text information, specifically:

The iFlytek speech transcription interface is used to convert the collected audio into text information, denoted as

Daudio = audio_to_text(A)   (6)

where A denotes the collected audio and Daudio denotes the text information converted from the audio.

In step 4, for the collected video, the audio is extracted from the video, the text information of that audio is extracted, and frames of the video are captured to obtain the text information of those images, specifically including:

4.1) Extract the audio from the video, convert it into wav or mp3 format, and convert the audio data into text information by the speech recognition method; the text converted from the video's audio is denoted Dvaudio;

4.2) Use the FFmpeg tool in Python to capture frames from the collected video at fixed intervals and delete duplicate or similar frames from the captured images, obtaining the deduplicated images;

4.3) Use the VIT-GPT2 image captioning pre-trained model to describe the deduplicated images and output the scene description text of the images, denoted Dvgpt;

4.4) Use the general text recognition method to perform text recognition on the deduplicated images; the recognized text is denoted Dvocr;

4.5) Dvgpt, Dvocr, and Dvaudio together form the text information of the video.
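The audio and frame extraction of steps 4.1) and 4.2) could be sketched with FFmpeg roughly as below; the 16 kHz wav settings and the one-frame-every-five-seconds sampling are assumptions of the example.

```python
import subprocess

def extract_audio(video_path, wav_path="audio.wav"):
    """Step 4.1): pull the audio track out of the video as a wav file for speech recognition."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-acodec", "pcm_s16le", "-ar", "16000", wav_path],
        check=True,
    )
    return wav_path

def extract_frames(video_path, out_pattern="frame_%04d.png", every_n_seconds=5):
    """Step 4.2): grab one frame every few seconds; the captured frames are then
    deduplicated with the same perceptual-hash procedure as in step 2."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vf", f"fps=1/{every_n_seconds}", out_pattern],
        check=True,
    )
```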

In step 5, a pre-trained BERT model is used as the feature extractor for the text modality; the text obtained in step 1, the image description and text recognition information obtained in step 2, the text information obtained in step 3, and the text information of the video obtained in step 4 are fused across modalities; after fusion, the text semantic feature representations of the text, image, audio, and video are obtained, and global feature extraction is performed on these semantic features to obtain the fused feature vector of text, image, audio, and video, specifically including:

5.1) Take the text S obtained in step 1 as input and use the pre-trained BERT model as the feature extractor for the text modality: first feed the text S into the BERT tokenizer, feed the resulting token list T into the BERT model to obtain the embedding vectors W, and feed W into a masked attention network, which computes a weight for each token and applies the weighting, finally yielding the text semantic feature e1 of the text. The formulas are as follows:

T = BertTokenizer(S)   (7)

W = BERT(T)   (8)

e1 = Mask_Attention(W)   (9)
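A sketch of the feature extraction of 5.1), which 5.2)-5.4) reuse for the other modalities, is given below; the Chinese BERT checkpoint is an assumption, and the attention-mask-weighted mean pooling is only a simplified stand-in for the masked attention network of formula (9).

```python
import torch
from transformers import BertTokenizer, BertModel

# Assumed checkpoint; the patent only specifies "a pre-trained BERT model".
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def text_semantic_feature(text):
    """Formulas (7)-(9): S -> token list T -> embeddings W -> weighted semantic feature."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)  # T
    with torch.no_grad():
        outputs = bert(**inputs)                        # W: contextual token embeddings
    hidden = outputs.last_hidden_state                  # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)       # ignore padding positions
    e = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # stand-in for Mask_Attention(W)
    return e                                            # e1, ei, ev or ea, depending on the input text
```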

5.2) The information Di obtained by concatenating the texts Dgpt and Docr from step 2 is denoted:

Di = concat(Dgpt, Docr)   (10)

The concatenated text is input into the BERT model to obtain the text semantic feature ei of the image, following the same principle as 5.1): the text Di is fed into the BERT tokenizer, the token list Ti is fed into the BERT model to obtain the embedding vectors Wi, and Wi is fed into the masked attention network, which computes a weight for each token and applies the weighting, finally yielding the text semantic feature ei of the image, as expressed in formulas (11)-(13):

Ti = BertTokenizer(Di)   (11)

Wi = BERT(Ti)   (12)

ei = Mask_Attention(Wi)   (13)

5.3) The information obtained by concatenating the texts Dvgpt, Dvocr, and Dvaudio from step 4 is denoted Dv:

Dv = concat(Dvgpt, Dvocr, Dvaudio)   (14)

The concatenated text Dv is input into the BERT model to obtain the text semantic feature ev of the video, following the same principle as 5.1): the text Dv is fed into the BERT tokenizer, the token list Tv is fed into the BERT model to obtain the embedding vectors Wv, and Wv is fed into the masked attention network, which computes a weight for each token and applies the weighting, finally yielding the text semantic feature ev of the video.

The formulas are expressed as (15)-(17):

Tv = BertTokenizer(Dv)   (15)

Wv = BERT(Tv)   (16)

ev = Mask_Attention(Wv)   (17)

5.4) Take the text Daudio obtained in step 3 as input and use the pre-trained BERT model as the feature extractor for the text modality to obtain the text semantic feature ea of the audio, following the same principle as 5.1): the text Daudio is fed into the BERT tokenizer, the token list Ta is fed into the BERT model to obtain the embedding vectors Wa, and Wa is fed into the masked attention network, which computes a weight for each token and applies the weighting, finally yielding the text semantic feature ea of the audio. The formulas are as follows:

Ta = BertTokenizer(Daudio)   (18)

Wa = BERT(Ta)   (19)

ea = Mask_Attention(Wa)   (20)

5.5) The text semantic feature e1 of the text, the text semantic feature ei of the image, the text semantic feature ev of the video, and the text semantic feature ea of the audio are concatenated, and a bidirectional long short-term memory (BiLSTM) network is used to perform global feature extraction on the concatenated features, obtaining the fused feature vector e.
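A minimal sketch of step 5.5) is shown below; treating each modality feature as one step of a length-four sequence, as well as the hidden size and the mean pooling over the BiLSTM outputs, are assumptions of this example, since the text only states that the concatenated features are globally encoded with a BiLSTM.

```python
import torch
import torch.nn as nn

class FusionBiLSTM(nn.Module):
    """Step 5.5): run a BiLSTM over the concatenated modality features to get the fused vector e."""

    def __init__(self, feat_dim=768, hidden_dim=256):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, e_text, e_image, e_video, e_audio):
        # (batch, 4, feat_dim): one "time step" per modality feature e1, ei, ev, ea
        seq = torch.stack([e_text, e_image, e_video, e_audio], dim=1)
        out, _ = self.bilstm(seq)           # (batch, 4, 2 * hidden_dim)
        return out.mean(dim=1)              # fused feature vector e, here of size 512
```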

In step 6, the probability and the prediction result of whether the public opinion is true or false are obtained as follows:

P = Softmax(w·e + b)   (21)

where P is the probability that the public opinion is true or false, w is a weight parameter, e is the fused feature vector of text, image, audio, and video, and b is a bias term; the final prediction result is obtained according to a preset threshold t.
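A sketch of the classification of formula (21) and the thresholding step is given below; the fused dimension, the 0.5 default threshold, and the convention that index 1 stands for 'fake' are assumptions of the example.

```python
import torch
import torch.nn as nn

class RumorClassifier(nn.Module):
    """Step 6 / formula (21): P = softmax(w·e + b) over the two classes {real, fake}."""

    def __init__(self, fused_dim=512):
        super().__init__()
        self.linear = nn.Linear(fused_dim, 2)   # w and b

    def forward(self, e):
        return torch.softmax(self.linear(e), dim=-1)   # P

def predict(probabilities, threshold=0.5):
    """Apply the preset threshold t to P to obtain the final true/false prediction."""
    return (probabilities[..., 1] >= threshold).long()  # 1 = fake, 0 = real (assumed convention)
```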

Compared with the prior art, the present invention has the following advantages:

The present invention integrates recognition technologies for multiple general media data types such as text, images, audio, and video; it can cope with the new types of data constantly emerging on the Internet, such as regional dialects and various forms of Internet slang, and can comprehensively and accurately recognize the massive and complex public opinion information on social media. At the same time, the BERT pre-trained model and the BiLSTM model are used to fuse and extract multimodal information and to judge the authenticity of public opinion information, and general information extraction technology is used to extract the key features of the public opinion information, ensuring the stability and reliability of the results.

The innovation of the present invention is embodied in the following aspects:

1) The proposed method for extracting key features of social public opinion through multimodal information fusion is a comprehensive analysis method based on multiple general pre-trained models; it combines general text recognition, speech recognition, information extraction, and other techniques, can conduct comprehensive and in-depth analysis and mining of public opinion information on complex and diverse social media, and accurately extracts key features from massive data;

2) The proposed multimodal information fusion method can not only analyze public opinion information of a single modality, but can also simultaneously fuse information of multiple modalities such as text, image, audio, and video, thereby achieving more comprehensive and accurate public opinion analysis;

3) The proposed method for judging the authenticity of public opinion is based on deep learning algorithms, can quickly and accurately judge the authenticity of public opinion information on social media, and can be flexibly adjusted for different situations, ensuring the stability and reliability of the judgment results;

4) The proposed method for extracting key features of social public opinion is a full-process, full-factor analysis method that integrates data collection, data analysis, information fusion, and information extraction, providing comprehensive and accurate public opinion analysis services.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of the overall flow of the algorithm of the method for extracting key features of social public opinion based on multimodal information fusion according to the present invention;

FIG. 2 is an example diagram of the present invention.

Detailed Description

The method for extracting key features of social public opinion based on multimodal information fusion is further described below with reference to the accompanying drawings. As shown in FIG. 1, the method of the present invention includes the following steps:

Step 1: Use web crawler technology to preliminarily collect public opinion information, including the title, release time (t2), number of likes (c1), number of comments (c2), number of reposts (c3), and number of views (c4); based on the popularity factor, further collect text, images, video, and audio;

Step 2: For the collected image information, first perform similarity detection and delete redundant data, then use an image captioning model to extract the Chinese description of each image and perform text recognition on each image, obtaining the image description information and the text recognition information;

Step 3: Use the iFlytek speech transcription interface to convert the collected audio into text information;

Step 4: For the video, extract the audio, obtain the text information of the audio by the method in step 3, capture video frames, and obtain the text information of the images by the method in step 2;

Step 5: Use a pre-trained BERT model as the feature extractor for the text modality, fuse the multimodal features from steps 2, 3, and 4, and obtain the text semantic feature representations of the text, image, audio, and video;

Step 6: Use softmax classification to map the multimodal feature vector e into the two target spaces of true and false, and judge whether the public opinion information is true or false;

Step 7: Combined with the true/false judgment of step 6, use the UIE model to perform information extraction on the multimodally fused text information of step 5 and obtain the key information in the multimodal public opinion data.
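As an illustration of step 7, the key-information extraction could be sketched with the open-source UIE implementation in PaddleNLP as below; the extraction schema (persons, time, location, event) is an assumed example and would in practice be defined for the concrete public opinion scenario.

```python
from paddlenlp import Taskflow

# Assumed extraction schema; the concrete schema is chosen per application scenario.
schema = ["人物", "时间", "地点", "事件"]
uie = Taskflow("information_extraction", schema=schema)

def extract_key_features(fused_text, is_fake):
    """Step 7: run UIE over the multimodally fused text and attach the
    true/false prediction of step 6 to the extracted key information."""
    results = uie(fused_text)
    return {"prediction": "fake" if is_fake else "real", "key_information": results}
```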

The above is merely an illustration of examples of the present invention and is not a limitation of the present invention. A person of ordinary skill in the art should recognize that any change or modification made to the present invention falls within the protection scope of the present invention.

Claims (9)

CN202410637659.1A | 2024-05-22 | 2024-05-22 | Social public opinion key feature extraction method based on multi-modal information fusion | Pending | CN118839293A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410637659.1A (CN118839293A, en) | 2024-05-22 | 2024-05-22 | Social public opinion key feature extraction method based on multi-modal information fusion

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202410637659.1A (CN118839293A, en) | 2024-05-22 | 2024-05-22 | Social public opinion key feature extraction method based on multi-modal information fusion

Publications (1)

Publication Number | Publication Date
CN118839293A | 2024-10-25

Family

ID=93143211

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202410637659.1A (CN118839293A, Pending) | Social public opinion key feature extraction method based on multi-modal information fusion | 2024-05-22 | 2024-05-22

Country Status (1)

Country | Link
CN (1) | CN118839293A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN120448543A (en)* | 2025-04-25 | 2025-08-08 | 人民中科(北京)智能技术有限公司 | Multi-mode public opinion data intelligent classification grading method, device and equipment

Similar Documents

Publication | Publication Date | Title
CN108228915B (en)Video retrieval method based on deep learning
WO2022222850A1 (en)Multimedia content recognition method, related apparatus, device and storage medium
CN113052243B (en) Object detection method based on CycleGAN and conditional distribution adaptation
CN112966626B (en)Face recognition method and device
CN113591530A (en)Video detection method and device, electronic equipment and storage medium
CN114220145B (en) Face detection model generation method and device, forged face detection method and device
CN106778686A (en)A kind of copy video detecting method and system based on deep learning and graph theory
CN113327619B (en) A method and system for meeting recording based on cloud-edge collaboration architecture
WO2023060634A1 (en)Case concatenation method and apparatus based on cross-chapter event extraction, and related component
CN113222109A (en)Internet of things edge algorithm based on multi-source heterogeneous data aggregation technology
CN118839293A (en)Social public opinion key feature extraction method based on multi-modal information fusion
CN117292379A (en)Electric power violation semantic generation method based on scene graph technology
CN119723429B (en)False news video detection method based on knowledge enhancement graph attention network
CN113254634A (en)File classification method and system based on phase space
Das et al.Incorporating domain knowledge to improve topic segmentation of long mooc lecture videos
CN120145098A (en) A market violation detection method based on multimodal image and text fusion
CN113821681B (en) Video tag generation method, device and equipment
CN115269831A (en) Rumor detection method and system based on time series emotional features
Mathur et al.Utilizing VGG19 and ResNet50 Neural Network Models for Comprehensive Intelligent Learning and Analysis of Legal Precedents and Texts
CN118839253A (en)Cross-modal interaction complement aviation security event identification method
CN118212475A (en)Image subject classification method and system based on multi-mode fusion
CN118013451A (en)Multi-mode entity and relation extraction method and system
CN117033308A (en)Multi-mode retrieval method and device based on specific range
CN117009595A (en)Text paragraph acquisition method and device, storage medium and program product thereof
CN109034040B (en)Character recognition method, device, equipment and medium based on cast

Legal Events

PB01 | Publication
SE01 | Entry into force of request for substantive examination
