CN110297986A


Info

Publication number: CN110297986A
Application number: CN201910540279.5A
Authority: CN
Inventors: 徐建国; 蔺珍; 肖海峰; 韩青君
Original assignee: Shandong University of Science and Technology
Current assignee: Shandong University of Science and Technology
Priority date: 2019-06-21
Filing date: 2019-06-21
Publication date: 2019-10-01

Abstract

Translated from Chinese

The invention discloses a method for analyzing the sentiment tendency of Weibo (microblog) hot topics. Text is collected for a specified topic; the sentiment information carried by subjective evaluation words related to the hot topic is extracted; to improve the accuracy of multi-class sentiment classification of microblog text, a classification model based on SVM-BILSTM is proposed; finally, a sentiment tendency analysis is made from the preceding analysis and results. By collecting text for a specified topic, extracting sentiment information, and applying the multi-class sentiment classification model, the invention can respond in real time to the sentiment tendency of public opinion events and improves the speed and efficiency of that response.

Description


A Sentiment Tendency Analysis Method for Weibo Hot Topics

Technical Field

The invention belongs to the technical field of data processing and relates to a method for analyzing the sentiment tendency of Weibo hot topics.

Background

A sentiment tendency analysis model for Weibo hot topics mainly involves network information collection technology, the Chinese word segmentation and part-of-speech tagging methods used in data preprocessing, methods for text feature representation, feature extraction, and text classification, and the long short-term memory neural network algorithm from deep learning.

Network information collection technology is a computer technology that automatically collects data from the Internet according to defined rules. It usually starts from one or more initial URLs and sends crawl requests in HTTP format through various ports to collect the information in web pages [24]. This cycle repeats, traversing Internet information until all the data has been collected. The collection method differs by media type; public opinion most readily erupts in three kinds of media: news, Weibo, and forums.

Chinese word segmentation is the process of splitting a sequence of Chinese characters into individual words, which must be done before Chinese text can be analyzed. Part-of-speech tagging determines the grammatical role each word plays in a sentence. Text representation expresses a text string as a numerical vector that a computer can process: because a computer cannot process text strings directly, the feature words extracted from the text must be quantified or vectorized. Text categorization automatically assigns each text to a predefined class under a given classification scheme; its main data source is unstructured text, and the assignment is performed by a classifier.

LSTM (Long Short-Term Memory) is a kind of recurrent neural network (RNN) structure consisting of an input layer, a hidden layer, and an output layer. The LSTM model embeds the input layer and hidden layer of a traditional RNN in a memory cell. The memory cell contains special gate structures (an input gate, a forget gate, and an output gate) that control the flow of information: only information that meets the algorithm's criteria is retained, and the rest is discarded through the forget gate. The LSTM model involves more, and more complex, computation, so it processes information more flexibly and powerfully and is suited to processing and predicting events separated by relatively long intervals or delays in a time series. In practice, language has long-term dependencies, and a plain RNN is poor at capturing and retaining all earlier information; LSTM addresses this long-term dependency problem. LSTM has been applied in fields such as speech recognition, image recognition, and chatbot control.

Existing sentiment tendency analysis of Weibo hot topics generally lags behind events.

Summary of the Invention

The purpose of the invention is to provide a method for analyzing the sentiment tendency of Weibo hot topics. Its beneficial effect is that, by collecting text for a specified topic, extracting sentiment information, and applying a multi-class sentiment classification model, it can respond in real time to the sentiment tendency of public opinion events and improves the speed and efficiency of that response.

The technical solution adopted by the invention proceeds according to the following steps:

A. Data acquisition and preprocessing for the Weibo hot topic: text is collected for the specified topic.

B. Extraction of the sentiment information of subjective evaluation words related to the hot topic: to improve the quality of the extracted sentiment information, the Weibo sentiment information extraction model is improved by combining the TF-IDF-COS and SVM algorithms.

C. Sentiment classification: to improve the accuracy of multi-class sentiment classification of microblog text, a multi-class sentiment classification model based on SVM-BILSTM is proposed.

D. Finally, a sentiment tendency analysis is made from the preceding analysis and results.

Further, in step A, data acquisition and preprocessing means selecting a specific topic on Weibo, crawling the text of that topic from the Weibo platform with Python tools, and then preprocessing the collected semi-structured information to obtain a plain-text corpus, which is stored.
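
The cleaning part of step A could look roughly like the sketch below. It is only an illustration under stated assumptions: the regular expressions, the bracketed emoticon convention, and the function name `preprocess` are hypothetical and not taken from the patent, and word segmentation, part-of-speech tagging, and stop-word removal would need a Chinese segmenter that is not shown.

```python
import html
import re

# Hypothetical patterns; real crawled Weibo markup may differ.
TAG_RE = re.compile(r"<[^>]+>")                  # leftover HTML tags
URL_RE = re.compile(r"https?://\S+")             # links
MENTION_RE = re.compile(r"@[\w\-]+")             # @user mentions
EMOTICON_RE = re.compile(r"\[[^\[\]\s]{1,8}\]")  # bracketed emoticon codes


def preprocess(raw):
    """Turn one crawled, semi-structured post into (plain_text, emoticons)."""
    text = html.unescape(TAG_RE.sub("", raw))  # drop tags, decode entities
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    emoticons = EMOTICON_RE.findall(text)      # keep emoticons as features
    text = EMOTICON_RE.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()   # normalize whitespace
    return text, emoticons
```

Emoticons are returned separately rather than discarded because the description later treats them, along with punctuation, as sentiment features.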

Further, the TF-IDF-COS and SVM algorithms in step B are as follows:

The TF-IDF algorithm is combined with cosine similarity to compute the similarity between a text and the topic. The TF-IDF weight of word i and the cosine coefficient between word i and the hot topic word T(w) are computed, and the words most similar to the hot topic word are extracted. The SVM algorithm then separates topic-related texts from topic-irrelevant texts, yielding the Weibo texts related to the topic.

Term frequency (TF) reflects the number of times a word appears in a document and is calculated as follows:
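
The formula is not reproduced in this text. A reconstruction consistent with the symbol definitions given here (occurrence count divided by the total word count of the text) would be:

$$TF_{i,j} = \frac{n_{ij}}{n_j}$$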

Here, $w_i$ denotes the i-th word, $p_j$ the j-th text, $n_{ij}$ the number of times the i-th word occurs in the j-th text, and $n_j$ the total number of words in the j-th text.

Inverse document frequency (IDF) measures the importance of a word and describes how widely it is used. It is calculated as follows:
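
The formula is not reproduced in this text. Assuming the usual logarithmic form with the add-one smoothing of the denominator that the text describes (the base of the logarithm is not stated), it would read:

$$IDF_i = \log \frac{m}{m_i + 1}$$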

Here, $m$ is the total number of documents in the corpus and $m_i$ is the number of documents in the corpus that contain the word $w_i$. To prevent the denominator from being 0 when a rare word does not occur in the corpus, the IDF is smoothed by adding 1 to the denominator, so that words absent from the corpus also receive a usable IDF value.

TF-IDF = term frequency (TF) × inverse document frequency (IDF)

For text feature representation, each Weibo text is represented by the features of the words it contains. These features and their weights form a vector $(W_{1,j}, W_{2,j}, W_{3,j}, \ldots, W_{n,j})$, where $W_{i,j}$ is the weight of term i in Weibo text $D_j$, calculated as follows:

$$W_{i,j} = TF_{i,j} \times IDF_i \times COS_i$$
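
A minimal sketch of this weighting is given below. Several points are assumptions rather than details from the patent: the natural logarithm, the add-one smoothed IDF form, and especially the way a word is turned into a vector for the cosine coefficient (here, its per-document occurrence counts). All function names are hypothetical.

```python
import math


def tf(word, doc):
    """Term frequency n_ij / n_j for a tokenized document."""
    return doc.count(word) / len(doc) if doc else 0.0


def idf(word, corpus):
    """Smoothed IDF: log(m / (m_i + 1)); the +1 avoids a zero denominator."""
    m_i = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / (m_i + 1))


def occurrence_vector(word, corpus):
    """Assumption: represent a word by its count in each document."""
    return [doc.count(word) for doc in corpus]


def cosine(u, v):
    """Cosine coefficient between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weight(word, j, corpus, topic_word):
    """W_ij = TF_ij * IDF_i * COS_i for word i in document j."""
    cos = cosine(occurrence_vector(word, corpus),
                 occurrence_vector(topic_word, corpus))
    return tf(word, corpus[j]) * idf(word, corpus) * cos
```

With this smoothing, a word that occurs in every document gets a negative IDF, so an implementation may want to clip the value at zero; the text does not say how that case is handled.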

Further, the SVM-BILSTM algorithm in step C combines SVM with BILSTM. The SVM-BILSTM multi-class sentiment classification model for microblog text outputs six sentiment categories: extremely positive, relatively positive, positive, negative, relatively negative, and extremely negative.

The BILSTM computation can be expressed as

$$s_t = f(U x_t + W s_{t-1})$$

$$s'_t = f(U' x_t + W' s'_{t+1})$$

Here, U and U′, W and W′, and V and V′ are the distinct weight matrices of the BILSTM computation: U and W are the input-to-hidden and hidden-to-hidden weights of the forward pass, U′ and W′ are the corresponding weights of the backward pass, and V and V′ are the weights from the BILSTM hidden layers to the output layer.
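
The two recurrences can be traced with the small sketch below. It implements only the forward and backward state updates as written, not the gated LSTM cell. The choice of tanh for f, the zero initial states, and all function names are assumptions; the output combination through V and V′ is omitted because the text gives no formula for it.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def step(U, W, x, s):
    """One state update: f(U x + W s), with f = tanh (an assumption)."""
    return [math.tanh(a + b) for a, b in zip(matvec(U, x), matvec(W, s))]


def bidirectional_states(xs, U, W, U2, W2, hidden):
    """Return (forward, backward) state lists aligned with the inputs xs."""
    forward, s = [], [0.0] * hidden
    for x in xs:                  # s_t depends on s_{t-1}
        s = step(U, W, x, s)
        forward.append(s)
    backward, s = [], [0.0] * hidden
    for x in reversed(xs):        # s'_t depends on s'_{t+1}
        s = step(U2, W2, x, s)
        backward.append(s)
    backward.reverse()            # realign with time order
    return forward, backward
```

Each position then has one state summarizing the text to its left and one summarizing the text to its right, which is what lets a bidirectional model use context on both sides of a word.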

SVM finds, in the vector space containing the sample points, an optimal separating hyperplane that satisfies the classification requirements, separating samples of different classes while maximizing the classification margin. It is a generalized linear classifier that performs binary classification of data by supervised learning and relies mainly on the choice of kernel function. The commonly used kernel function is as follows:

$$K(x_i, y_i) = (x_i \cdot y_i)$$

Take the training sample set $T = \{(x_i, y_i)\}$, $i = 1, 2, \ldots, n$, where $x$ is the input vector, $y \in \{1, -1\}$, and $y_i$ is the class label of $x_i$. The hyperplane equation is:

$$\omega \cdot x_i + b = 0$$

where $\omega$ is the normal vector, which determines the orientation of the hyperplane, and $b$ is the displacement term, which determines the distance between the hyperplane and the origin. The resulting kernel function expansion over the training samples is:
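
The expansion is not reproduced in this text. Assuming the standard SVM decision function, which uses exactly the symbols listed next, it would read:

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{n} \alpha_i y_i \, k(x_i, x) + b\right)$$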

Here, $i = 1, 2, \ldots, n$; $x$ is the input vector; $y \in \{1, -1\}$; $y_i$ is the class label of $x_i$; $k$ is the kernel function; $b$ is the displacement term; and $\alpha$ is the Lagrange multiplier.
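
As an illustration of how these symbols fit together at prediction time, the sketch below evaluates the standard kernel expansion sign(Σ α_i y_i k(x_i, x) + b) with a linear kernel. The function names are hypothetical, and the multipliers are supplied by hand: finding them requires solving the SVM training problem, which is not shown.

```python
def linear_kernel(u, v):
    """Linear kernel: the inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def decide(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """Classify x as +1 or -1 from the kernel expansion over training samples."""
    total = sum(a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if total + b >= 0 else -1


# Toy values chosen by hand for illustration, not the result of training.
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
labels = [1, -1]
alphas = [0.5, 0.5]
print(decide([2.0, 0.0], support_vectors, labels, alphas, b=0.0))
```

Passing the kernel as a parameter keeps the decision rule unchanged if a non-linear kernel is substituted.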

Description of the Drawings

Fig. 1 is the overall framework diagram of the sentiment tendency analysis of Weibo hot topics according to the invention;

Fig. 2 is a diagram of the sentiment tendency analysis model for Weibo hot topics according to the invention;

Fig. 3 is a diagram of the improved sentiment information extraction model of the invention;

Fig. 4 is a diagram of the sentiment classification model based on SVM-BILSTM.

Detailed Description

The invention is described in detail below with reference to specific embodiments.

The sentiment tendency analysis method for Weibo hot topics according to the invention is as follows:

A. Data acquisition and preprocessing for the Weibo hot topic: text is collected for the specified topic. This means selecting a specific topic on Weibo, crawling the text of that topic from the Weibo platform with Python tools, and then preprocessing the collected semi-structured information to obtain a plain-text corpus, which is stored.

B. Extraction of the sentiment information of subjective evaluation words related to the hot topic: to improve the quality of the extracted sentiment information, the Weibo sentiment information extraction model is improved by combining the TF-IDF-COS and SVM algorithms. The TF-IDF-COS and SVM algorithms are as follows:

$$W_{i,j} = TF_{i,j} \times IDF_i \times COS_i$$

TF is the frequency with which a word appears in a document, and IDF reflects the inverse of the number of documents in which a given word appears.

C. Sentiment classification: to improve the accuracy of multi-class sentiment classification of microblog text, a multi-class sentiment classification model based on SVM-BILSTM is proposed. The SVM-BILSTM algorithm combines SVM with BILSTM, and the model outputs six sentiment categories: extremely positive, relatively positive, positive, negative, relatively negative, and extremely negative.

The final sentiment tendency conclusion is obtained by combining the results of steps C and D.

Fig. 1 is the overall framework diagram of the sentiment tendency analysis of Weibo hot topics according to the invention; Fig. 2 is a diagram of the sentiment tendency analysis model; Fig. 3 is a diagram of the improved sentiment information extraction model; Fig. 4 is a diagram of the sentiment classification model based on SVM-BILSTM. As the overall framework diagram shows, this embodiment divides the sentiment tendency analysis of Weibo text into four modules: data acquisition and preprocessing, sentiment information extraction, sentiment classification, and sentiment tendency analysis.

For data acquisition, a topic is first selected, and Python tools are used to crawl the text of that topic from the Weibo platform, including its punctuation marks and emoticons. The collected semi-structured information is then preprocessed (extraction of punctuation marks and emoticons, data cleaning, word segmentation and part-of-speech tagging, stop-word removal, and so on) to obtain a plain-text corpus, which is stored.

The sentiment information extraction module extracts valuable evaluation objects and evaluation words, so as to obtain Weibo texts that are both topic-related and subjective as the experimental data for the sentiment classification task.

The sentiment classification module mainly performs multi-class sentiment classification of Weibo text. It slightly improves on earlier subjective/objective sentiment classification features for Weibo text: it combines the eight-dimensional feature words improved by Zhang Xiang with the 10 types of sentiment polarity features proposed by He Yue, Xiao Min, and Zhang Yue, and adds punctuation as a sentiment feature. Punctuation marks commonly used on Weibo are manually labeled as positive or negative and stored as "[+]" and "[-]" respectively. This yields 12 sentiment classification features, which are used as the sentiment classification feature words to improve classification accuracy.

The sentiment tendency analysis module addresses tendency classification: after the sentiment classification of Weibo text, a sentiment intensity calculation is added so that the sentiment tendency falls into distinct states. Each of the six categories output by the sentiment classification module (extremely positive, relatively positive, positive, negative, relatively negative, and extremely negative) is assigned a corresponding weight, which makes it straightforward to compute the sentiment intensity of the hot topic and then determine its sentiment tendency state.
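
The intensity calculation could be sketched as follows. The patent does not give the weight values or the aggregation rule, so the symmetric weights, the averaging over posts, and the zero threshold used here are all hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical weights: the text only says each category receives a weight.
CATEGORY_WEIGHTS = {
    "extremely positive": 3,
    "relatively positive": 2,
    "positive": 1,
    "negative": -1,
    "relatively negative": -2,
    "extremely negative": -3,
}


def topic_intensity(labels):
    """Average category weight over all classified posts of a topic."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(CATEGORY_WEIGHTS[c] * n for c, n in counts.items()) / total


def tendency(intensity):
    """Map an intensity score to a coarse tendency state."""
    if intensity > 0:
        return "positive"
    if intensity < 0:
        return "negative"
    return "neutral"
```

Averaging keeps the score comparable across topics with different numbers of posts; a weighted sum would instead reward sheer volume.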

The above is only a preferred embodiment of the invention and does not limit the invention in any form. Any simple modification, equivalent change, or refinement made to the above embodiment according to the technical essence of the invention falls within the scope of the technical solution of the invention.

Claims


1. A sentiment tendency analysis method for Weibo hot topics, characterized in that it proceeds according to the following steps:

2. The sentiment tendency analysis method for Weibo hot topics according to claim 1, characterized in that: in step A, data acquisition and preprocessing means selecting a specific topic on Weibo, crawling the text of that topic from the Weibo platform with Python tools, and then preprocessing the collected semi-structured information to obtain a plain-text corpus, which is stored.

3. The sentiment tendency analysis method for Weibo hot topics according to claim 1, characterized in that: the TF-IDF-COS and SVM algorithms in step B are as follows:

The TF-IDF algorithm is combined with cosine similarity to compute the similarity between a text and the topic; the TF-IDF weight of word i and the cosine coefficient between word i and the hot topic word T(w) are computed, and the words most similar to the hot topic word are extracted; the SVM algorithm then separates topic-related texts from topic-irrelevant texts, yielding the Weibo texts related to the topic. Term frequency reflects the number of times a word appears in a document and is calculated as follows:

where $w_i$ denotes the i-th word, $p_j$ the j-th text, $n_{ij}$ the number of times the i-th word occurs in the j-th text, and $n_j$ the total number of words in the j-th text; inverse document frequency measures the importance of a word and describes how widely it is used, and is calculated as follows:

where $m$ is the total number of documents in the corpus and $m_i$ is the number of documents in the corpus that contain the word $w_i$; to prevent the denominator from being 0 when a rare word does not occur in the corpus, the IDF is smoothed by adding 1 to the denominator, so that words absent from the corpus also receive a usable IDF value,

For text feature representation, each Weibo text is represented by the features of the words it contains; these features and their weights form a vector $(W_{1,j}, W_{2,j}, W_{3,j}, \ldots, W_{n,j})$, where $W_{i,j}$ is the weight of term i in Weibo text $D_j$, calculated as follows:

$$W_{i,j} = TF_{i,j} \times IDF_i \times COS_i$$

4. The sentiment tendency analysis method for Weibo hot topics according to claim 1, characterized in that: the SVM-BILSTM algorithm in step C combines SVM with BILSTM; the SVM-BILSTM multi-class sentiment classification model for microblog text outputs six sentiment categories: extremely positive, relatively positive, positive, negative, relatively negative, and extremely negative; the BILSTM computation is

$$s_t = f(U x_t + W s_{t-1})$$

$$s'_t = f(U' x_t + W' s'_{t+1})$$

where U and U′, W and W′, and V and V′ are the distinct weight matrices of the BILSTM computation: U and W are the input-to-hidden and hidden-to-hidden weights of the forward pass, U′ and W′ are the corresponding weights of the backward pass, and V and V′ are the weights from the BILSTM hidden layers to the output layer; SVM finds, in the vector space containing the sample points, an optimal separating hyperplane that satisfies the classification requirements, separating samples of different classes while maximizing the classification margin; it is a generalized linear classifier that performs binary classification of data by supervised learning and relies mainly on the choice of kernel function:

$$K(x_i, y_i) = (x_i \cdot y_i)$$

Take the training sample set $T = \{(x_i, y_i)\}$, $i = 1, 2, \ldots, n$, where $x$ is the input vector, $y \in \{1, -1\}$, and $y_i$ is the class label of $x_i$; the hyperplane equation is:

$$\omega \cdot x_i + b = 0$$

where $\omega$ is the normal vector, which determines the orientation of the hyperplane, and $b$ is the displacement term, which determines the distance between the hyperplane and the origin; the resulting kernel function expansion over the training samples is:

where $i = 1, 2, \ldots, n$; $x$ is the input vector; $y \in \{1, -1\}$; $y_i$ is the class label of $x_i$; $k$ is the kernel function; $b$ is the displacement term; and $\alpha$ is the Lagrange multiplier.