CN109933657B


Info

Publication number: CN109933657B
Application number: CN201910218584.2A
Authority: CN
Inventors: 冯佳纯; 饶洋辉
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2019-03-21
Filing date: 2019-03-21
Publication date: 2021-07-09
Anticipated expiration: 2039-03-21
Also published as: CN109933657A

Abstract

The invention addresses sentiment analysis and topic mining tasks in the field of natural language processing, and in particular relates to a topic mining and sentiment analysis method based on user feature optimization. The method comprises the following steps: S1, establishing a multi-dimensional topic-sentiment joint model (MTSM) based on the LDA topic model, wherein the model integrates text information, time, user features and sentiment labels; S2, training the model on a training corpus to solve for the model parameters; and S3, performing topic mining and sentiment prediction on the test corpus using the trained model. Targeting the characteristics of social network texts, the method integrates four dimensions of information (text, time, user features and sentiment labels), redefines the generative process of social network texts, establishes a multi-dimensional topic-sentiment joint model, provides topic information that can be observed and compared from multiple perspectives, and improves the accuracy of sentiment prediction for social network texts.

Description

Translated from Chinese

A topic mining and sentiment analysis method based on user feature optimization

Technical field

The invention addresses sentiment analysis and topic mining tasks in the field of natural language processing, and more particularly relates to a topic mining and sentiment analysis method based on user feature optimization.

Background

Texts on Internet social networks contain users' opinions and personal emotions; the process of extracting this information from such unstructured network data is called sentiment analysis or opinion mining. By their basic approach, existing methods fall mainly into machine learning models, lexicon-based models and topic models. In recent years, with the rapid development of topic models, many of them have been extended into sentiment classification models and applied to text generated by Internet users, for example for sentiment classification and topic mining of product reviews and movie reviews.

Mei Q et al. proposed the first joint sentiment-topic model, the Topic-Sentiment Model (TSM), which extends pLSA. It models sentiment and topic cues simultaneously: to generate each word in a document, it first decides whether the word carries positive or negative sentiment, then chooses the word's topic, and finally chooses a word under that topic. Like pLSA, TSM is prone to overfitting on small datasets. Building on the advantages of LDA, Lin C and He Y proposed the JST model, which not only adds prior distributions for the latent topic and sentiment variables but also gives each document a multinomial sentiment distribution and each sentiment label of that document a multinomial topic distribution. In JST, however, topics and sentiments are treated as relatively independent, and this simple combination introduces noise in the form of inconsistent sentiment within a document. Building on JST, Jo Y and Oh AH proposed the ASUM model, which assumes that all words in a sentence share a single topic and a single sentiment label. Li F et al. proposed the Sentiment-LDA and Dependency-Sentiment-LDA models: Sentiment-LDA assumes that a document's multinomial topic distribution determines its binomial sentiment distribution, while Dependency-Sentiment-LDA uses conjunctions in a sentence (such as "but", "and" and "however") to reduce inconsistency in word sentiment. To address JST's inability to separate topic words from sentiment words, Zhao W et al. proposed the Maximum Entropy LDA (Max-Ent LDA) model, which uses the properties of maximum entropy to separate background words from topic-specific words, improving the precision of topic mining and the accuracy of sentiment analysis.

Xu K et al. proposed the TUS-LDA model, which combines time information, user identity and sentiment orientation for personal interest mining and social hotspot detection. TUS-LDA divides topics into two categories: "static topics" related to users' personal interests, and "dynamic topics" related to social hot events that change markedly over time. If the topic of a social network text is a static topic, TUS-LDA uses its "user-sentiment-topic" sub-model to analyze the user's personal interests and sentiment orientation; otherwise it uses its "time-sentiment-topic" sub-model to obtain social hotspots, events and public opinion. Both sub-models likewise assume that the sentiment categories of each text follow a multinomial distribution, and the sentiment category then determines either a multinomial distribution over a user's topics of interest or a multinomial distribution over the topics of a time period.

All of the joint topic-sentiment models above are unsupervised and rely on auxiliary information from a sentiment lexicon to improve sentiment prediction. To apply topic models to supervised learning, Mcauliffe JD and Blei DM proposed supervised topic models (SLDA), which apply to both classification and regression problems; however, that model does not examine the connection between the topic layer and the sentiment layer in depth. Bao S et al. proposed the supervised Emotion-Term Model (ETM) for sentiment analysis. ETM is built from the writer's perspective and targets public emotion classification: given a training corpus and the public emotion votes attached to each article, it predicts the public's emotional response to the test corpus. Rao Y et al. proposed two supervised topic models for sentiment analysis, the Multi-label Supervised Topic Model (MSTM) and the Sentiment Latent Topic Model (SLTM); experiments show that MSTM and SLTM, which are built from the reader's perspective, are better suited to predicting public emotion votes.

Most previous work fuses text with only one or two additional dimensions, such as publication time, sentiment orientation or author identity. No existing work targets the characteristics of social network texts by fully mining and effectively integrating the text, publication time and user features that such texts provide, so that each dimension contributes its full value to accurate mining of social texts. For example, although TUS-LDA combines four dimensions (sentiment, time, text and user identity), it does not use users' feature information. Each of these dimensions is valuable for topic-model-based sentiment analysis, as follows:

First, hot topics in online public opinion change rapidly and evolve markedly over time. For example, the once heated topic of "helping the elderly cross the road" was long filled with topic words such as "extortion", "moral bottom line" and "indifference", which expressed people's worry and condemnation and carried a negative sentiment. Some time later, as the incident cooled and people reflected on it rationally, public expression on the topic gradually shifted back to positive words such as "virtue", "kindness" and "justice", and the sentiment turned positive again.

Second, sentiment labels supervise the model's topic modeling and sentiment analysis, helping it better distinguish the connections between different topics and different sentiments.

Finally, users' feature labels also influence the topic-sentiment relationship. For example, men and women, or working-class and middle-class people, may perceive and feel subtly differently about the same news event; this is inseparable from the influence of their own circumstances, and a user's feature labels are precisely what describe the user and those circumstances. Clearly, Internet users number in the tens of thousands; if an author-topic approach such as the AT model were used for topic extraction and sentiment analysis of social texts, tracking and modeling every individual user would produce far too many model parameters and could not scale to the huge volume of social texts. At the same time, a social network serves every member of society: individuals differ from one another yet share common traits. Dividing users into communities at different levels of granularity according to these common traits, and then performing topic modeling, not only reduces the number of model parameters effectively but also lets the information of users within a community complement one another, yielding richer topic information and more effective sentiment prediction. So far, however, no related work has proposed how to effectively integrate users' multi-dimensional features together with time, text and sentiment labels into a topic model.

Summary of the invention

To overcome at least one of the defects of the prior art described above, the present invention provides a topic mining and sentiment analysis method based on user feature optimization. The method integrates four dimensions of information (text, time, user features and sentiment labels), redefines the generative process of social network texts, and establishes a multi-dimensional topic-sentiment joint model; by integrating this multi-dimensional information, it improves the accuracy of sentiment prediction for social network texts.

To solve the above technical problem, the present invention adopts the following technical solution: a topic mining and sentiment analysis method based on user feature optimization, comprising the following steps:

S1. Establish a multi-dimensional topic-sentiment joint model (MTSM) based on the LDA topic model; the model integrates text information, time, user features and sentiment labels;

S2. Following the generative process of documents in the model, train the model on the training corpus and solve for its parameters: estimate the parameters of the community probability distribution of document users to discover user communities; once a user's community is known, perform topic and sentiment detection on the documents written by that user; use the Gibbs sampling algorithm to repeatedly sample every word in the users' documents according to the sampling formulas, inferring the topic and sentiment label each word may belong to, until convergence;

S3. Once the model parameters have been solved, the trained MTSM model can perform topic mining and sentiment prediction on test documents;

S4. Perform topic mining and sentiment prediction on the test documents: given the model parameters, processing a test document consists of two steps, community discovery and sampling of the document's words; these two steps are iterated until convergence to obtain new parameters based on both the training and test documents, which are then used for topic mining and sentiment prediction.

Further, the MTSM model adds the following generation conditions to the original LDA topic model:

1) Add a global multinomial community distribution π whose prior follows a Dirichlet distribution, i.e. π ~ Dirichlet(γ); this distribution gives the probability that a user in the corpus belongs to each community;

2) Add global multinomial user-feature distributions ψ for each community; each user feature dimension, indexed by j, has its own distribution whose prior follows a Dirichlet distribution, i.e. ψ_j ~ Dirichlet(λ); this distribution gives the probability of each feature value among the users of a given community;

3) For each community c, add a topic distribution θ_c over the articles in that community, i.e. all articles written by all users of the community share one topic distribution, whose prior follows a Dirichlet distribution, θ_c ~ Dirichlet(α); this distribution gives the topic probabilities of all articles by the users of each community;

4) For each topic z, add a sentiment distribution φ_z whose prior follows a Dirichlet distribution, i.e. φ_z ~ Dirichlet(μ); this distribution gives the probability of each sentiment for a topic mined from the corpus;

5) For each topic, add a time distribution with parameter τ that follows a Beta distribution, i.e. t ~ Beta(τ); this distribution gives the probability of the topic over time;

6) For each specific sentiment of each specific topic, add a word distribution [formula image missing] whose prior follows a Dirichlet distribution [formula image missing]; this distribution gives the probability of every word under that topic and sentiment.

Further, step S2 specifically comprises:

S21. In the feature label space of J dimensions in total, for each feature dimension f_j, sample a multinomial feature-value distribution ψ_j ~ Dirichlet(λ);

S22. For all users in the dataset, sample a multinomial community distribution π ~ Dirichlet(γ);

S23. For each community c, sample a multinomial topic distribution θ_c ~ Dirichlet(α);

S24. For each topic z, sample a multinomial sentiment distribution φ_z ~ Dirichlet(μ);

S25. For each topic z, sample a time distribution that follows a Beta distribution, t ~ Beta(τ);

S26. For each specific sentiment s of each topic z, sample a multinomial word distribution [formula image missing];

S27. For each user u_r:

a) Sample the community to which the user belongs [formula image missing];

b) For each dimension of the feature space describing the user [formula image missing], sample the user's feature value in the j-th dimension [formula image missing];

c) For each article written by user u_r [formula image missing], and for each word w_{i,n} in it:

i. Sample the topic of the word according to the community c of the article's author, z_{i,n} ~ Mul(θ_c);

ii. Sample the sentiment of the word according to topic z_{i,n}, s_{i,n} ~ Mul(φ_z);

iii. Sample the timestamp of the word according to topic z_{i,n}, t_{i,n} ~ Beta(τ_z);

iv. Sample the specific word according to topic z_{i,n} and sentiment s_{i,n} [formula image missing].
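The generative process of steps S21 to S27 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function and variable names, the default sizes, the hyperparameter name `beta` for the word-distribution prior, the choice of one feature distribution per community and dimension, and the way the Beta shape parameters are initialized are all hypothetical. Timestamps are assumed to be normalized to the interval [0, 1], and the time distribution is taken to be a Beta distribution, as the notation t ~ Beta(τ) suggests.

```python
import random


def dirichlet(alpha, size, rng):
    """Draw a symmetric Dirichlet sample by normalizing Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def categorical(probs, rng):
    """Draw an index according to a probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_users, docs_per_user, words_per_doc,
                    C=2, K=3, S=2, V=20, feature_sizes=(2, 3),
                    gamma=1.0, lam=1.0, alpha=0.5, mu=0.5, beta=0.5, seed=0):
    rng = random.Random(seed)
    # S21: feature-value distributions (assumed: one per community and dimension)
    psi = [[dirichlet(lam, n, rng) for n in feature_sizes] for _ in range(C)]
    # S22: community distribution over all users
    pi = dirichlet(gamma, C, rng)
    # S23: topic distribution of each community
    theta = [dirichlet(alpha, K, rng) for _ in range(C)]
    # S24: sentiment distribution of each topic
    phi = [dirichlet(mu, S, rng) for _ in range(K)]
    # S25: Beta shape parameters of each topic (hypothetical initialization)
    tau = [(rng.uniform(1, 5), rng.uniform(1, 5)) for _ in range(K)]
    # S26: word distribution of each (topic, sentiment) pair
    word_dist = [[dirichlet(beta, V, rng) for _ in range(S)] for _ in range(K)]

    users = []
    for _ in range(num_users):                      # S27
        c = categorical(pi, rng)                    # a) community of the user
        features = [categorical(psi[c][j], rng)     # b) one value per dimension
                    for j in range(len(feature_sizes))]
        docs = []
        for _ in range(docs_per_user):              # c) each article, each word
            tokens = []
            for _ in range(words_per_doc):
                z = categorical(theta[c], rng)          # i.   topic
                s = categorical(phi[z], rng)            # ii.  sentiment
                t = rng.betavariate(*tau[z])            # iii. timestamp in [0, 1]
                w = categorical(word_dist[z][s], rng)   # iv.  word
                tokens.append((w, z, s, t))
            docs.append(tokens)
        users.append({"community": c, "features": features, "docs": docs})
    return users
```

Calling `generate_corpus(num_users=4, docs_per_user=2, words_per_doc=5)` returns four synthetic users, each with a community, a feature vector and two documents of five `(word, topic, sentiment, timestamp)` tuples.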

Further, step S3 specifically comprises:

S31. Estimate the parameters of the community probability distribution of document users:

In the community discovery step, the probability that user u_r, with feature labels [formula image missing], belongs to community c is given by:

[formula image missing]

where [formula image missing] and [formula image missing] are computed as:

[formula image missing]

In these formulas, [formula image missing] is the number of users other than u_r that belong to community c, and [formula image missing] is the number of times feature value k_j of the j-th feature dimension occurs among all users of community c other than u_r;
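The community discovery step can be illustrated with a short Python sketch. The exact formula is not available in this text, so the sketch assumes the standard collapsed Gibbs form for a Dirichlet-multinomial mixture, built only from the two counts defined above (users in community c, and occurrences of feature value k_j in community c, both excluding u_r). The function name, argument names and default hyperparameters are hypothetical.

```python
def community_posterior(features, n_c, n_cjk, feature_sizes, gamma=1.0, lam=1.0):
    """Probability that a user belongs to each community.

    features      : feature values k_j of the user, one index per dimension j
    n_c[c]        : number of OTHER users currently assigned to community c
    n_cjk[c][j][k]: among those users, how many have value k in dimension j
    feature_sizes : number of possible values in each feature dimension
    """
    num_communities = len(n_c)
    total_users = sum(n_c)
    weights = []
    for c in range(num_communities):
        # Smoothed share of users in community c (prior Dirichlet(gamma)).
        w = (n_c[c] + gamma) / (total_users + num_communities * gamma)
        # Smoothed probability of each observed feature value (prior Dirichlet(lam)).
        for j, k in enumerate(features):
            w *= (n_cjk[c][j][k] + lam) / (n_c[c] + feature_sizes[j] * lam)
        weights.append(w)
    norm = sum(weights)
    return [w / norm for w in weights]
```

In a Gibbs sweep, the community of u_r would be drawn from the returned probabilities, and the counts would then be updated with the new assignment.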

S32. After community discovery for the users is complete, topic sampling and sentiment sampling are performed on each user-generated document according to the text content, sentiment label and timestamp of the training documents;

For a document [formula image missing] whose sentiment label s_i is known and whose author u_r belongs to community [formula image missing], the probability that its word w_{i,n} belongs to a given topic and sentiment is:

[formula image missing]

where the parameters of [formula image missing] are:

[formula image missing]

where [formula image missing] is the number of words assigned to topic z in all documents of all users of community c, excluding the n-th word w_{i,n} of document d_i; [formula image missing] is the number of words in all documents assigned to sentiment s under topic z, excluding w_{i,n}; and [formula image missing] is the number of times word w is assigned to that topic and that sentiment, excluding w_{i,n};
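The topic sampling of a training word can be sketched in Python as follows. Because the formula itself is not available in this text, the sketch assumes the usual collapsed Gibbs product of smoothed count ratios, using the three counts defined above, multiplied by a Beta density for the timestamp. All names, the hyperparameter `beta` for the word prior, and the default values are hypothetical; timestamps are assumed to lie strictly between 0 and 1.

```python
import math


def beta_pdf(t, a, b):
    """Density of Beta(a, b) at t, for 0 < t < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))


def topic_posterior(w, s, t, c, n_cz, n_zs, n_zsw, tau,
                    alpha=0.5, mu=0.5, beta=0.1):
    """Probability of each topic for word w with known sentiment s and time t.

    n_cz[c][z]    : words of community c assigned to topic z (current word excluded)
    n_zs[z][s]    : words assigned to sentiment s under topic z (excluded likewise)
    n_zsw[z][s][w]: times word w is assigned to topic z and sentiment s
    tau[z]        : (a, b) Beta shape parameters of topic z
    """
    K, S, V = len(n_zs), len(n_zs[0]), len(n_zsw[0][0])
    weights = []
    for z in range(K):
        p_topic = (n_cz[c][z] + alpha) / (sum(n_cz[c]) + K * alpha)
        p_sent = (n_zs[z][s] + mu) / (sum(n_zs[z]) + S * mu)
        p_word = (n_zsw[z][s][w] + beta) / (n_zs[z][s] + V * beta)
        weights.append(p_topic * p_sent * p_word * beta_pdf(t, *tau[z]))
    total = sum(weights)
    return [x / total for x in weights]
```

Since the sentiment label of a training document is known, only the topic is sampled here; the sentiment index s is fixed to the document label, which is how the labels supervise training.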

S33. The sentiment label s_i of a document is a known quantity. During sampling on the training set, the words of each document therefore update and sample only the parameters associated with sentiment s_i under each topic, so that the sentiment labels provide supervision;

The parameter [formula image missing] is updated by the method of moments, computed as:

[formula image missing]

where [formula image missing] and [formula image missing] are, respectively, the mean and standard deviation of the timestamps of all words assigned to topic z.
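An update of the time parameter from the timestamp mean and standard deviation can be sketched as below. The formula in the source is not available, so this sketch assumes the textbook method-of-moments estimator for a Beta distribution, which matches the description of an update from the timestamp mean and standard deviation. The function name is hypothetical, and timestamps are assumed to be normalized to (0, 1).

```python
import statistics


def fit_beta_by_moments(timestamps):
    """Estimate Beta shape parameters (a, b) from normalized timestamps.

    Uses the standard moment-matching identities
        a = m * (m * (1 - m) / v - 1)
        b = (1 - m) * (m * (1 - m) / v - 1)
    where m is the sample mean and v the (population) variance.
    """
    m = statistics.mean(timestamps)
    v = statistics.pvariance(timestamps)
    if v <= 0 or v >= m * (1 - m):
        # Moment matching has no valid Beta solution in these cases.
        raise ValueError("variance must satisfy 0 < v < m * (1 - m)")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common
```

For example, the timestamps 0.2, 0.4, 0.6 and 0.8 have mean 0.5 and variance 0.05, which gives shape parameters close to (2, 2), a symmetric distribution centred on the middle of the time range.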

Further, step S4 specifically comprises:

S41. Consider a test document d_test whose author is known to be user u_test, with feature labels [formula image missing] and document timestamp t_test. The sentiment label of the document is computed as:

[formula image missing]

where the probability of the community to which the user belongs, [formula image missing], is computed as:

[formula image missing]

The above formula simplifies to:

[formula image missing]

where [formula image missing] is the user's feature value in the j-th dimension of the feature space. The parameters in the above formula are computed as:

[formula image missing]

where [formula image missing] is the number of words assigned to topic z in all documents of all users of community c, over both the training and test sets, excluding the n-th word w_{test,n} of document d_test; [formula image missing] is the number of words assigned to sentiment s under topic z in all documents of the training and test sets, excluding w_{test,n}; [formula image missing] is the number of times word w is assigned to that topic and that sentiment in the training and test sets, excluding w_{test,n}; [formula image missing] and [formula image missing] are replaced by [formula image missing] and [formula image missing], which are, respectively, the mean and standard deviation of the timestamps of all words assigned to topic z in the training and test sets; and [formula image missing] is the number of users belonging to community c other than u_test, the author of document d_test;

S42. Because the sentiment label s_test of a test document is unknown, during sampling on the test documents the words of each document must update and sample the parameters for every sentiment under every topic; the sentiment category with the highest probability is then taken as the document's sentiment.
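The final decision of step S4, choosing the sentiment with the highest probability, can be illustrated with a simplified Python sketch. It is not the sampling procedure described above: instead of re-sampling the test words, it assumes that point estimates of all distributions are already available and scores each sentiment directly, weighting by the author's community probabilities. All names are hypothetical, and `time_density[z]` stands for the Beta density of t_test under topic z.

```python
import math


def predict_sentiment(words, community_probs, theta, phi, word_dist, time_density):
    """Return (most probable sentiment index, normalized sentiment scores).

    words           : word indices of the test document
    community_probs : probability of each community given the author's features
    theta[c][z]     : topic distribution of community c
    phi[z][s]       : sentiment distribution of topic z
    word_dist[z][s] : word distribution of topic z and sentiment s
    time_density[z] : density of the document timestamp under topic z
    """
    K, S = len(phi), len(phi[0])
    scores = []
    for s in range(S):
        score = 0.0
        for c, p_c in enumerate(community_probs):
            # Log-likelihood of the words if the whole document had sentiment s.
            # Assumes smoothed (strictly positive) distributions.
            log_lik = 0.0
            for w in words:
                per_word = sum(theta[c][z] * phi[z][s] * word_dist[z][s][w]
                               * time_density[z] for z in range(K))
                log_lik += math.log(per_word)
            score += p_c * math.exp(log_lik)
        scores.append(score)
    total = sum(scores)
    probs = [x / total for x in scores]
    best = max(range(S), key=lambda s: probs[s])
    return best, probs
```

For long documents the product of per-word probabilities can underflow; a practical version would compare the scores in log space.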

Compared with the prior art, the beneficial effects are:

1. The value of user feature labels is exploited: social network users are divided into communities according to their feature labels, which enables community-level topic mining and sentiment analysis that traditional joint topic-sentiment models cannot perform;

2. Targeting the characteristics of social network texts, the method integrates four dimensions of information (text, time, user features and sentiment labels), redefines the generative process of social network texts, establishes a multi-dimensional topic-sentiment joint model, and allows topic information to be observed and compared from multiple perspectives;

3. The multi-dimensional topic-sentiment joint model is applied to sentiment prediction in the field of public sentiment analysis; by integrating multi-dimensional information, it improves the accuracy of sentiment prediction for social network texts.

Description of drawings

FIG. 1 is a schematic structural diagram of the multi-dimensional topic-sentiment joint model MTSM of the present invention.

FIG. 2 is a flow chart of text analysis with MTSM according to the present invention.

FIG. 3 is a flow chart of the algorithm of the multi-dimensional topic-sentiment model MTSM of the present invention.

FIG. 4 is a flow chart of the prediction-step algorithm of the multi-dimensional topic-sentiment joint model MTSM of the present invention.

Detailed description

The accompanying drawings are for illustration only and should not be construed as limiting the present invention. To better illustrate the embodiment, some parts in the drawings may be omitted, enlarged or reduced and do not represent the dimensions of an actual product. Those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings. The positional relationships depicted in the drawings are likewise for illustration only and should not be construed as limiting the present invention.

Example 1:

A topic mining and sentiment analysis method based on user feature optimization, comprising the following steps:

Step 1. Establish a multi-dimensional topic-sentiment joint model (MTSM) based on the LDA topic model; the model integrates text information, time, user features and sentiment labels;

As shown in Figure 1, the MTSM model adds generation conditions 1) to 6), as listed in the Summary of the invention, to the original LDA topic model.

In this model, topic mining and sentiment analysis of a dataset consists of two main steps: user community discovery, followed by document topic extraction and sentiment prediction.

The first step is user community discovery. Incorporating user features into the topic model, and thereby constraining how topics are formed, is one of the innovations of this work. The reasons are as follows. In real life, people in different situations or environments often feel differently about the same topic: for the same news event, people of different income levels, regions and ages often hold different views, while similar groups of people tend to discuss similar topics and give similar emotional feedback. This work therefore proposes that the set of users in a dataset is in fact composed of different latent communities, each characterized by the distinct features that occur frequently within it. The total number of communities C is a preset parameter, so users can be partitioned at different levels of granularity.

The second step is topic mining and sentiment analysis. After users have been clustered into communities, MTSM performs topic extraction and sentiment prediction on the articles of the users in each community. It is assumed that, within each community, users' articles follow a single topic distribution, and that different topics have different time distributions. In MTSM, a user generates a document as follows: according to the topic distribution and time distribution of the user's community, the user chooses a topic with some probability; then, according to the sentiment distribution of that topic, chooses a sentiment orientation with some probability; finally, according to the chosen topic and sentiment, chooses a word with some probability and adds it to the document. These steps repeat until the document is complete. In MTSM, therefore, the generation of a document is influenced not only by its topic but also by the author's community, the time and the sentiment orientation. Combining the dimensions in this way yields not only the topic distribution within a community but also, at the same time, the sentiment distribution under each topic, the time distribution of each topic, and the word distribution for each sentiment of each topic.

The parameter notation used by the MTSM model is shown in Table 1:

Table 1. Parameters required by the MTSM model [table image missing]

Step 2. Feed the test documents into the MTSM model and generate documents according to the model;

The generation process of an article in the MTSM model follows steps S21 to S27 given in the Summary of the invention.

步骤3.参数求解Step 3. Parameter Solving

知道了该模型生成文档的流程之后，我们就可以根据流程推导该模型的参数。得到模型的参数后，就知道了一批训练文本的社区分布、每个社区的用户特征分布、主题分布、每个主题的情感分布、每个主题的时间分布、每个主题每个情感的词汇分布；根据该模型的参数，还能够预测一篇文档的主题分布和情感标签，当要预测一篇测试(未知)文本的主题和情感标签时，根据模型参数、文本的用户特征、时间戳、内容即可推测该文本的主题和情感标签。下面首先介绍模型参数求解的步骤，然后是预测一篇测试(未知)文本的主题和情感标签的步骤。主要流程如图2所示.After knowing the process of document generation by the model, we can deduce the parameters of the model according to the process. After the parameters of the model are obtained, the community distribution of a batch of training texts, the user feature distribution of each community, the topic distribution, the sentiment distribution of each topic, the time distribution of each topic, and the vocabulary of each topic and each emotion are known. distribution; according to the parameters of the model, it can also predict the topic distribution and sentiment label of a document. When predicting the topic and sentiment label of a test (unknown) text, according to the model parameters, text user characteristics, timestamp, content to infer the topic and sentiment tags of the text. The following first introduces the steps of solving the model parameters, followed by the steps of predicting the topic and sentiment labels of a test (unknown) text. The main process is shown in Figure 2.

Because the Gibbs Sampling algorithm is simple and effective, it is used here to solve the parameters of the MTSM model; the detailed procedure is given in Algorithm 1. The derivation has two main steps. The first estimates the community probability distribution parameters of the documents' users and thereby discovers user communities. The second, once the community of a user is known, detects the topics and sentiments of the documents written by that user. Gibbs Sampling repeatedly samples every user and every word in the documents according to the formulas, inferring the community each user may belong to and the topic and sentiment label each word may belong to, until convergence, at which point the model parameters are known.
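The two-step sampling loop can be sketched in Python as below. It is only a skeleton under stated assumptions: the conditional probabilities are supplied by the caller through the hypothetical callables `community_weights` and `word_weights`, the function names are invented for the example, and a fixed number of sweeps stands in for a convergence check.

```python
import numpy as np


def sample_index(weights, rng):
    """Draw an index with probability proportional to non-negative weights."""
    w = np.asarray(weights, dtype=float)
    return int(rng.choice(w.size, p=w / w.sum()))


def gibbs_sweeps(words_by_user, community_weights, word_weights,
                 n_sweeps=20, seed=0):
    """Alternate community sampling and word-level topic/sentiment sampling.

    community_weights(u, community) -> 1-D weights over communities (hypothetical).
    word_weights(word_id, c, assign) -> 2-D weights, topics x sentiments (hypothetical).
    """
    rng = np.random.default_rng(seed)
    community = {u: 0 for u in words_by_user}    # initial community guess
    assign = {}                                   # word_id -> (topic, sentiment)
    for _ in range(n_sweeps):                     # fixed sweeps instead of a convergence test
        for u, word_ids in words_by_user.items():
            # Step 1: resample the community of user u.
            community[u] = sample_index(community_weights(u, community), rng)
            # Step 2: resample topic and sentiment for each word written by u.
            for word_id in word_ids:
                w = np.asarray(word_weights(word_id, community[u], assign),
                               dtype=float)
                flat = sample_index(w.ravel(), rng)
                z, s = np.unravel_index(flat, w.shape)
                assign[word_id] = (int(z), int(s))
    return community, assign
```

Passing the conditionals in as functions keeps the loop structure separate from the count-based formulas, so the same skeleton serves both training and prediction once suitable weight functions are provided.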

The algorithm flow chart for solving the parameters of the multi-dimensional topic-sentiment joint model is shown in Figure 3.

In the community-discovery step, the probability that a user u_r carrying feature labels belongs to community c is as follows:

where the two terms of the formula are solved as:

Here, the user count is the number of all users belonging to community c, excluding user u_r.

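For a document whose sentiment label is known to be s_i and whose author u_r belongs to a given community, the probability that its word w_{i,n} belongs to a given topic and sentiment is:

where the parameters are:

The parameter of the time distribution is updated by moment estimation, computed as follows:

where the two statistics are, respectively, the mean and standard deviation of the timestamps of all words assigned to topic z.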
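An update driven by the timestamp mean and standard deviation can be illustrated with standard moment matching for a Beta distribution. This is a sketch under assumptions: timestamps are taken to be normalized to the open interval (0, 1), the per-topic time distribution is taken to be Beta, and the function name `beta_params_from_timestamps` is hypothetical. It is not presented as the patent's own update formula.

```python
import numpy as np


def beta_params_from_timestamps(timestamps):
    """Moment-matching estimate of Beta shape parameters (a, b).

    `timestamps` are the normalized timestamps of all words currently
    assigned to one topic (normalization to (0, 1) is an assumption).
    """
    t = np.asarray(timestamps, dtype=float)
    mean = t.mean()
    var = t.var()                       # population variance of the timestamps
    if not 0.0 < mean < 1.0 or var <= 0.0:
        raise ValueError("need timestamps in (0, 1) with non-zero spread")
    common = mean * (1.0 - mean) / var - 1.0
    if common <= 0.0:
        raise ValueError("variance too large for a Beta distribution")
    # Matching the Beta mean a/(a+b) and variance to the sample statistics.
    return float(mean * common), float((1.0 - mean) * common)
```

For example, the timestamps `[0.2, 0.4, 0.6, 0.8]` have mean 0.5 and variance 0.05, which gives shape parameters `(2.0, 2.0)`.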

Step 4. Perform topic mining and sentiment prediction on the test document. Once the model parameters have been obtained, topic mining and sentiment prediction on a test document proceed in two steps: community discovery and sampling of the document's words. These two steps are iterated until convergence, yielding new parameters based on both the training documents and the test documents, which are then used for topic mining and sentiment prediction.

The algorithm flow chart of the prediction step of the multi-dimensional topic-sentiment model is shown in Figure 4.
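The final decision of the prediction step, choosing the sentiment category with the largest probability for a test document, can be sketched as follows. The sketch assumes that the sampler has already produced a `(topic, sentiment)` pair for every word of the document and that the document-level probability is approximated by the share of words assigned to each sentiment; both the function name `predict_sentiment` and this approximation are illustrative assumptions.

```python
from collections import Counter


def predict_sentiment(word_assignments, n_sentiments):
    """Return (label, probabilities) from per-word (topic, sentiment) samples."""
    counts = Counter(s for _, s in word_assignments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("document has no sampled words")
    # Share of words assigned to each sentiment (an approximation, see lead-in).
    probs = [counts.get(s, 0) / total for s in range(n_sentiments)]
    # Pick the sentiment category with the maximum probability.
    label = max(range(n_sentiments), key=probs.__getitem__)
    return label, probs
```

The same per-word assignments can also be tallied by topic to obtain the topic distribution of the document.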

Step S4 specifically includes:

Here, the probability of the community to which the document's author belongs is calculated according to the following formula:

The above formula simplifies to:

where the feature value is the user's value in the j-th dimension of the feature space. The parameters in the above formula are calculated as:

where two of the terms are each replaced by a corresponding substitute term, and the user count is the number of users belonging to community c, excluding the author u_test of document d_test.

Obviously, the above embodiments of the present invention are merely examples given to illustrate the invention clearly and do not limit its implementations. A person of ordinary skill in the art can make other changes or modifications in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A topic mining and sentiment analysis method based on user feature optimization, characterized by comprising the following steps:

S1. establishing a multi-dimensional topic-sentiment joint model (MTSM) based on the LDA topic model, wherein the model integrates text information, time, user features and sentiment labels;

S2. training the model with the training corpus according to the document-generation process of the model, and solving the parameters: estimating the community probability distribution parameters of the documents' users, discovering user communities, and, once the community to which each user belongs is known, detecting the topics and sentiments of the documents written by that user; repeatedly sampling each word in the documents written by the user according to the formulas by using the Gibbs Sampling algorithm, and inferring the topic and sentiment label to which each word may belong, until convergence;

S3. after the model parameters have been solved, using the trained MTSM model to perform topic mining and sentiment prediction on the test document;

S4. performing topic mining and sentiment prediction on the test document: after the model parameters are obtained, topic mining and sentiment prediction on a test document are divided into two steps, community discovery and sampling of the document's words; the two steps are iterated until convergence to obtain new parameters based on the training documents and the test documents, and topic mining and sentiment prediction are performed according to the new parameters; the step specifically comprises:

S41. supposing that, for a document d_test, the user who generated the document is known to be u_test, the feature labels of the user are respectively

and the document has a timestamp t_test, then the sentiment label of the document is calculated according to the formula:

wherein the probability of the community to which the user belongs is calculated according to the following formula:

the above formula is simplified as:

wherein the feature value is the user's value in the j-th dimension of the feature space;

the calculation formula of each parameter in the above formula is:

wherein the quantities in the formulas denote, respectively:

the word frequency of topic z in all documents of all users belonging to community c, in the test set and the training set, excluding the n-th word w_test,n of document d_test;

the word frequency of sentiment s under topic z in all documents of the test set and the training set, excluding the word w_test,n;

the frequency with which the word w is assigned to the topic and to the sentiment in the test set and the training set, excluding the word w_test,n;

two terms that are respectively replaced by two substitute terms;

the mean and the standard deviation of the timestamps of all words belonging to topic z in the training set and the test set;

the number of users belonging to community c, excluding the author u_test of document d_test;

the number of all users belonging to community c, excluding user u_r;

the frequency of occurrence of feature value k_j of the j-th dimension feature among all users in community c, excluding user u_r; α, β, γ, λ and μ are Dirichlet distribution hyperparameters;

S42. the sentiment label s_test of the document is an unknown parameter; therefore, while the test documents are being sampled, each word of each document must update and sample the relevant parameters for each sentiment polarity under each topic, after which the sentiment category with the maximum probability is taken as the category of the document.

2. The method of claim 1, wherein the MTSM model adds the following generation conditions to the original LDA topic model:

1) adding a global community multinomial probability distribution π whose prior obeys a Dirichlet distribution, namely π ~ Dirichlet(γ), the distribution representing the probability that users in a batch of corpora belong to each community;

2) adding, under each specific community, a global user-feature multinomial probability distribution ψ, each user feature having its own probability distribution, indexed by j, whose prior obeys a Dirichlet distribution, namely ψ_j ~ Dirichlet(λ), the distribution representing the feature distribution of users in a given community;

3) for each community, adding an article topic probability distribution θ_c within the community, that is, all articles written by all users in the community jointly obey one topic probability distribution, whose prior obeys a Dirichlet distribution, namely θ_c ~ Dirichlet(α), the distribution representing the topic distribution of all articles of the users of each community;

4) for each topic, adding a sentiment probability distribution φ_z whose prior obeys a Dirichlet distribution, namely φ_z ~ Dirichlet(μ), the distribution representing the users' sentiment distribution toward the topic as mined from a batch of corpora;

5) adding to each topic a time probability distribution τ that obeys a Beta distribution, namely t ~ Beta(τ), the distribution representing the time distribution of a topic;

6) adding a word probability distribution for each specific sentiment of each specific topic

whose prior obeys a Dirichlet distribution

the distribution representing the distribution of all words under a specific sentiment of a specific topic.

3. The topic mining and sentiment analysis method based on user feature optimization according to claim 2, wherein step S2 specifically includes:

S21. in the feature label space with J dimensions in total, for the feature label f_j of each dimension, sampling a feature-value probability distribution satisfying a multinomial distribution, ψ_j ~ Dirichlet(λ);

S22. for all users in the data set, sampling a community probability distribution satisfying a multinomial distribution, π ~ Dirichlet(γ);

S23. for each aggregated community c, sampling a topic probability distribution satisfying a multinomial distribution, θ_c ~ Dirichlet(α);

S24. for each topic z, sampling a sentiment probability distribution satisfying a multinomial distribution, φ_z ~ Dirichlet(μ);

S25. for each topic z, sampling a time probability distribution satisfying a binomial distribution, t ~ Beta(τ);

S26. for each specific sentiment s of each topic z, sampling a word probability distribution satisfying a multinomial distribution;

S27. for each user u_r:

a) sampling the community to which the user belongs;

b) for each dimension of the feature space describing the user, sampling the user's feature value in the j-th dimension of the feature space;

c) for each article written by user u_r, and for each word w_{i,n} therein:

i. sampling the topic of the word according to the community c to which the author of the article belongs, z_{i,n} ~ Mul(θ_c);

ii. sampling the sentiment of the word according to the topic z_{i,n}, s_{i,n} ~ Mul(φ_z);

iii. sampling the timestamp of the word according to the topic z_{i,n}, t_{i,n} ~ Bin(τ_z);

iv. sampling the specific word according to the topic z_{i,n} and the sentiment s_{i,n}