CN109933657B - A sentiment analysis method for topic mining based on user feature optimization - Google Patents

A sentiment analysis method for topic mining based on user feature optimization

Publication number
CN109933657B
Authority
CN
China
Prior art keywords
distribution
topic
emotion
community
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910218584.2A
Other languages
Chinese (zh)
Other versions
CN109933657A (en
Inventor
冯佳纯
饶洋辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN201910218584.2A
Publication of CN109933657A
Application granted
Publication of CN109933657B
Legal status: Active
Anticipated expiration

Abstract

The invention belongs to the sentiment analysis and topic mining tasks in the field of natural language processing, and in particular relates to a topic-mining sentiment analysis method based on user feature optimization. The method comprises the following steps: S1, establishing a multi-dimensional topic-sentiment joint model (MTSM) based on the LDA topic model, where the model integrates text information, time, user features and sentiment labels; S2, training the model on a training corpus to solve for the model parameters; and S3, performing topic mining and sentiment prediction on a test corpus with the trained model. Targeting the characteristics of social network texts, the method integrates four dimensions of information (text information, time, user features and sentiment labels), redefines how social network texts are generated, establishes a multi-dimensional topic-sentiment joint model, provides topic information that can be observed and compared from multiple perspectives, and improves the accuracy of sentiment prediction for social network texts.

Description

Translated from Chinese
A sentiment analysis method for topic mining based on user feature optimization

Technical field

The invention belongs to the sentiment analysis and topic mining tasks in the field of natural language processing, and more particularly relates to a topic-mining sentiment analysis method based on user feature optimization.

Background

Social network texts on the Internet contain users' opinions and personal emotions; extracting this information from such unstructured web data is called sentiment analysis or opinion mining. By their basic properties, the methods fall mainly into machine learning models, lexicon-based models and topic models. With the rapid development of topic models in recent years, many of them have been extended into sentiment prediction and classification models and applied to user-generated Internet text, for example to sentiment classification and topic mining of product reviews and movie reviews.

Mei Q et al. proposed the first sentiment-topic joint model, the Topic-Sentiment Model (TSM), which improves on pLSA. It models sentiment and topic cues at the same time: to generate each word in a document, it first decides whether the word carries positive or negative sentiment, then decides the word's topic, and finally picks the word under that topic. Like pLSA, TSM is prone to overfitting on small datasets. Building on the advantages of LDA, Lin C and He Y proposed the JST model, which not only adds prior distributions over the latent topic and sentiment variables, but also assigns each document a multinomial sentiment distribution and each sentiment label of that document a multinomial topic distribution. In the JST model, however, topics and sentiments are relatively independent, and this simple way of combining them introduces noise in the form of inconsistent sentiment within a document. Building on JST, Jo Y and Oh AH proposed the ASUM model, which assumes that all words in a sentence have a single topic and share one sentiment label. Li F et al. proposed the Sentiment-LDA and Dependency-Sentiment-LDA models: Sentiment-LDA assumes that a document's multinomial topic distribution determines its binomial sentiment distribution, while Dependency-Sentiment-LDA uses conjunctions in sentences (such as "but", "and" and "however") to reduce the inconsistency of word sentiment. To address JST's inability to separate topic words from sentiment words, Zhao W et al. proposed the Maximum Entropy LDA (Max-Ent LDA) model, which uses the maximum-entropy property to divide words into background words and topic-specific words, improving the precision of topic mining and the accuracy of sentiment analysis. Xu K et al. proposed the TUS-LDA model, which combines time information, user identity and sentiment polarity to mine personal interests and detect social hot topics. TUS-LDA divides topics into two classes: "static topics" related to users' personal interests, and "dynamic topics" related to social hot events that change considerably over time. If the topic of a social network text is a "static topic", TUS-LDA uses its "user-sentiment-topic" joint sub-model to analyze the user's personal interests and sentiment polarity; otherwise it uses the "time-sentiment-topic" joint sub-model to obtain social hot topics, events and public opinion. Both sub-models of TUS-LDA likewise assume that the sentiment categories of each text follow a multinomial distribution, and use the sentiment category to determine either a multinomial distribution over a user's topics of interest or a multinomial distribution over topics within a time period.

The joint topic-sentiment models above are all unsupervised and need auxiliary information from a sentiment lexicon to improve their sentiment prediction. To apply topic models to supervised learning, Mcauliffe JD and Blei DM proposed Supervised Topic Models (SLDA), which can handle both classification and regression problems; however, that model does not examine the connection between the topic layer and the sentiment layer in depth. Bao S et al. proposed the supervised sentiment-analysis topic model Emotion-Term Model (ETM), which is built from the author's perspective and targets public sentiment classification: using an existing training corpus and the public sentiment votes on each article, it predicts the public sentiment response to the test corpus. Rao Y et al. proposed the supervised sentiment-analysis topic models Multi-label Supervised Topic Model (MSTM) and Sentiment Latent Topic Model (SLTM); experimental results show that MSTM and SLTM, which are built from the reader's perspective, are better suited to predicting public sentiment votes.

Most previous work fuses text information with only one or two additional dimensions, such as publication time, sentiment polarity or author identity. No existing work targets the characteristics of social network texts by fully mining and effectively integrating the text information, publication time and user features that such texts provide, so as to exploit the value of every dimension and mine social texts accurately. For example, although the TUS-LDA model combines four dimensions (sentiment, time, text and user identity), it does not use users' feature information. Yet each of these dimensions is valuable for topic-model-based sentiment analysis, as follows:

First, hot topics in online public opinion change quickly and evolve markedly over time. For example, the once-heated topic of "helping an elderly person cross the road" was long filled with topic words such as "extortion", "moral bottom line" and "indifference", expressing people's worry and condemnation, with a negative sentiment polarity. Some time later, as the event cooled and people reflected on it rationally, public expression on the topic gradually shifted back to positive words such as "virtue", "kindness" and "justice", returning to a positive sentiment polarity.

Second, sentiment labels supervise the model's topic modeling and sentiment analysis, helping it better distinguish the connections between different topics and different sentiments.

Finally, users' feature labels also influence the topic-sentiment relationship. For example, men and women, or working-class and middle-class people, may hold subtly different views and feelings about the same news event; this is inseparable from the influence of their own environment, and a user's feature labels are precisely an important description of the user and that environment. Clearly, Internet users number in the tens of thousands. If an author-topic modeling approach such as the AT model were used for topic extraction and sentiment analysis of online social texts, tracking and modeling every individual user would produce too many model parameters, which does not suit the sheer volume of online social text. At the same time, a social network serves every member of society; people differ from one another but also share commonalities. Dividing users into communities at different levels of granularity according to these commonalities, and then modeling topics, not only reduces the number of model parameters effectively, but also lets the information from users within a community complement one another, so that richer topic information can be mined and sentiment predicted more effectively. So far, however, no work has proposed how to effectively integrate users' multi-dimensional features together with time, text and sentiment labels into a topic model.

Summary of the invention

To overcome at least one of the above defects in the prior art, the present invention provides a topic-mining sentiment analysis method based on user feature optimization. It effectively integrates four dimensions of information (text information, time, user features and sentiment labels), redefines how online social texts are generated, establishes a multi-dimensional topic-sentiment joint model, and improves the accuracy of sentiment prediction for online social texts by integrating this multi-dimensional information.

To solve the above technical problem, the present invention adopts the following technical solution: a topic-mining sentiment analysis method based on user feature optimization, comprising the following steps:

S1. Establish a multi-dimensional topic-sentiment joint model (MTSM) based on the LDA topic model; the model integrates text information, time, user features and sentiment labels;

S2. Following the generative process of documents in the model, train the model on the training corpus and solve for the parameters: estimate the community probability distribution parameters of the document users and discover user communities; once a user's community is known, perform topic and sentiment detection on the documents written by that user; use the Gibbs sampling algorithm to repeatedly sample every word in the user's documents according to the formulas, inferring the topic and sentiment label each word may belong to, until convergence;

S3. Once the model parameters are solved, the trained MTSM model can effectively perform topic mining and sentiment prediction on test documents;

S4. Perform topic mining and sentiment prediction on the test documents: with the model parameters obtained, topic mining and sentiment prediction on a test document proceed in two steps, community discovery and word sampling of the document; these two steps are iterated until convergence, yielding new parameters based on both the training and test documents, which are then used for topic mining and sentiment prediction.

Further, the MTSM model adds the following generative conditions to the original LDA topic model:

1) Add a global multinomial community distribution π whose prior follows a Dirichlet distribution, i.e. π ~ Dirichlet(γ); this distribution gives the probability that a user in the corpus belongs to each community;

2) Add global, community-specific multinomial user-feature distributions ψ, one for each kind of user feature, indexed by j, whose prior follows a Dirichlet distribution, i.e. ψ_j ~ Dirichlet(λ); this distribution gives the probability of each feature value among the users of a given community;

3) For each community, add a topic distribution θ_c over the articles in that community, i.e. all articles written by all users in the community share one topic distribution, whose prior follows a Dirichlet distribution, i.e. θ_c ~ Dirichlet(α); this distribution gives the topic probabilities of all articles by the users of each community;

4) For each topic, add a sentiment distribution φ_z whose prior follows a Dirichlet distribution, i.e. φ_z ~ Dirichlet(μ); this distribution gives the probability of each sentiment that users express toward a topic mined from the corpus;

5) For each topic, add a time distribution with parameter τ, so that timestamps follow a Beta distribution, i.e. t ~ Beta(τ); this distribution gives the probability of a topic over time;

6) For each specific sentiment of each specific topic, add a word probability distribution [Figure RE-GDA0002029626700000041] whose prior follows a Dirichlet distribution [Figure RE-GDA0002029626700000042]; this distribution gives the probability of every word under a specific sentiment of a specific topic.

Further, step S2 specifically includes:

S21. In a feature label space with J dimensions in total, for each feature label dimension f_j, sample a multinomial distribution over feature values, ψ_j ~ Dirichlet(λ);

S22. For all users in the dataset, sample a multinomial community distribution π ~ Dirichlet(γ);

S23. For each community c, sample a multinomial topic distribution θ_c ~ Dirichlet(α);

S24. For each topic z, sample a multinomial sentiment distribution φ_z ~ Dirichlet(μ);

S25. For each topic z, sample a time distribution following a Beta distribution, t ~ Beta(τ);

S26. For each specific sentiment s of each topic z, sample a multinomial word distribution [Figure RE-GDA0002029626700000051];

S27. For each user u_r:

a) Sample the community to which the user belongs [Figure RE-GDA0002029626700000052];

b) For each feature dimension describing the user [Figure RE-GDA0002029626700000053], sample the feature value of the user's j-th feature dimension [Figure RE-GDA0002029626700000054];

c) For each word w_{i,n} in each article [Figure RE-GDA0002029626700000055] written by user u_r:

i. Sample the word's topic from the community c of the article's author: z_{i,n} ~ Mul(θ_c);

ii. Sample the word's sentiment from topic z_{i,n}: s_{i,n} ~ Mul(φ_z);

iii. Sample the word's timestamp from topic z_{i,n}: t_{i,n} ~ Bin(τ_z);

iv. Sample the specific word from topic z_{i,n} and sentiment s_{i,n} [Figure RE-GDA0002029626700000056].

Further, step S3 specifically includes:

S31. Estimate the community probability distribution parameters of the document users:

In the community discovery step, the probability that user u_r, with feature labels [Figure RE-GDA0002029626700000057], belongs to community c is given by:

[Figure RE-GDA0002029626700000058]

where [Figure RE-GDA00020296267000000617] and [Figure RE-GDA00020296267000000618] are computed as:

[Figure RE-GDA0002029626700000062]

[Figure RE-GDA0002029626700000063]

where [Figure RE-GDA0002029626700000064] is the number of users belonging to community c, excluding user u_r, and [Figure RE-GDA0002029626700000065] is the number of times feature value k_j of the j-th feature dimension occurs among all users of community c, excluding user u_r;
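
The community discovery step can be illustrated with a short sketch. The formulas themselves are only available as figures here, so the expression below is an assumption: it uses the usual smoothed-count form of a collapsed Gibbs conditional, combining the community counts and feature-value counts described above with the priors γ and λ. The function name `community_posterior` and the layout of its arguments are hypothetical.

```python
def community_posterior(user_features, community_counts, feature_counts,
                        feature_sizes, gamma=1.0, lam=1.0):
    """Return P(c | features) for one user; all counts must exclude that user.

    community_counts[c]     -> number of other users in community c
    feature_counts[c][j][k] -> how often value k of feature j occurs in community c
    feature_sizes[j]        -> number of possible values of feature j
    """
    n_communities = len(community_counts)
    total_users = sum(community_counts)
    weights = []
    for c in range(n_communities):
        # Smoothed estimate of the community probability pi_c
        weight = (community_counts[c] + gamma) / (total_users + n_communities * gamma)
        for j, k in enumerate(user_features):
            # Smoothed estimate of the feature-value probability psi_{c,j,k}
            weight *= ((feature_counts[c][j][k] + lam)
                       / (community_counts[c] + feature_sizes[j] * lam))
        weights.append(weight)
    norm = sum(weights)
    return [w / norm for w in weights]
```

With two communities of three users each, one binary feature, and all users of the first community sharing value 0, a new user with value 0 is assigned to the first community with probability 0.8 under γ = λ = 1.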

S32. After community discovery for the users is complete, topic sampling and sentiment sampling are performed on each user-generated document according to the text content, sentiment label and timestamp of the training documents;

For a document [Figure RE-GDA0002029626700000066] whose sentiment label s_i is known and whose author u_r belongs to community [Figure RE-GDA0002029626700000067], the probability that its word w_{i,n} belongs to a given topic and sentiment is:

[Figure RE-GDA0002029626700000068]

where the parameters of [Figure RE-GDA0002029626700000069] are:

[Figure RE-GDA00020296267000000610]

[Figure RE-GDA00020296267000000611]

[Figure RE-GDA00020296267000000612]

[Figure RE-GDA00020296267000000613]

where [Figure RE-GDA00020296267000000614] is the number of words assigned to topic z in all documents of all users belonging to community c, excluding the n-th word w_{i,n} of document d_i; [Figure RE-GDA00020296267000000615] is the number of words with sentiment s under topic z in all documents, excluding word w_{i,n}; and [Figure RE-GDA00020296267000000616] is the number of times word w is assigned to the topic and the sentiment, excluding word w_{i,n};
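
A per-word sampling step of this kind might look like the sketch below. Since the exact formulas are only available as figures, the product form here is an assumption modeled on standard collapsed Gibbs samplers for LDA-style models: a community-topic term, a topic-sentiment term, a topic-sentiment-word term and a Beta density for the timestamp. The names `beta_pdf`, `topic_weights` and `sample_topic`, the count-array layout and the word-prior hyperparameter `beta` are all hypothetical.

```python
import math
import random


def beta_pdf(t, a, b):
    """Density of Beta(a, b) at t, for 0 < t < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))


def topic_weights(w, s, t, c, n_cz, n_zs, n_zsw, tau, alpha, mu, beta):
    """Unnormalized weight of each topic z for one word with known sentiment s.

    n_cz[c][z]     -> words of community c assigned to topic z
    n_zs[z][s]     -> words assigned to topic z with sentiment s
    n_zsw[z][s][w] -> occurrences of word w under topic z and sentiment s
    tau[z]         -> assumed Beta shape parameters (a_z, b_z) of topic z
    All counts must already exclude the word being resampled.
    """
    n_topics, n_sents, vocab = len(n_zs), len(n_zs[0]), len(n_zsw[0][0])
    n_c = sum(n_cz[c])
    weights = []
    for z in range(n_topics):
        n_z = sum(n_zs[z])
        p_topic = (n_cz[c][z] + alpha) / (n_c + n_topics * alpha)
        p_sent = (n_zs[z][s] + mu) / (n_z + n_sents * mu)
        p_word = (n_zsw[z][s][w] + beta) / (n_zs[z][s] + vocab * beta)
        p_time = beta_pdf(t, *tau[z])
        weights.append(p_topic * p_sent * p_word * p_time)
    return weights


def sample_topic(weights, rng):
    """Draw a topic index in proportion to the unnormalized weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

Only the topic is sampled here because the sentiment label of a training document is known; a sampler for documents with unknown sentiment would loop over every (topic, sentiment) pair instead.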

S33. The sentiment label s_i of a document is a known parameter, so when sampling the training documents, the words of each document update and sample only the parameters related to sentiment s_i under each topic; the sentiment labels thereby provide supervision during training;

The parameter [Figure RE-GDA0002029626700000071] is updated by moment estimation (the method of moments), computed as:

[Figure RE-GDA0002029626700000072]

[Figure RE-GDA0002029626700000073]

where [Figure RE-GDA0002029626700000074] and [Figure RE-GDA0002029626700000075] are, respectively, the mean and standard deviation of the timestamps of all words assigned to topic z.
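
As an illustration of updating a time distribution from the mean and standard deviation of timestamps, the sketch below applies the textbook method-of-moments estimator for a Beta distribution. Whether the patent's formulas match this estimator exactly cannot be confirmed from the text, so treat it as an assumption; the function name `beta_moment_estimate` is hypothetical, timestamps are assumed to be scaled into (0, 1), and the population variance is used.

```python
from statistics import mean, pvariance


def beta_moment_estimate(timestamps):
    """Estimate Beta shape parameters (a, b) from timestamps scaled to (0, 1)."""
    m = mean(timestamps)
    var = pvariance(timestamps, m)
    # A Beta distribution with mean m always has variance below m * (1 - m)
    if not 0.0 < var < m * (1.0 - m):
        raise ValueError("variance must lie in (0, m * (1 - m)) for a Beta fit")
    common = m * (1.0 - m) / var - 1.0
    return m * common, (1.0 - m) * common
```

For the timestamps 0.2, 0.4, 0.6 and 0.8, the mean is 0.5 and the population variance is 0.05, so the estimate is a = b = 2.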

Further, step S4 specifically includes:

S41. Given a test document d_test whose author is known to be user u_test, with that user's feature labels [Figure RE-GDA0002029626700000076] and the document's timestamp t_test, the sentiment label of the document is computed as:

[Figure RE-GDA0002029626700000077]

where the probability [Figure RE-GDA0002029626700000078] of the community it belongs to is computed as:

[Figure RE-GDA0002029626700000079]

The above formula simplifies to:

[Figure RE-GDA00020296267000000710]

where [Figure RE-GDA00020296267000000711] is the feature value in the user's j-th feature dimension [Figure RE-GDA00020296267000000712]. The parameters in the formula above are computed as:

[Figure RE-GDA00020296267000000713]

[Figure RE-GDA00020296267000000714]

[Figure RE-GDA0002029626700000081]

[Figure RE-GDA0002029626700000082]

[Figure RE-GDA0002029626700000083]

where [Figure RE-GDA0002029626700000084] is the number of words assigned to topic z in all documents of all users belonging to community c, across the test and training sets, excluding the n-th word w_{test,n} of document d_test; [Figure RE-GDA0002029626700000085] is the number of words with sentiment s under topic z in all documents of the test and training sets, excluding word w_{test,n}; [Figure RE-GDA0002029626700000086] is the number of times word w is assigned to the topic and the sentiment in the test and training sets, excluding word w_{test,n}; [Figure RE-GDA0002029626700000087] and [Figure RE-GDA0002029626700000088] are replaced by [Figure RE-GDA0002029626700000089] and [Figure RE-GDA00020296267000000810] respectively, where [Figure RE-GDA00020296267000000811] and [Figure RE-GDA00020296267000000812] are, respectively, the mean and standard deviation of the timestamps of all words assigned to topic z in the training and test sets; and [Figure RE-GDA00020296267000000813] is the number of users belonging to community c, excluding the author u_test of document d_test;

S42. Since the sentiment label s_test of a test document is unknown, when sampling the test documents the words of each document must update and sample the parameters related to every sentiment polarity under each topic, in order to determine which sentiment category the document most probably belongs to.

Compared with the prior art, the beneficial effects are:

1. The value of user feature labels is exploited: social network users are divided into communities by their feature labels, enabling community-level topic mining and sentiment analysis, which traditional joint topic-sentiment models cannot perform;

2. Targeting the characteristics of online social texts, the method effectively integrates four dimensions of information (text information, time, user features and sentiment labels), redefines how online social texts are generated, establishes a multi-dimensional topic-sentiment joint model, and provides topic information that can be observed and compared from multiple perspectives;

3. The multi-dimensional topic-sentiment joint model is applied to sentiment prediction in the field of public sentiment analysis; by integrating multi-dimensional information, it improves the accuracy of sentiment prediction for online social texts.

Description of the drawings

FIG. 1 is a schematic structural diagram of the multi-dimensional topic-sentiment joint model MTSM of the present invention.

FIG. 2 is a flow chart of text analysis with the MTSM of the present invention.

FIG. 3 is an algorithm flow chart of the multi-dimensional topic-sentiment model MTSM of the present invention.

FIG. 4 is an algorithm flow chart of the prediction step of the multi-dimensional topic-sentiment joint model MTSM of the present invention.

Detailed description

The drawings are for illustration only and shall not be construed as limiting the present invention. To better illustrate the embodiment, some components in the drawings may be omitted, enlarged or reduced, and do not represent the dimensions of an actual product. Those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings. The positional relationships depicted in the drawings are for illustration only and shall not be construed as limiting the present invention.

Example 1:

A topic-mining sentiment analysis method based on user feature optimization, comprising the following steps:

Step 1. Establish a multi-dimensional topic-sentiment joint model (MTSM) based on the LDA topic model; the model integrates text information, time, user features and sentiment labels;

As shown in FIG. 1, the MTSM model adds the following generative conditions to the original LDA topic model:

1) Add a global multinomial community distribution π whose prior follows a Dirichlet distribution, i.e. π ~ Dirichlet(γ); this distribution gives the probability that a user in the corpus belongs to each community;

2) Add global, community-specific multinomial user-feature distributions ψ, one for each kind of user feature, indexed by j, whose prior follows a Dirichlet distribution, i.e. ψ_j ~ Dirichlet(λ); this distribution gives the probability of each feature value among the users of a given community;

3) For each community, add a topic distribution θ_c over the articles in that community, i.e. all articles written by all users in the community share one topic distribution, whose prior follows a Dirichlet distribution, i.e. θ_c ~ Dirichlet(α); this distribution gives the topic probabilities of all articles by the users of each community;

4) For each topic, add a sentiment distribution φ_z whose prior follows a Dirichlet distribution, i.e. φ_z ~ Dirichlet(μ); this distribution gives the probability of each sentiment that users express toward a topic mined from the corpus;

5) For each topic, add a time distribution with parameter τ such that timestamps follow a Beta distribution, i.e. t ~ Beta(τ). This distribution represents how a topic is distributed over time;

6) For each specific sentiment under each specific topic, add a word distribution
Figure RE-GDA0002029626700000091
with a Dirichlet prior
Figure RE-GDA0002029626700000092
This distribution represents the probability of every word under a specific sentiment of a specific topic.

In this model, topic-mining sentiment analysis of a dataset consists of two main steps: user community discovery, followed by document topic extraction and sentiment prediction.

The first step is user community discovery. Incorporating user features into the topic model, and thereby constraining how topics are formed, is one of the innovations of this work. The motivation is as follows. First, in real life, people in different situations or environments often react to the same topic with different sentiments. For example, people of different income levels, regions, and ages often hold different views of the same news event, whereas similar groups of people tend to discuss similar topics and give similar sentiment feedback. Second, this work therefore proposes that the set of users in a dataset is in fact composed of different latent communities, each characterized by the distinct features that occur frequently within it. The total number of communities C is a preset parameter, which allows users to be partitioned at different levels of granularity.

The second step is topic mining and sentiment analysis. Once users have been clustered into communities, MTSM performs topic extraction and sentiment prediction on the articles written by the users of each community. It is assumed that within each community the users' articles follow one topic distribution, and that each topic has its own time distribution. In MTSM, a user generates a document as follows: according to the topic distribution of the user's community and the time distribution, the user chooses a topic with some probability; then, according to the sentiment distribution of that topic, chooses a sentiment polarity with some probability; finally, given the chosen topic and sentiment, chooses a word with some probability and adds it to the document. These steps repeat until the document is complete. Thus, in the MTSM model, the generation of a document is influenced not only by its topics but also by the author's community, the time, and the sentiment polarity. Combining the dimensions in this way yields the topic distribution within each community and, at the same time, the sentiment distribution under each topic, the time distribution of each topic, and the word distribution for each specific sentiment of each specific topic.

The notation required by the MTSM model is listed in Table 1:

Table 1. Parameters required by the MTSM model

Figure RE-GDA0002029626700000101

Figure RE-GDA0002029626700000111

Step 2. Put the test document into the MTSM model and generate documents according to the model.

The generation process of an article in the MTSM model is as follows:

S21. In the feature label space with J dimensions in total, for each feature label dimension f_j, sample a multinomial feature-value distribution ψ_j ~ Dirichlet(λ);

S22. For all users in the dataset, sample a multinomial community distribution π ~ Dirichlet(γ);

S23. For each community c, sample a multinomial topic distribution θ_c ~ Dirichlet(α);

S24. For each topic z, sample a multinomial sentiment distribution φ_z ~ Dirichlet(μ);

S25. For each topic z, sample a time distribution t ~ Beta(τ), i.e. a Beta distribution over timestamps;

S26. For each specific sentiment s of each topic z, sample a multinomial word distribution
Figure RE-GDA0002029626700000121

S27. For each user u_r:

a) Sample the community to which the user belongs
Figure RE-GDA0002029626700000126

b) For each feature dimension describing the user
Figure RE-GDA0002029626700000122
sample the user's feature value in the j-th feature dimension
Figure RE-GDA0002029626700000123

c) For each article written by user u_r
Figure RE-GDA0002029626700000124
and each word w_{i,n} in it:

i. Sample the topic of the word from the community c of the article's author: z_{i,n} ~ Mul(θ_c);

ii. Sample the sentiment of the word given topic z_{i,n}: s_{i,n} ~ Mul(φ_z);

iii. Sample the timestamp of the word given topic z_{i,n}: t_{i,n} ~ Beta(τ_z);

iv. Sample the concrete word given topic z_{i,n} and sentiment s_{i,n}
Figure RE-GDA0002029626700000125
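
The generative process S21 to S27 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `generate_corpus`, the toy sizes, the hyperparameter values, and the way the Beta shape parameters `tau` are drawn are all hypothetical, and ψ is stored per community and per feature dimension, as the description of the user feature distribution suggests.

```python
import numpy as np

def generate_corpus(n_users=4, docs_per_user=2, words_per_doc=5,
                    C=2, K=3, S=2, V=10, feature_sizes=(3, 2), seed=0):
    """Toy sampler for the generative story; all sizes and priors are assumed."""
    rng = np.random.default_rng(seed)
    gamma, lam, alpha, mu, beta = 1.0, 1.0, 0.5, 0.5, 0.1  # assumed hyperparameters

    # S21: one feature-value distribution per feature dimension j and community c
    psi = [rng.dirichlet(np.full(k, lam), size=C) for k in feature_sizes]
    pi = rng.dirichlet(np.full(C, gamma))                   # S22: community distribution
    theta = rng.dirichlet(np.full(K, alpha), size=C)        # S23: topics per community
    phi = rng.dirichlet(np.full(S, mu), size=K)             # S24: sentiments per topic
    tau = rng.uniform(1.0, 5.0, size=(K, 2))                # S25: Beta shapes per topic (assumed)
    word_dist = rng.dirichlet(np.full(V, beta), size=(K, S))  # S26: words per (topic, sentiment)

    corpus = []
    for u in range(n_users):                                # S27
        c = int(rng.choice(C, p=pi))                        # a) community of the user
        feats = [int(rng.choice(len(p[c]), p=p[c])) for p in psi]  # b) feature values
        for _ in range(docs_per_user):                      # c) articles of the user
            words = []
            for _ in range(words_per_doc):
                z = int(rng.choice(K, p=theta[c]))          # i. topic
                s = int(rng.choice(S, p=phi[z]))            # ii. sentiment
                t = float(rng.beta(tau[z, 0], tau[z, 1]))   # iii. timestamp in [0, 1]
                w = int(rng.choice(V, p=word_dist[z, s]))   # iv. word id
                words.append((w, z, s, t))
            corpus.append({"user": u, "community": c,
                           "features": feats, "words": words})
    return corpus
```

Calling `generate_corpus()` returns one dictionary per article; each word is stored with its sampled topic, sentiment, and timestamp so the latent assignments can be inspected.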

Step 3. Parameter solving

Once the document generation process of the model is known, the model parameters can be derived from it. The parameters give, for a batch of training texts, the community distribution, the user feature distribution of each community, the topic distribution, the sentiment distribution of each topic, the time distribution of each topic, and the word distribution for each sentiment of each topic. The parameters can also be used to predict the topic distribution and sentiment label of a document: for a test (unseen) text, the topic and sentiment label are inferred from the model parameters together with the text's user features, timestamp, and content. The steps for solving the model parameters are described first, followed by the steps for predicting the topic and sentiment label of a test (unseen) text. The main flow is shown in FIG. 2.

Because Gibbs sampling is simple and effective, it is used here to solve the MTSM model parameters; the detailed procedure is given in Algorithm 1. Parameter derivation has two main steps. The first estimates the community distribution parameters of the document users and discovers the user communities. The second, once each user's community is known, performs topic and sentiment detection on the documents written by that user. Gibbs sampling repeatedly samples each user and each word in the documents according to the formulas, inferring the community each user may belong to and the topic and sentiment label each word may belong to, until convergence, at which point the model parameters are known.

The algorithm flow chart for solving the parameters of the multi-dimensional topic-sentiment joint model is shown in FIG. 3.

S31. Estimate the community probability distribution parameters of the document users:

In the community discovery step, the probability that user u_r with feature labels
Figure RE-GDA0002029626700000131
belongs to community c is given by:

Figure RE-GDA0002029626700000132

where
Figure RE-GDA0002029626700000133
and
Figure RE-GDA0002029626700000134
are computed as:

Figure RE-GDA0002029626700000135

Figure RE-GDA0002029626700000136

where
Figure RE-GDA0002029626700000137
is the number of users belonging to community c, excluding user u_r, and
Figure RE-GDA0002029626700000138
is the number of occurrences of feature value k_j of the j-th feature dimension among all users in community c, excluding user u_r;
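
The community sampling step of S31 can be illustrated with a short Python sketch. The formulas of the source are not recoverable from this text, so the code uses the standard collapsed Gibbs conditional for a Dirichlet-multinomial mixture, which is built from exactly the two counts defined above; it is an assumption that the patent's expression has this form. The names `community_posterior`, `n_c`, and `n_cjk` are hypothetical.

```python
import numpy as np

def community_posterior(user_feats, n_c, n_cjk, gamma, lam):
    """Probability of each community for one user (assumed standard form).

    user_feats: feature value index k_j for each feature dimension j
    n_c:        array (C,), users per community, current user excluded
    n_cjk:      list over j of arrays (C, K_j), occurrences of each value
                of feature j in each community, current user excluded
    """
    log_p = np.log(n_c + gamma)                      # community popularity term
    for j, k in enumerate(user_feats):
        counts = n_cjk[j]
        n_values = counts.shape[1]
        # smoothed share of feature value k within each community
        log_p += np.log(counts[:, k] + lam)
        log_p -= np.log(counts.sum(axis=1) + n_values * lam)
    p = np.exp(log_p - log_p.max())                  # normalise in a stable way
    return p / p.sum()
```

A Gibbs sweep would draw the user's new community from the returned vector, for example with `numpy.random.Generator.choice`, and then add the user back into the counts.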

S32. After community discovery for the users is complete, topic sampling and sentiment sampling are performed on each user-generated document according to the text content, sentiment label, and timestamp of the training documents;

For a document
Figure RE-GDA0002029626700000139
whose sentiment label s_i is known and whose author u_r belongs to community
Figure RE-GDA00020296267000001310
the probability that its word w_{i,n} belongs to a given topic and sentiment is:

Figure RE-GDA00020296267000001311

where the parameters of
Figure RE-GDA00020296267000001312
are:

Figure RE-GDA00020296267000001313

Figure RE-GDA00020296267000001314

Figure RE-GDA00020296267000001315

Figure RE-GDA0002029626700000141

where
Figure RE-GDA0002029626700000142
is the number of words assigned to topic z in all documents of all users belonging to community c, excluding the n-th word w_{i,n} of document d_i,
Figure RE-GDA0002029626700000143
is the number of words assigned to sentiment s of topic z in all documents, excluding word w_{i,n}, and
Figure RE-GDA0002029626700000144
is the number of times word w is assigned to topic z and sentiment s, excluding word w_{i,n};
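
The word-level sampling of S32 can be sketched in the same spirit. Since the source formulas are not recoverable from this text, the code below combines the three counts defined above with a Beta density over the timestamp in the usual collapsed Gibbs manner; this combination, and the names `topic_posterior`, `n_cz`, and `n_zsw`, are assumptions rather than the patent's exact expression. The timestamp `t` is assumed to be normalised to the open interval (0, 1).

```python
import math
import numpy as np

def beta_log_pdf(t, a, b):
    """Log density of Beta(a, b) at t, for 0 < t < 1."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))

def topic_posterior(w, s, t, c, n_cz, n_zsw, tau, alpha, mu, beta):
    """Topic probabilities for one word with known sentiment s (assumed form).

    n_cz:  array (C, K), words per community and topic, current word excluded
    n_zsw: array (K, S, V), words per topic, sentiment and word id,
           current word excluded
    tau:   array (K, 2), Beta shape parameters of each topic
    """
    K, S, V = n_zsw.shape
    log_p = np.empty(K)
    for z in range(K):
        n_z = n_zsw[z].sum()          # words assigned to topic z
        n_zs = n_zsw[z, s].sum()      # words assigned to sentiment s of topic z
        log_p[z] = (math.log(n_cz[c, z] + alpha)
                    + math.log(n_zs + mu) - math.log(n_z + S * mu)
                    + math.log(n_zsw[z, s, w] + beta) - math.log(n_zs + V * beta)
                    + beta_log_pdf(t, tau[z, 0], tau[z, 1]))
    p = np.exp(log_p - log_p.max())
    return p / p.sum()
```

Because the sentiment of a training document is fixed, only the topic is sampled here; for a test document the same terms would be evaluated for every (topic, sentiment) pair.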

S33. The sentiment label s_i of a document is a known parameter. Therefore, when sampling the training documents, the words of each document update and sample only the parameters associated with sentiment s_i under each topic, so that the sentiment labels provide supervision during training;

The parameter
Figure RE-GDA0002029626700000145
is updated using the method of moments, computed as:

Figure RE-GDA0002029626700000146

Figure RE-GDA0002029626700000147

where
Figure RE-GDA0002029626700000148
and
Figure RE-GDA0002029626700000149
are, respectively, the mean and standard deviation of the timestamps of all words assigned to topic z.
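
A moment-based update of a topic's time parameters can be sketched as follows. The patent's own update formulas are not recoverable from this text; the code uses the textbook method-of-moments estimator for a Beta distribution, and it is an assumption that the patent uses the same one. The function name `beta_moments` is hypothetical, the timestamps are assumed to be normalised to (0, 1), and the population variance is used.

```python
import numpy as np

def beta_moments(timestamps):
    """Method-of-moments estimate of Beta shape parameters (a, b).

    Requires at least two distinct timestamps in (0, 1), so that the
    variance is positive and smaller than mean * (1 - mean).
    """
    t = np.asarray(timestamps, dtype=float)
    mean = t.mean()
    var = t.var()                                # population variance
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common
```

For example, `beta_moments([0.2, 0.4, 0.6, 0.8])` has mean 0.5 and variance 0.05, which gives both shape parameters equal to 2.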

Step 4. Perform topic mining and sentiment prediction on test documents. With the model parameters obtained, topic mining and sentiment prediction on a test document proceed in two steps: community discovery and word sampling of the document. The two steps are iterated until convergence, yielding new parameters based on both the training documents and the test documents, which are then used for topic mining and sentiment prediction.

The algorithm flow chart of the prediction step of the multi-dimensional topic-sentiment model is shown in FIG. 4.

Step S4 specifically includes:

S41. Suppose that for a document d_test the generating user is known to be u_test, the user's feature labels are
Figure RE-GDA00020296267000001410
and the timestamp of the document is t_test. The sentiment label of the document is then computed as:

Figure RE-GDA00020296267000001411

where the probability of the community to which the user belongs
Figure RE-GDA0002029626700000151
is computed as:

Figure RE-GDA0002029626700000152

The above formula simplifies to:

Figure RE-GDA0002029626700000153

where
Figure RE-GDA0002029626700000154
is the feature value in the user's j-th feature dimension
Figure RE-GDA0002029626700000155
The parameters in the above formula are computed as:

Figure RE-GDA0002029626700000156

Figure RE-GDA0002029626700000157

Figure RE-GDA0002029626700000158

Figure RE-GDA0002029626700000159

Figure RE-GDA00020296267000001510

where
Figure RE-GDA00020296267000001511
is the number of words assigned to topic z in all documents of all users belonging to community c, over both the test set and the training set, excluding the n-th word w_{test,n} of document d_test;
Figure RE-GDA00020296267000001512
is the number of words assigned to sentiment s of topic z in all documents of the test set and the training set, excluding word w_{test,n};
Figure RE-GDA00020296267000001513
is the number of times word w is assigned to topic z and sentiment s in the test set and the training set, excluding word w_{test,n};
Figure RE-GDA00020296267000001514
and
Figure RE-GDA00020296267000001515
are replaced by
Figure RE-GDA00020296267000001516
and
Figure RE-GDA00020296267000001517
respectively, where
Figure RE-GDA00020296267000001518
and
Figure RE-GDA00020296267000001519
are, respectively, the mean and standard deviation of the timestamps of all words assigned to topic z in the training set and the test set, and
Figure RE-GDA00020296267000001520
is the number of users belonging to community c, excluding the author u_test of document d_test;

S42. Since the sentiment label s_test of a test document is an unknown parameter, during sampling of the test documents the words of each document must update and sample the parameters related to every sentiment polarity under each topic, in order to determine which sentiment category the document most probably belongs to.
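
The final decision in S42 can be illustrated with a small Python sketch. The patent's sentiment formula is not recoverable from this text, so the aggregation below, which takes the most frequent sentiment among the sampled word-level labels of a test document, is an assumed stand-in; the name `predict_sentiment` is hypothetical.

```python
from collections import Counter

def predict_sentiment(sampled_sentiments, n_sentiments):
    """Pick the most probable sentiment from sampled word-level labels.

    sampled_sentiments: sentiment index of each word of the test document,
                        as assigned by the sampler after convergence
    Returns the winning index and the empirical probability of each label.
    """
    counts = Counter(sampled_sentiments)
    total = len(sampled_sentiments)
    probs = [counts.get(s, 0) / total for s in range(n_sentiments)]
    best = max(range(n_sentiments), key=probs.__getitem__)
    return best, probs
```

With `predict_sentiment([0, 1, 1, 2, 1], 3)`, three of the five words carry label 1, so label 1 is returned together with its empirical probability of 0.6.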

Obviously, the above embodiments of the present invention are merely examples given to illustrate the invention clearly, and do not limit its implementations. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (3)

1. A sentiment analysis method for topic mining based on user feature optimization, characterized by comprising the following steps:
S1. establishing a multi-dimensional topic-sentiment joint model (MTSM) based on the LDA topic model, wherein the model integrates text information, time, user features, and sentiment labels;
S2. training the model on a training corpus according to the document generation process of the model, and solving the parameters: estimating the community probability distribution parameters of the document users and discovering the user communities, and, once the community of each user is known, performing topic and sentiment detection on the documents written by that user; repeatedly sampling each word in the documents written by the users according to the formulas using the Gibbs sampling algorithm, and inferring the topic and sentiment label to which each word may belong, until convergence;
S3. after the model parameters are solved, the trained MTSM model can effectively perform topic mining and sentiment prediction on test documents;
S4. performing topic mining and sentiment prediction on a test document: with the model parameters obtained, topic mining and sentiment prediction on the test document proceed in two steps, community discovery and word sampling of the document; the two steps are iterated until convergence to obtain new parameters based on the training documents and the test documents, and topic mining and sentiment prediction are performed with the new parameters; specifically comprising:
S41. supposing that for a document d_test the generating user is known to be u_test, the user's feature labels are
Figure FDA0002981142930000011
and the timestamp of the document is t_test, the sentiment label of the document is computed as:
Figure FDA0002981142930000012
wherein the probability of the community to which the user belongs
Figure FDA0002981142930000013
is computed as:
Figure FDA0002981142930000014
the above formula simplifies to:
Figure FDA0002981142930000021
wherein
Figure FDA0002981142930000022
is the feature value in the user's j-th feature dimension
Figure FDA0002981142930000023
and the parameters in the above formula are computed as:
Figure FDA0002981142930000024
Figure FDA0002981142930000025
Figure FDA0002981142930000026
Figure FDA0002981142930000027
Figure FDA0002981142930000028
Figure FDA0002981142930000029
Figure FDA00029811429300000210
wherein
Figure FDA00029811429300000211
is the number of words assigned to topic z in all documents of all users belonging to community c, over both the test set and the training set, excluding the n-th word w_{test,n} of document d_test;
Figure FDA00029811429300000212
is the number of words assigned to sentiment s of topic z in all documents of the test set and the training set, excluding word w_{test,n};
Figure FDA00029811429300000213
is the number of times word w is assigned to topic z and sentiment s in the test set and the training set, excluding word w_{test,n};
Figure FDA00029811429300000214
and
Figure FDA00029811429300000215
are replaced by
Figure FDA00029811429300000216
and
Figure FDA00029811429300000217
respectively, wherein
Figure FDA00029811429300000218
and
Figure FDA00029811429300000219
are, respectively, the mean and standard deviation of the timestamps of all words assigned to topic z in the training set and the test set;
Figure FDA00029811429300000220
is the number of users belonging to community c, excluding the author u_test of document d_test;
Figure FDA00029811429300000221
is the number of users belonging to community c, excluding user u_r;
Figure FDA00029811429300000222
is the number of occurrences of feature value k_j of the j-th feature dimension among all users in community c, excluding user u_r; α, β, γ, λ and μ are the hyperparameters of the Dirichlet distributions;
S42. since the sentiment label s_test of the document is an unknown parameter, during sampling of the test documents the words of each document must update and sample the parameters related to every sentiment polarity under each topic, in order to determine which sentiment category the document most probably belongs to.
2. The sentiment analysis method for topic mining based on user feature optimization according to claim 1, wherein the MTSM model adds the following generation conditions to the original LDA topic model:
1) adding a global community multinomial distribution π with a Dirichlet prior, i.e. π ~ Dirichlet(γ), the distribution representing the probability that the users in a corpus belong to each community;
2) adding a global multinomial distribution ψ over user feature values within a given community, each user feature dimension, indexed by j, having its own distribution with a Dirichlet prior, i.e. ψ_j ~ Dirichlet(λ), the distribution representing how feature values are distributed among the users of a community;
3) for each community c, adding a topic distribution θ_c over the articles in that community, meaning that all articles written by all users in the community share one topic distribution, with a Dirichlet prior, i.e. θ_c ~ Dirichlet(α), the distribution representing the topic probabilities of all articles by the users of each community;
4) for each topic z, adding a sentiment distribution φ_z with a Dirichlet prior, i.e. φ_z ~ Dirichlet(μ), the distribution representing the users' sentiment toward the topics mined from a corpus;
5) for each topic, adding a time distribution with parameter τ such that timestamps follow a Beta distribution, i.e. t ~ Beta(τ), the distribution representing how a topic is distributed over time;
6) for each specific sentiment under each specific topic, adding a word distribution
Figure FDA0002981142930000031
with a Dirichlet prior
Figure FDA0002981142930000032
the distribution representing the probability of every word under a specific sentiment of a specific topic.
3. The sentiment analysis method for topic mining based on user feature optimization according to claim 2, wherein step S2 specifically includes:
S21. in the feature label space with J dimensions in total, for each feature label dimension f_j, sampling a multinomial feature-value distribution ψ_j ~ Dirichlet(λ);
S22. for all users in the dataset, sampling a multinomial community distribution π ~ Dirichlet(γ);
S23. for each community c, sampling a multinomial topic distribution θ_c ~ Dirichlet(α);
S24. for each topic z, sampling a multinomial sentiment distribution φ_z ~ Dirichlet(μ);
S25. for each topic z, sampling a time distribution t ~ Beta(τ), i.e. a Beta distribution over timestamps;
S26. for each specific sentiment s of each topic z, sampling a multinomial word distribution
Figure FDA0002981142930000041
S27. for each user u_r:
a) sampling the community to which the user belongs
Figure FDA0002981142930000042
b) for each feature dimension describing the user
Figure FDA0002981142930000043
sampling the user's feature value in the j-th feature dimension
Figure FDA0002981142930000044
c) for each article written by user u_r
Figure FDA0002981142930000045
and each word w_{i,n} in it:
i. sampling the topic of the word from the community c of the article's author: z_{i,n} ~ Mul(θ_c);
ii. sampling the sentiment of the word given topic z_{i,n}: s_{i,n} ~ Mul(φ_z);
iii. sampling the timestamp of the word given topic z_{i,n}: t_{i,n} ~ Beta(τ_z);
iv. sampling the concrete word given topic z_{i,n} and sentiment s_{i,n}
Figure FDA0002981142930000046
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910218584.2A | 2019-03-21 | 2019-03-21 | A sentiment analysis method for topic mining based on user feature optimization

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201910218584.2A | 2019-03-21 | 2019-03-21 | A sentiment analysis method for topic mining based on user feature optimization

Publications (2)

Publication Number | Publication Date
CN109933657A | 2019-06-25
CN109933657B | 2021-07-09

Family

ID=66987925

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN201910218584.2A | Active | CN109933657B (en) | 2019-03-21 | 2019-03-21 | A sentiment analysis method for topic mining based on user feature optimization

Country Status (1)

Country | Link
CN (1) | CN109933657B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN110807315A (en)*2019-10-152020-02-18上海大学Topic model-based online comment emotion mining method
CN110851733A (en)*2019-10-312020-02-28天津大学 Community Discovery and Sentiment Interpretation Methods Based on Network Topology and Document Content
CN112948570A (en)*2019-12-112021-06-11复旦大学Unsupervised automatic domain knowledge map construction system
CN111222315B (en)*2019-12-312023-04-18天津外国语大学Movie scenario prediction method
CN111309903B (en)*2020-01-202023-06-16北京大米未来科技有限公司 A data processing method, device, storage medium and electronic equipment
CN112182187B (en)*2020-09-302022-09-02天津大学Method for extracting important time segments in short text of social media
CN112445982A (en)*2020-11-262021-03-05天津大学Social network-based emotion interaction community detection method
CN112905741B (en)*2021-02-082022-04-12合肥供水集团有限公司Water supply user focus mining method considering space-time characteristics
CN113205117B (en)*2021-04-152023-07-04索信达(北京)数据技术有限公司Community dividing method, device, computer equipment and storage medium
CN113935321B (en)*2021-10-192024-03-26昆明理工大学Adaptive iterative Gibbs sampling method suitable for LDA topic model
CN114461879B (en)*2022-01-212024-10-15哈尔滨理工大学Semantic social network multi-view community discovery method based on text feature integration
CN114913951B (en)*2022-05-142025-05-13云知声智能科技股份有限公司 A medical record inconsistency detection method, system, device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN107613520A (en)*2017-08-292018-01-19重庆邮电大学 A Method for Discovering Telecom User Similarity Based on LDA Topic Model
CN108694176A (en)*2017-04-062018-10-23北京京东尚科信息技术有限公司Method, apparatus, electronic equipment and the readable storage medium storing program for executing of document sentiment analysis
CN108829799A (en)*2018-06-052018-11-16中国人民公安大学Based on the Text similarity computing method and system for improving LDA topic model
CN109446404A (en)*2018-08-302019-03-08中国电子进出口有限公司A kind of the feeling polarities analysis method and device of network public-opinion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US9542477B2 (en)*2013-12-022017-01-10Qbase, LLCMethod of automated discovery of topics relatedness

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN108694176A (en) * | 2017-04-06 | 2018-10-23 | 北京京东尚科信息技术有限公司 | Method, apparatus, electronic equipment and the readable storage medium storing program for executing of document sentiment analysis
CN107613520A (en) * | 2017-08-29 | 2018-01-19 | 重庆邮电大学 | A Method for Discovering Telecom User Similarity Based on LDA Topic Model
CN108829799A (en) * | 2018-06-05 | 2018-11-16 | 中国人民公安大学 | Based on the Text similarity computing method and system for improving LDA topic model
CN109446404A (en) * | 2018-08-30 | 2019-03-08 | 中国电子进出口有限公司 | A kind of the feeling polarities analysis method and device of network public-opinion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Supervised Intensive Topic Models for Emotion Detection over Short Text"; Yanghui Rao et al.; International Conference on Database Systems for Advanced Applications; 2017-03-22; pp. 408-422 *
"Microblog topic sentiment mining based on multi-feature fusion" (基于多特征融合的微博主题情感挖掘); 黄发良 et al.; Chinese Journal of Computers (计算机学报); 2017-04-30; Vol. 40, No. 4; pp. 872-888 *

Also Published As

Publication number | Publication date
CN109933657A (en) | 2019-06-25

Similar Documents

Publication | Title
CN109933657B (en) | A sentiment analysis method for topic mining based on user feature optimization
Wang et al. | Feature extraction and analysis of natural language processing for deep learning English language
CN113704546B (en) | Video natural language text retrieval method based on space time sequence characteristics
CN108052593B (en) | A topic keyword extraction method based on topic word vector and network structure
CN105117428B (en) | A kind of web comment sentiment analysis method based on word alignment model
CN104036010B (en) | Semi-supervised CBOW based user search term subject classification method
CN108549647B (en) | Method for realizing active prediction of emergency in mobile customer service field without marking corpus based on SinglePass algorithm
US20060112146A1 (en) | Systems and methods for data analysis and/or knowledge management
CN107688870B (en) | A method and device for visual analysis of hierarchical factors of deep neural network based on text stream input
CN104216954A (en) | Prediction device and prediction method for state of emergency topic
CN104636425A (en) | Method for predicting and visualizing emotion cognitive ability of network individual or group
Pan et al. | Advancements of artificial intelligence techniques in the realm about library and information subject—a case survey of latent Dirichlet allocation method
Wang et al. | Detecting hot topics from academic big data
Tang et al. | Co-attentive representation learning for web services classification
CN118069927A (en) | News recommendation method and system based on knowledge perception and user multi-interest feature representation
CN118377900A (en) | Social public opinion event detection method based on hyperbolic graph clustering
Kim | Text classification based on neural network fusion
Kusum et al. | Sentiment analysis using global vector and long short-term memory
Ge et al. | A Novel Chinese Domain Ontology Construction Method for Petroleum Exploration Information.
Ruch | Can x2vec save lives? integrating graph and language embeddings for automatic mental health classification
Parihar | A study on sentiment analysis of product reviews
Ma et al. | Friend closeness based user matching cross social networks
CN112036165A (en) | Method for constructing news characteristic vector and application
CN116226533A (en) | Method, device and medium for news association recommendation based on association prediction model
Wang et al. | A Method of Hot Topic Detection in Blogs Using N-gram Model.

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
