



Technical Field
The present invention belongs to the field of natural language processing, specifically to sentiment analysis and topic mining, and more particularly relates to a topic mining and sentiment analysis method based on user feature optimization.
Background Art
Text on Internet social networks carries users' opinions and personal emotions, and the process of extracting this information from such unstructured web data is called sentiment analysis or opinion mining. By their underlying approach, existing methods fall mainly into machine learning models, lexicon-based models, and topic models. With the rapid development of topic models in recent years, many of them have been extended into sentiment classification models and applied to user-generated Internet text, for example for sentiment classification and topic mining of product reviews and movie reviews.
Mei Q et al. proposed the first joint sentiment-topic model, the Topic-Sentiment Model (TSM). Built on pLSA, it models sentiment and topic simultaneously: for each word in a document, the model first decides whether the word carries positive or negative sentiment, then decides the word's topic, and finally chooses the word under that topic. Like pLSA, TSM tends to overfit on small datasets. Drawing on the advantages of LDA, Lin C and He Y proposed the JST model, which not only places prior distributions on the latent topic and sentiment variables but also assigns each document a multinomial sentiment distribution and, for each sentiment label of that document, a multinomial topic distribution. In JST, however, topics and sentiments are relatively independent of each other, and this simple combination introduces noise in the form of inconsistent sentiment within a document. Building on JST, Jo Y and Oh AH proposed the ASUM model, which assumes that all words in a sentence share a single topic and a single sentiment label.
Li F et al. proposed the Sentiment-LDA and Dependency-Sentiment-LDA models. Sentiment-LDA assumes that a document's multinomial topic distribution determines its binomial sentiment distribution, while Dependency-Sentiment-LDA uses the conjunctions in a sentence (such as "but", "and", and "however") to reduce inconsistency in word sentiment. To address JST's inability to separate topic words from sentiment words, Zhao W et al. proposed the Maximum Entropy LDA (Max-Ent LDA) model, which uses the maximum entropy principle to separate words into background words and topic-specific words, improving both the precision of topic mining and the accuracy of sentiment analysis.
Xu K et al. proposed the TUS-LDA model, which combines time information, user identity, and sentiment polarity to mine personal interests and detect social hot topics. TUS-LDA divides topics into two classes: "static topics" related to a user's personal interests, and "dynamic topics" related to social events that change substantially over time. If the topic of a social network post is a static topic, TUS-LDA uses its "user-sentiment-topic" sub-model to analyze the user's personal interests and sentiment polarity; otherwise it uses its "time-sentiment-topic" sub-model to obtain hot topics, events, and public opinion. Both sub-models again assume that the sentiment categories of each text follow a multinomial distribution, and the sentiment category then determines either a multinomial distribution over a user's topics of interest or a multinomial distribution over the topics of a time period.
All of the joint topic-sentiment models above are unsupervised and rely on auxiliary information from a sentiment lexicon to improve sentiment prediction. To apply topic models to supervised learning, Mcauliffe JD and Blei DM proposed supervised topic models (SLDA), which apply to both classification and regression problems; however, that model does not examine in depth the connection between the topic layer and the sentiment layer. Bao S et al. proposed the Emotion-Term Model (ETM), a supervised topic model for sentiment analysis built from the writer's perspective: for public sentiment classification, it uses an existing training corpus and the public sentiment votes attached to each article to predict public sentiment feedback on the test corpus. Rao Y et al. proposed the supervised models Multi-label supervised topic model (MSTM) and Sentiment latent topic model (SLTM); their experiments indicate that MSTM and SLTM, which are built from the reader's perspective, are better suited to predicting public sentiment votes.
Most past work fuses the text with only one or two additional dimensions of information, such as posting time, sentiment polarity, or author identity. No existing work addresses the characteristics of social network text by fully mining and effectively integrating the text content, posting time, and user features that such text provides, so that each dimension contributes its full value to accurate mining. For example, although TUS-LDA combines four dimensions (sentiment, time, text, and user identity), it does not use users' feature information. Each of these dimensions is valuable for topic-model-based sentiment analysis, as follows:
First, hot topics in online public opinion change quickly and evolve markedly over time. For example, discussion of the once-prominent topic "helping the elderly cross the road" was long dominated by words such as "extortion", "moral bottom line", and "indifference", which expressed people's worry and condemnation and carried a negative sentiment polarity. Some time later, as the events cooled and people reflected more rationally, public expression on the topic gradually returned to positive words such as "virtue", "kindness", and "justice", and the sentiment polarity became positive again.
Second, sentiment labels supervise the model's topic modeling and sentiment analysis, which helps it better distinguish the connections between different topics and different sentiments.
Finally, users' feature labels also influence topics and sentiments. For the same news event, for example, men and women, or working-class and middle-class people, will hold subtly different views and feelings. These differences are inseparable from the influence of each person's environment, and feature labels are precisely what describe users and their environments. Internet users number in the tens of thousands, so applying author-topic modeling in the style of the AT model to topic extraction and sentiment analysis of social text, that is, tracking and modeling every individual user, would produce far too many model parameters and could not cope with the sheer volume of social text. At the same time, social networks are open to everyone, and while individuals differ they also share common traits. Using these traits to partition users into communities at different levels of granularity before topic modeling both reduces the number of model parameters and lets the information from people in the same community complement one another, which yields richer topic information and more effective sentiment prediction. So far, however, no work has proposed how to effectively integrate users' multi-dimensional features together with time, text, and sentiment labels into a topic model.
Summary of the Invention
To overcome at least one of the above defects of the prior art, the present invention provides a topic mining and sentiment analysis method based on user feature optimization. The method effectively integrates four dimensions of information (text, time, user features, and sentiment labels), redefines how online social text is generated, and builds a multi-dimensional joint topic-sentiment model; by integrating this multi-dimensional information it improves the accuracy of sentiment prediction for online social text.
To solve the above technical problem, the present invention adopts the following technical solution: a topic mining and sentiment analysis method based on user feature optimization, comprising the following steps:
S1. Build a multi-dimensional joint topic-sentiment model (MTSM) on the basis of the LDA topic model; the model integrates text information, time, user features, and sentiment labels;
S2. Following the document generation process of the model, train the model on a training corpus and solve for its parameters: estimate the parameters of the community probability distribution of the document authors to discover user communities; once a user's community is known, detect the topics and sentiments of the documents written by that user; use the Gibbs sampling algorithm to repeatedly sample every word in the user's documents according to the sampling formula, inferring the topic and sentiment label each word is likely to belong to, until convergence;
S3. Once the model parameters have been solved, the trained MTSM model can effectively perform topic mining and sentiment prediction on test documents;
S4. Perform topic mining and sentiment prediction on test documents: with the model parameters obtained, processing a test document consists of two steps, community discovery and word sampling; these two steps are iterated until convergence, yielding new parameters based on both the training and the test documents, which are then used for topic mining and sentiment prediction.
Further, the MTSM model adds the following generative conditions to the original LDA topic model:
1) A global multinomial community distribution $\pi$ with a Dirichlet prior, $\pi \sim \mathrm{Dirichlet}(\gamma)$; it represents the probability that a user in the corpus belongs to each community;
2) A global multinomial distribution $\psi$ over user feature values within a given community; each feature dimension, indexed by $j$, has its own distribution with a Dirichlet prior, $\psi_j \sim \mathrm{Dirichlet}(\lambda)$; it represents how users' feature values are distributed within a community;
3) For each community, a topic distribution $\theta_c$ over the documents of that community, i.e. all documents written by all users of the community share one topic distribution, with a Dirichlet prior, $\theta_c \sim \mathrm{Dirichlet}(\alpha)$; it represents the topic distribution of all documents written by the users of each community;
4) For each topic, a sentiment distribution $\varphi_z$ with a Dirichlet prior, $\varphi_z \sim \mathrm{Dirichlet}(\mu)$; it represents the distribution of users' sentiments toward each topic mined from the corpus;
5) For each topic, a time distribution with parameter $\tau$ that follows a Beta distribution, $t \sim \mathrm{Beta}(\tau)$; it represents how the topic is distributed over time;
6) For each specific sentiment under each specific topic, a word distribution with a Dirichlet prior; it represents the distribution of all words under that sentiment of that topic.
Further, step S2 specifically comprises:
S21. In a feature label space with $J$ dimensions in total, for each feature dimension $f_j$, sample a multinomial distribution over feature values, $\psi_j \sim \mathrm{Dirichlet}(\lambda)$;
S22. For all users in the dataset, sample a multinomial community distribution, $\pi \sim \mathrm{Dirichlet}(\gamma)$;
S23. For each community $c$, sample a multinomial topic distribution, $\theta_c \sim \mathrm{Dirichlet}(\alpha)$;
S24. For each topic $z$, sample a multinomial sentiment distribution, $\varphi_z \sim \mathrm{Dirichlet}(\mu)$;
S25. For each topic $z$, set up a time distribution of Beta form, $t \sim \mathrm{Beta}(\tau)$;
S26. For each specific sentiment $s$ of each topic $z$, sample a multinomial word distribution;
S27. For each user $u_r$:
a) sample the community to which the user belongs;
b) for each feature dimension describing the user, sample the user's feature value in the $j$-th dimension;
c) for each word $w_{i,n}$ of each document written by user $u_r$:
i. sample the topic of the word according to the community $c$ of the document's author, $z_{i,n} \sim \mathrm{Mul}(\theta_c)$;
ii. sample the sentiment of the word according to the topic $z_{i,n}$, $s_{i,n} \sim \mathrm{Mul}(\varphi_z)$;
iii. sample the timestamp of the word according to the topic $z_{i,n}$, $t_{i,n} \sim \mathrm{Beta}(\tau_z)$;
iv. sample the word itself according to the topic $z_{i,n}$ and the sentiment $s_{i,n}$.
Further, step S3 specifically comprises:
S31. Estimate the parameters of the community probability distribution of the document authors:
In the community discovery step, the probability that user $u_r$, with the given feature labels, belongs to community $c$ is given by the following formula:
where the terms of the formula are computed as:
These formulas use two counts: the number of users other than $u_r$ that belong to community $c$, and the number of times the feature value $k_j$ of the $j$-th feature dimension occurs among all users of community $c$ other than $u_r$;
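The community-discovery step can be illustrated with a short sketch. The text fixes the two counts involved but not the exact expression, so the code below assumes the usual collapsed Gibbs form for a mixture with categorical features: a prior term `(n_c + gamma)` multiplied, for each feature dimension, by `(n_{c,j,k_j} + lambda) / (n_c + K_j * lambda)`. The function name `community_posterior`, the array layout, and the symmetric hyperparameter values are all hypothetical.

```python
import numpy as np

def community_posterior(user_features, n_users_c, n_feat_c, gamma=1.0, lam=1.0):
    """Normalized probability of each community for one held-out user.

    user_features: feature value k_j (an int) for each feature dimension j
    n_users_c:     shape (C,), users per community, excluding this user
    n_feat_c:      list over j of arrays shaped (C, K_j); entry [c, k] counts
                   how often value k of dimension j occurs in community c,
                   excluding this user
    The functional form is an assumed standard collapsed Gibbs conditional.
    """
    n_users_c = np.asarray(n_users_c, dtype=float)
    # Prior term: communities with more members are more likely a priori.
    log_w = np.log(n_users_c + gamma)
    for j, k_j in enumerate(user_features):
        counts = np.asarray(n_feat_c[j], dtype=float)
        num_values = counts.shape[1]
        # Smoothed likelihood of the user's j-th feature value in each community.
        log_w += np.log(counts[:, k_j] + lam) - np.log(n_users_c + num_values * lam)
    weights = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    return weights / weights.sum()

# Two communities, one binary feature; the user has feature value 0.
probs = community_posterior([0], [3, 1], [np.array([[3, 0], [0, 1]])])
```

A Gibbs sweep would draw the user's community from `probs` and then add the user back into the counts of the sampled community.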
S32. After community discovery has been completed for the users, topic sampling and sentiment sampling are performed on every user-generated document, based on the text content, sentiment labels, and timestamps of the training documents;
For a document whose sentiment label $s_i$ is known and whose author $u_r$ belongs to a known community, the probability that its word $w_{i,n}$ belongs to a given topic and sentiment is:
where the parameters are:
Three counts are used here: the number of words assigned to topic $z$ in all documents of all users belonging to community $c$, excluding the $n$-th word $w_{i,n}$ of document $d_i$; the number of words in all documents assigned to sentiment $s$ under topic $z$, excluding $w_{i,n}$; and the number of times word $w$ is assigned to that topic and that sentiment, excluding $w_{i,n}$;
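The word-level sampling in S32 can be sketched as follows. The exact expression used by the method is not shown in this text, so the code assumes the usual collapsed Gibbs conditional built from the three counts described above, multiplied by a Beta density for the word's timestamp. The names `topic_weights`, `n_cz`, `n_zs`, `n_zsw`, the hyperparameter `beta` for the word distribution, and the representation of each topic's time parameter as a pair `(a, b)` are all assumptions made for illustration.

```python
import numpy as np
from math import lgamma

def log_beta_pdf(t, a, b):
    """Log density of Beta(a, b) at a timestamp t normalized into (0, 1)."""
    log_norm = lgamma(a) + lgamma(b) - lgamma(a + b)
    return (a - 1.0) * np.log(t) + (b - 1.0) * np.log1p(-t) - log_norm

def topic_weights(w, s, c, t, n_cz, n_zs, n_zsw, tau, alpha, mu, beta):
    """Normalized topic probabilities for one word with known sentiment s.

    n_cz[c, z]:      words of topic z in community c (current word excluded)
    n_zs[z, s]:      words of sentiment s under topic z (current word excluded)
    n_zsw[z, s, w]:  occurrences of word w under topic z and sentiment s
    tau:             per-topic (a, b) Beta parameters (assumed representation)
    """
    num_topics, num_sentiments, vocab_size = n_zsw.shape
    n_z = n_zs.sum(axis=1)
    log_p = (
        np.log(n_cz[c] + alpha)                                  # community -> topic
        + np.log(n_zs[:, s] + mu) - np.log(n_z + num_sentiments * mu)   # topic -> sentiment
        + np.log(n_zsw[:, s, w] + beta) - np.log(n_zs[:, s] + vocab_size * beta)  # -> word
        + np.array([log_beta_pdf(t, a, b) for a, b in tau])      # topic -> timestamp
    )
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

# Tiny example: 2 topics, 2 sentiments, 3 vocabulary words, 1 community.
n_zsw = np.zeros((2, 2, 3))
n_zsw[0, 0, 1] = 4.0
n_zs = n_zsw.sum(axis=2)
n_cz = np.array([[5.0, 1.0]])
p = topic_weights(1, 0, 0, 0.5, n_cz, n_zs, n_zsw, [(1.0, 1.0), (1.0, 1.0)], 0.5, 0.5, 0.1)
```

Because the sentiment label of a training document is known, only the topic is sampled here; for a test document the same weights would be evaluated for every sentiment as well.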
S33. The sentiment label $s_i$ of a document is a known quantity, so when sampling over the training documents, each word of a document updates and samples only the parameters associated with sentiment $s_i$ under each topic; the sentiment label thereby provides supervision during training;
The parameters of the time distribution are updated by the method of moments, computed as follows:
where the two statistics used are the mean and the standard deviation of the timestamps of all words assigned to topic $z$.
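As an illustration of the moment-based update, the sketch below applies the standard moment-matching estimator for a Beta distribution: with mean $m$ and variance $v$, the shape parameters are $a = m\,(m(1-m)/v - 1)$ and $b = (1-m)\,(m(1-m)/v - 1)$. That the time parameter of each topic is a pair of Beta shape parameters, and that timestamps are normalized into (0, 1), are assumptions; the function names are hypothetical.

```python
import numpy as np

def beta_params_from_moments(mean, std):
    """Moment-matching estimate of Beta shape parameters (a, b)."""
    var = std ** 2
    # The estimator is only defined when 0 < var < mean * (1 - mean).
    if not (0.0 < mean < 1.0) or not (0.0 < var < mean * (1.0 - mean)):
        raise ValueError("mean/std are not compatible with a Beta distribution")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

def fit_topic_time(timestamps):
    """Fit (a, b) from the normalized timestamps of all words assigned to one topic."""
    ts = np.asarray(timestamps, dtype=float)
    return beta_params_from_moments(ts.mean(), ts.std())

# Beta(2, 3) has mean 0.4 and standard deviation 0.2.
a, b = beta_params_from_moments(0.4, 0.2)
a2, b2 = fit_topic_time([0.2, 0.4, 0.6])
```

In a Gibbs sampler this update would typically run once per sweep, after the topic assignments of all words have been resampled.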
Further, step S4 specifically comprises:
S41. Consider a document $d_{test}$ generated by a known user $u_{test}$ with known feature labels, and with timestamp $t_{test}$; the sentiment label of the document is computed according to the following formula:
where the probability of the community to which the user belongs is computed according to:
The above formula simplifies to:
where the feature value in the user's $j$-th feature dimension is used, and the parameters in the above formula are computed as:
Here the counts are taken over both the test set and the training set: the number of words assigned to topic $z$ in all documents of all users belonging to community $c$, excluding the $n$-th word $w_{test,n}$ of document $d_{test}$; the number of words in all documents assigned to sentiment $s$ under topic $z$, excluding $w_{test,n}$; and the number of times word $w$ is assigned to that topic and that sentiment, excluding $w_{test,n}$. Two of the quantities are replaced by the mean and the standard deviation of the timestamps of all words assigned to topic $z$ across the training and test sets, and the user count is the number of users belonging to community $c$ other than $u_{test}$, the author of $d_{test}$;
S42. Because the sentiment label $s_{test}$ of a test document is unknown, when sampling a test document each word must update and sample the parameters for every sentiment polarity under each topic, which determines the sentiment category to which the document most probably belongs.
Compared with the prior art, the beneficial effects are:
1. The value of user feature labels is exploited: social network users are partitioned into communities by their feature labels, which enables topic mining and sentiment analysis at the community level, something traditional joint topic-sentiment models cannot do;
2. Addressing the characteristics of online social text, the method effectively integrates four dimensions of information (text, time, user features, and sentiment labels), redefines how online social text is generated, builds a multi-dimensional joint topic-sentiment model, and allows topic information to be observed and compared from multiple perspectives;
3. The multi-dimensional joint topic-sentiment model is applied to sentiment prediction in the field of public sentiment analysis. By integrating multi-dimensional information, it improves the accuracy of sentiment prediction for online social text.
Brief Description of the Drawings
Figure 1 is a schematic diagram of the structure of the multi-dimensional joint topic-sentiment model MTSM of the present invention.
Figure 2 is a flow chart of text analysis with MTSM according to the present invention.
Figure 3 is a flow chart of the algorithm of the multi-dimensional joint topic-sentiment model MTSM of the present invention.
Figure 4 is a flow chart of the prediction step of the multi-dimensional joint topic-sentiment model MTSM of the present invention.
Detailed Description
The drawings are for illustrative purposes only and shall not be construed as limiting the present invention. To better illustrate this embodiment, some parts of the drawings may be omitted, enlarged, or reduced, and do not represent the dimensions of an actual product. Those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings. The positional relationships depicted in the drawings are likewise illustrative only and shall not be construed as limiting the present invention.
Embodiment 1:
A topic mining and sentiment analysis method based on user feature optimization, comprising the following steps:
Step 1. Build a multi-dimensional joint topic-sentiment model (MTSM) on the basis of the LDA topic model; the model integrates text information, time, user features, and sentiment labels;
As shown in Figure 1, the MTSM model adds the following generative conditions to the original LDA topic model:
1) A global multinomial community distribution $\pi$ with a Dirichlet prior, $\pi \sim \mathrm{Dirichlet}(\gamma)$; it represents the probability that a user in the corpus belongs to each community;
2) A global multinomial distribution $\psi$ over user feature values within a given community; each feature dimension, indexed by $j$, has its own distribution with a Dirichlet prior, $\psi_j \sim \mathrm{Dirichlet}(\lambda)$; it represents how users' feature values are distributed within a community;
3) For each community, a topic distribution $\theta_c$ over the documents of that community, i.e. all documents written by all users of the community share one topic distribution, with a Dirichlet prior, $\theta_c \sim \mathrm{Dirichlet}(\alpha)$; it represents the topic distribution of all documents written by the users of each community;
4) For each topic, a sentiment distribution $\varphi_z$ with a Dirichlet prior, $\varphi_z \sim \mathrm{Dirichlet}(\mu)$; it represents the distribution of users' sentiments toward each topic mined from the corpus;
5) For each topic, a time distribution with parameter $\tau$ that follows a Beta distribution, $t \sim \mathrm{Beta}(\tau)$; it represents how the topic is distributed over time;
6) For each specific sentiment under each specific topic, a word distribution with a Dirichlet prior; it represents the distribution of all words under that sentiment of that topic.
In this model, topic mining and sentiment analysis of a dataset proceeds in two main steps: user community discovery, followed by document topic extraction and sentiment prediction.
The first step is user community discovery. Incorporating user features into the topic model, and thereby constraining how topics form, is one of the innovations of this work. In real life, people in different circumstances or environments often feel differently about the same topic. For the same news event, for example, people of different income levels, regions, and ages tend to hold different views, whereas similar groups of people tend to discuss similar aspects of a topic and to respond with similar sentiments. Accordingly, this work takes the view that the set of users in a dataset is actually composed of different latent communities, each characterized by the features that occur most frequently within it. The total number of communities $C$ is a preset parameter, which allows users to be partitioned at different levels of granularity.
The second step is topic mining and sentiment analysis. Once users have been grouped into communities, MTSM performs topic extraction and sentiment prediction on the documents of the users in each community. It is assumed that within each community the users' documents follow a single topic distribution, and that different topics have different time distributions. In MTSM, a user generates a document as follows: based on the topic distribution and time distribution of the user's community, the user selects a topic with a certain probability; then, based on the sentiment distribution of that topic, selects a sentiment polarity with a certain probability; and finally, based on the specific topic and sentiment, selects a word with a certain probability and adds it to the document. These steps repeat until the document is complete. In the MTSM model, therefore, the generation of a document is influenced not only by its topic but also by the author's community, the time, and the sentiment polarity. Combining the dimensions in this way yields not only the topic distribution within a community but also, at the same time, the sentiment distribution under each topic, the time distribution of each topic, and the word distribution for a specific sentiment of a specific topic.
The parameter notation required by the MTSM model is listed in Table 1:
Table 1 Parameters required by the MTSM model
Step 2. Feed the test document into the MTSM model and generate the document according to the model;
A document in the MTSM model is generated as follows:
S21. In the feature label space of J dimensions in total, for each dimension's feature label f_j, sample a multinomial feature-value probability distribution ψ_j ~ Dirichlet(λ);
S22. For all users in the data set, sample a multinomial community probability distribution π ~ Dirichlet(γ);
S23. For each clustered community c, sample a multinomial topic probability distribution θ_c ~ Dirichlet(α);
S24. For each topic z, sample a multinomial sentiment probability distribution φ_z ~ Dirichlet(μ);
S25. For each topic z, sample a binomial time probability distribution t ~ Beta(τ);
S26. For each specific sentiment s of each topic z, sample a multinomial word probability distribution;
S27. For each user u_r:
a) sample the community to which the user belongs;
b) for each feature dimension describing the user, sample the feature value of the user's j-th feature dimension;
c) for each word w_{i,n} in each article d_i written by user u_r:
i. according to the community c to which the article's author belongs, sample the topic of the word, z_{i,n} ~ Mul(θ_c);
ii. according to the topic z_{i,n}, sample the sentiment of the word, s_{i,n} ~ Mul(φ_z);
iii. according to the topic z_{i,n}, sample the timestamp of the word, t_{i,n} ~ Bin(τ_z);
iv. according to the topic z_{i,n} and the sentiment s_{i,n}, sample the specific word w_{i,n}.
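Steps S21 to S27 can be sketched as a small simulation. This is a minimal illustration, not the patented implementation: the function name `generate_corpus`, all sizes and hyperparameter values, the symmetric word prior `beta`, and the choice to keep one feature-value distribution per community are assumptions made for the example. Because S25 writes t ~ Beta(τ), the sketch draws timestamps from a Beta density on (0, 1).

```python
import numpy as np

def generate_corpus(n_users=6, n_communities=2, n_topics=3, n_sentiments=2,
                    vocab_size=20, feature_sizes=(3, 4), docs_per_user=2,
                    words_per_doc=8, lam=0.5, gamma=1.0, alpha=0.5, mu=0.5,
                    beta=0.1, seed=0):
    """Draw a toy corpus following the S21-S27 generative story (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    # S21: feature-value distributions psi_j (assumed: one row per community).
    psi = [rng.dirichlet(np.full(k, lam), size=n_communities) for k in feature_sizes]
    # S22: community distribution pi over all users.
    pi = rng.dirichlet(np.full(n_communities, gamma))
    # S23: topic distribution theta_c for each community.
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_communities)
    # S24: sentiment distribution phi_z for each topic.
    phi = rng.dirichlet(np.full(n_sentiments, mu), size=n_topics)
    # S25: per-topic time parameters (assumed: two Beta shape parameters).
    tau = rng.uniform(1.0, 5.0, size=(n_topics, 2))
    # S26: word distribution for each (topic, sentiment) pair.
    word_dist = rng.dirichlet(np.full(vocab_size, beta), size=(n_topics, n_sentiments))

    users = []
    for _ in range(n_users):
        c = int(rng.choice(n_communities, p=pi))                 # S27 a)
        feats = [int(rng.choice(p.shape[1], p=p[c])) for p in psi]  # S27 b)
        docs = []
        for _ in range(docs_per_user):
            words = []
            for _ in range(words_per_doc):                       # S27 c)
                z = int(rng.choice(n_topics, p=theta[c]))        # i. topic
                s = int(rng.choice(n_sentiments, p=phi[z]))      # ii. sentiment
                t = float(rng.beta(tau[z, 0], tau[z, 1]))        # iii. timestamp
                w = int(rng.choice(vocab_size, p=word_dist[z, s]))  # iv. word
                words.append((w, z, s, t))
            docs.append(words)
        users.append({"community": c, "features": feats, "docs": docs})
    return users
```

Calling `generate_corpus()` returns one dictionary per user holding the sampled community, the feature values, and the documents as lists of `(word, topic, sentiment, timestamp)` tuples.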
Step 3. Parameter estimation
Once the process by which the model generates documents is known, the parameters of the model can be derived from it. The model parameters give, for a batch of training texts, the community distribution, the user feature distribution of each community, the topic distribution, the sentiment distribution of each topic, the time distribution of each topic, and the word distribution of each topic-sentiment pair. The parameters can also be used to predict the topic distribution and sentiment label of a document: to predict the topic and sentiment label of a test (unseen) text, they are inferred from the model parameters together with the text's user features, timestamp, and content. The steps for estimating the model parameters are described first, followed by the steps for predicting the topic and sentiment label of a test (unseen) text. The main flow is shown in Figure 2.
Because the Gibbs sampling algorithm is simple and effective, it is used here to estimate the parameters of the MTSM model; the detailed procedure is given in Algorithm 1. The derivation of the parameters divides into two main steps. The first estimates the community probability distribution parameters of the document authors and thereby discovers the user communities. The second, once each user's community is known, performs topic and sentiment detection on the documents written by that user. Gibbs sampling repeatedly resamples every user and every word in the documents according to the formulas, inferring the community each user may belong to and the topic and sentiment label each word may belong to, until convergence; the model parameters are then known.
The flow chart of the parameter estimation algorithm for the multi-dimensional topic-sentiment joint model is shown in Figure 3.
S31. Estimate the community probability distribution parameters of the document authors:
In the community discovery step, the probability that user u_r, with the given feature labels, belongs to community c is given by the following formula:
where the two terms are computed as:
In these formulas, the first count is the number of users, other than user u_r, that belong to community c, and the second count is the number of times feature value k_j of the j-th feature dimension occurs among all users of community c other than user u_r;
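The two counts defined for the community discovery step can be combined as in the following sketch. The patent's own formula is not reproduced in this text, so the form used here is an assumption: the usual collapsed Gibbs conditional with symmetric Dirichlet priors γ (communities) and λ (feature values). The function name `community_posterior` and its argument layout are hypothetical.

```python
import numpy as np

def community_posterior(user_counts, feature_counts, user_features,
                        gamma=1.0, lam=0.5):
    """Normalized probability of each community for one user.

    user_counts:    shape (C,), users per community, excluding the current user.
    feature_counts: list over feature dimensions j of arrays shaped (C, K_j),
                    occurrences of each feature value per community,
                    excluding the current user.
    user_features:  observed feature value k_j for each dimension j.
    """
    user_counts = np.asarray(user_counts, dtype=float)
    n_communities = user_counts.shape[0]
    # Community term: smoothed share of the other users in each community.
    prob = (user_counts + gamma) / (user_counts.sum() + n_communities * gamma)
    for counts_j, k_j in zip(feature_counts, user_features):
        counts_j = np.asarray(counts_j, dtype=float)
        n_values = counts_j.shape[1]
        # Feature term: smoothed frequency of value k_j inside each community.
        prob = prob * (counts_j[:, k_j] + lam) / (counts_j.sum(axis=1) + n_values * lam)
    return prob / prob.sum()
```

A Gibbs sweep would subtract the current user from the counts, draw a community from the returned vector, and add the user back under the new assignment.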
S32. After community discovery is complete, topic sampling and sentiment sampling are performed on each user-generated document according to the text content, sentiment label, and timestamp of the training documents;
For a document d_i whose sentiment label s_i is known and whose author u_r belongs to a known community, the probability that its word w_{i,n} belongs to a given topic and sentiment is:
where the parameters are:
where the three counts are, respectively: the number of words assigned to topic z in all documents of all users of community c, excluding the n-th word w_{i,n} of document d_i; the number of words assigned to sentiment s under topic z in all documents, excluding the word w_{i,n}; and the number of times word w is assigned to that topic and that sentiment, excluding the word w_{i,n};
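The three counts just defined suggest a conditional of the following shape for the topic of a training word whose sentiment is fixed by the document label. The exact formula is not reproduced in this text, so this is a sketch under assumptions: symmetric priors α, μ, and a word prior named `beta` here, with the time-density factor left out for brevity. The function name `topic_posterior` is hypothetical.

```python
import numpy as np

def topic_posterior(n_cz, n_zs, n_zsw, c, s, w, alpha=0.1, mu=0.1, beta=0.01):
    """Normalized probability of each topic for one word.

    n_cz:  shape (C, T), words of community c assigned to topic z.
    n_zs:  shape (T, S), words assigned to sentiment s under topic z.
    n_zsw: shape (T, S, V), occurrences of word w under topic z and sentiment s.
    All counts exclude the word currently being resampled.
    """
    n_cz = np.asarray(n_cz, dtype=float)
    n_zs = np.asarray(n_zs, dtype=float)
    n_zsw = np.asarray(n_zsw, dtype=float)
    n_topics, n_sentiments = n_zs.shape
    vocab_size = n_zsw.shape[2]
    # Topic given the author's community.
    community_term = (n_cz[c] + alpha) / (n_cz[c].sum() + n_topics * alpha)
    # Known sentiment s given each candidate topic.
    sentiment_term = (n_zs[:, s] + mu) / (n_zs.sum(axis=1) + n_sentiments * mu)
    # Observed word w given each candidate topic and the sentiment s.
    word_term = (n_zsw[:, s, w] + beta) / (n_zsw[:, s, :].sum(axis=1) + vocab_size * beta)
    prob = community_term * sentiment_term * word_term
    return prob / prob.sum()
```

Because `s` is held at the document's label, only the counts for that sentiment enter the word term, which matches the supervised update described for training documents.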
S33. The sentiment label s_i of a document is a known parameter. Therefore, when sampling the training documents, the words of each document update and sample only the parameters associated with sentiment s_i under each topic, so that the sentiment labels provide supervised training;
The time-distribution parameters are updated using the method of moments, computed as follows:
where the two quantities are, respectively, the mean and the standard deviation of the timestamps of all words assigned to topic z.
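A moment-based update from the timestamp mean and standard deviation can be sketched as follows. The patent's own formula is not reproduced in this text; the sketch assumes that the time distribution of a topic is a Beta density over timestamps normalized to (0, 1), as the Beta(τ) notation in S25 suggests, and uses the standard moment-matching estimates for the two Beta shape parameters. The function name `beta_moment_estimates` is hypothetical.

```python
import numpy as np

def beta_moment_estimates(timestamps):
    """Method-of-moments estimates of Beta shape parameters (a, b).

    timestamps: normalized timestamps in (0, 1) of all words currently
                assigned to one topic; they must not all be identical.
    """
    t = np.asarray(timestamps, dtype=float)
    mean = t.mean()
    var = t.var()  # square of the standard deviation
    # Shared factor from matching the Beta mean and variance to the sample.
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common
```

For example, timestamps `[0.2, 0.4, 0.6, 0.8]` have mean 0.5 and variance 0.05, which gives the symmetric pair a = b = 2.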
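Step 4. Perform topic mining and sentiment prediction on the test documents: once the model parameters have been obtained, topic mining and sentiment prediction on a test document proceed in two steps, community discovery and word sampling of the document. These two steps are iterated until convergence, giving new parameters based on both the training documents and the test documents, from which the topics are mined and the sentiment is predicted.
The flow chart of the prediction algorithm for the multi-dimensional topic-sentiment model is shown in Figure 4.
Step S4 specifically includes: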
S41. Suppose that, for a document d_test, the user who generated it is known to be u_test, the feature labels of that user are known, and the timestamp of the document is t_test. The sentiment label of the document is then calculated according to the following formula:
where the probability of the community to which the user belongs is calculated according to the following formula:
The above formula simplifies to:
where the feature term refers to the user's feature value in the j-th feature dimension. The parameters in the above formula are calculated as:
where the counts are defined as follows. The first is the number of words assigned to topic z in all documents of all users of community c, across the test set and the training set, excluding the n-th word w_{test,n} of document d_test. The second is the number of words assigned to sentiment s under topic z in all documents of the test set and the training set, excluding the word w_{test,n}. The third is the number of times word w is assigned to that topic and that sentiment across the test set and the training set, excluding the word w_{test,n}. The timestamp mean and standard deviation are replaced by their counterparts computed over all words assigned to topic z in both the training set and the test set. The last count is the number of users belonging to community c, other than u_test, the author of document d_test;
S42. Because the sentiment label s_test of the document is an unknown parameter, when sampling a test document the words of each document must update and sample the parameters associated with every sentiment polarity under each topic, in order to determine which sentiment category the document most probably belongs to.
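One simple way to turn the per-word sentiment assignments from the final sampling sweep into a document-level label is to take the most frequent sentiment. The patent's own scoring formula is not reproduced in this text, so this majority rule and the function name `predict_document_sentiment` are assumptions made for illustration only.

```python
from collections import Counter

def predict_document_sentiment(word_sentiments, n_sentiments):
    """Return (label, shares) from the sampled sentiment of each word.

    word_sentiments: sentiment index sampled for every word of the test document.
    n_sentiments:    number of sentiment categories in the model.
    """
    counts = Counter(word_sentiments)
    total = len(word_sentiments)
    # Share of the document's words assigned to each sentiment category.
    shares = [counts.get(s, 0) / total for s in range(n_sentiments)]
    # The predicted label is the category with the largest share.
    label = max(range(n_sentiments), key=shares.__getitem__)
    return label, shares
```

Averaging the shares over several sweeps after convergence, rather than using a single sweep, would reduce sampling noise at the cost of extra iterations.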
Obviously, the above embodiments of the present invention are merely examples given to illustrate the invention clearly and do not limit its embodiments. Those of ordinary skill in the art can make other changes or modifications in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the claims of the present invention.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910218584.2A (CN109933657B) | 2019-03-21 | 2019-03-21 | A sentiment analysis method for topic mining based on user feature optimization |
| Publication Number | Publication Date |
|---|---|
| CN109933657A | 2019-06-25 |
| CN109933657B | 2021-07-09 |
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |