CN105335352A

Info

Publication number: CN105335352A
Application number: CN201510864383.1A
Authority: CN
Inventors: 崔晓辉; 朱卫平; 张威风; 杨威; 王志波; 李伟
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2015-11-30
Filing date: 2015-11-30
Publication date: 2016-02-17

Abstract

The invention provides an entity recognition technique based on Weibo emotion. Weibo data are acquired through API collection and preprocessed, with the Circumplex emotion model used as the emotion analysis model and four emotion keyword dictionaries generated from it. The data set is vectorized, four machine learning algorithms are trained on it and evaluated by five-fold cross-validation, and the best-performing classifier is selected to classify new data. Finally, entity extraction is performed on the classified data.

Description

Translated from Chinese

An Entity Recognition Method Based on Weibo Emotion

Technical field

The invention relates to the collection and analysis of big data on the network, and in particular to an entity recognition method based on Weibo (microblog) emotion.

Background

In China, Weibo is a new social media platform that has developed only in recent years, so domestic research on sentiment analysis of short Weibo texts started relatively late. In early work, Ye Qiang, Zhang Ziqiong and Luo Zhenxiong extracted features of Chinese phrases on the basis of the widely used N-POS language model and proposed 2-POS, a Chinese two-word subjective phrase model, which laid a foundation for emotion recognition in Chinese text. Xu Jun then applied machine learning methods such as naive Bayes and maximum entropy to text sentiment classification; the results showed that machine learning methods achieve satisfactory results in sentiment-based classification of Chinese text, with accuracy above 90%. For movie reviews, Hu Yi studied sentiment classification with an N-Gram language model, naive Bayes and support vector machines (SVM), and found that when training samples are limited the N-Gram language model classifies more accurately and scales well. Building on these studies, research on sentiment-based text mining has kept growing and its scope has widened. For example, Pang Lei et al. used three classifiers, naive Bayes, SVM and maximum entropy, to classify stock comments on Sina Weibo as bullish (positive) or bearish (negative). Fu Xianghua, Sun Xian and Feng Shi studied sentiment analysis of Chinese blogs from different angles: a multi-aspect topic sentiment mining method for Chinese blogs based on a document topic generation model and the HowNet dictionary was proposed; dictionary-statistics-based sentiment analysis was introduced into Weibo sentiment analysis; and SOAD (sentiment orientation analysis based on syntactic dependency), an algorithm based on syntactic dependency analysis, was proposed to analyze the sentiment orientation of blog search results.

In general, with the continuing development of the Internet, many foreign scholars have in recent years begun to study sentiment mining in a wider range of fields, including travel blogs, legal blogs, and film and television reviews. Sentiment mining aims to extract positive or negative attitudes from consumers' comments on specific products or services by means of dedicated classification methods. From the classification results, consumers can obtain the information they need to make purchase decisions, and businesses can learn how users respond and how competitors perform. With the widespread use of computer technology, sentiment mining of review content has become a recent research trend and is applied in many fields.

Named entity recognition (NER), also called entity recognition, refers to identifying entities with specific meaning in a string of text, mainly names of persons, places and institutions, and other proper nouns. In recent years, with the rapid development of information retrieval and search engine technology, Chinese named entity recognition has become a hot topic in natural language processing research. According to the state of domestic research, there are currently four main approaches to Chinese named entity recognition: statistics-based methods, rule-based methods, methods combining rules and statistics, and machine-learning-based methods.

(1) Statistics-based methods

The statistical models used for Chinese named entity recognition mainly include the hidden Markov model, the decision tree model, the support vector machine model, the maximum entropy model and the conditional random field model. Asahara used support vector machines to automatically recognize Chinese personal names and organization names and obtained fairly good results.

(2) Rule-based methods

Rule-based named entity recognition mainly uses two kinds of information: restrictive components and the words used in named entities. Tan adopted a transformation-based error-driven method to learn contextual rules for place names and then used these rules to recognize Chinese place names automatically; tests on data showed that the accuracy of this method can reach 97%.

(3) Methods combining rules and statistics

Some current mainstream systems for automatic recognition of Chinese named entities combine rules with statistics: they first identify candidate entities by statistical methods and then correct and filter them with rules. Huang Degen used a large amount of statistical data drawn from a large body of real text to compute, for each personal name, its continuation word-formation reliability and its word-formation reliability, and then combined these with certain rules to recognize Chinese personal names automatically.

(4) Machine-learning-based methods

Named entity recognition in English is much simpler than in Chinese, because English does not require word segmentation, whereas the accuracy of word segmentation is a key factor affecting Chinese named entity recognition. English named entity recognition is relatively mature: classifying English words with support vector machines can achieve more than 99% accuracy in recognizing place names and personal names.

As a major media form among social networking sites, Weibo is increasingly popular. People tend to obtain news, commentary, entertainment and other information from Weibo, and its influence on the spread of online public opinion has steadily grown. Weibo posts carry emotional features of different orientations, and mining these features is of great significance for public opinion monitoring, marketing and rumor control. Most sentiment analysis merely divides text sentiment into three classes: positive, neutral and negative. Applying such coarse-grained sentiment analysis directly to a social medium such as Weibo offers limited help to understanding and falls short of truly listening to the pulse and the emotions of society.

Summary of the invention

To address the deficiencies of the prior art, the present invention provides an entity analysis technique based on Weibo emotion. The invention offers high recognition accuracy and fast processing, and is suitable for accurate recognition on large-scale data.

To achieve the above object, the present invention adopts the following technical solution: an entity recognition method based on Weibo emotion, comprising the following steps:

Step 1. Training phase: select the optimal machine learning algorithm.

Step 1.1 Construct four emotion word dictionaries according to the Circumplex emotion model.

The four emotion word dictionaries are mapped into a two-dimensional coordinate system, whose four quadrants are, respectively: happy and active, happy but inactive, unhappy but active, and unhappy and inactive.

Step 1.2 Using web API collection, obtain Weibo data from Weibo with the four classes of emotion words as keywords, to serve as training data.

Step 1.3 Preprocess the collected training data to generate a normalized training data set.

Step 1.4 Extract keywords from the training data and vectorize the training data set according to the vector space model.

Punctuation marks and emoticons are likewise treated as tokens and vectorized, which allows the sentiment of the text to be analyzed more effectively and accurately. To vectorize them, emoticons and punctuation marks are first replaced with corresponding English words, and those words are then vectorized; for example, a smiley face is replaced with "happy", whose word vector is (1, 0, 0, 1, 1, 2).

Step 1.5 Using each of the preset machine learning algorithms, perform sentiment classification with 5-fold cross-validation on the vectorized training data set.

Step 1.6 Compute the accuracy and recall of each machine learning algorithm over the 5 cross-validation rounds, and select the algorithm with the highest average accuracy and recall as the optimal machine learning classification algorithm.

Step 2. Experimental phase: use the optimal machine learning classification algorithm obtained in step 1 to obtain the recognized emotion entities.

Step 2.1 Obtain a vectorized experimental data set by the same method as steps 1.1 to 1.4 of step 1.

Step 2.2 Classify the experimental data set with the optimal machine learning classification algorithm obtained in step 1, yielding four emotion data sets.

Step 2.3 Perform entity extraction once on each of the four emotion data sets to obtain the recognized emotion entities.

Further, the preprocessing in step 1.3 comprises correcting erroneous words, deleting irrelevant words, deleting ambiguous Weibo posts, and synonym conversion. Correcting erroneous words means correcting misspelled words; deleting irrelevant words means deleting words that contribute nothing to sentiment analysis; deleting ambiguous posts means removing posts whose single text belongs to more than one emotion category; synonym conversion means replacing words of the same meaning with a single word.

Preferably, in step 1.4 the TF-IDF algorithm is used to extract keywords; if the text contains emoticons and punctuation marks, commonly used emoticons and punctuation marks that express tone are converted into corresponding words.

Preferably, in step 1.4 the open-source word2vec tool is used to build word vectors, and the training data set is vectorized according to the vector space model.

Preferably, in step 2.3 the SENNA deep learning toolkit is used to perform entity extraction once on each of the four emotion data sets.

Preferably, in step 1.5 the preset machine learning algorithms comprise four algorithms: naive Bayes, logistic regression, support vector machine and K-nearest neighbor.

The invention performs classification and entity recognition through machine learning and deep learning, achieving finer-grained entity recognition of Weibo emotion with high accuracy and good results. It yields the following benefits:

1. After the data are processed and analyzed, finer-grained sentiment analysis can be carried out;

2. The resulting fine-grained sentiment analysis can reflect the emotional state of the Weibo user population;

3. It helps governments, organizations and individuals understand and grasp social emotions.

Description of drawings

Fig. 1 is a flow chart of the present invention.

Detailed description

To make the technical means, inventive features, objectives and effects of the present invention easy to understand, the invention is further described below with reference to specific embodiments.

The volume of Weibo data is very large, and classifying it manually would consume a great deal of manpower, material and financial resources. The hashtag (topic tag) attached to a Weibo post is therefore used as that post's emotion label: if a post is marked with the hashtag of an emotion category, the post is considered to belong to that category.
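The hashtag labelling rule above can be sketched roughly as follows. This is a minimal illustration only: the dictionary contents, the category names and the function `label_by_hashtag` are hypothetical, since the text does not list the actual emotion words, and treating posts tagged with more than one category as unlabelled is an assumption that anticipates the removal of ambiguous posts during preprocessing.

```python
import re

# Hypothetical seed dictionaries; the patent does not list the actual words.
EMOTION_HASHTAGS = {
    "happy_active": {"excited", "happy"},
    "happy_inactive": {"calm", "relaxed"},
    "unhappy_active": {"angry", "afraid"},
    "unhappy_inactive": {"sad", "bored"},
}

HASHTAG_RE = re.compile(r"#(\w+)")


def label_by_hashtag(text):
    """Return the single emotion category tagged in `text`, or None.

    None is returned both when no known hashtag is present and when
    hashtags from more than one category appear (an ambiguous post).
    """
    tags = {t.lower() for t in HASHTAG_RE.findall(text)}
    matched = {cat for cat, words in EMOTION_HASHTAGS.items() if tags & words}
    return matched.pop() if len(matched) == 1 else None
```

Posts for which the function returns `None` would simply be left out of the training data in this sketch.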

An entity recognition method based on Weibo emotion comprises the following steps:

Step 1.1 Construct four emotion word dictionaries according to the Circumplex emotion model. The four emotion word dictionaries are mapped into a two-dimensional coordinate system, whose four quadrants are, respectively: happy and active, happy but inactive, unhappy but active, and unhappy and inactive.
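One way to picture the two-dimensional mapping is as a pair of scores, one for the happy/unhappy axis and one for the active/inactive axis. The sketch below assumes such numeric scores exist and are centred on zero; the names `valence` and `arousal`, the function name and the handling of zero are all assumptions, as the text only names the four resulting categories.

```python
def circumplex_category(valence, arousal):
    """Map a (valence, arousal) pair to one of the four emotion categories.

    Assumption: both scores are centred on 0, with positive valence meaning
    "happy" and positive arousal meaning "active"; zero counts as positive.
    """
    mood = "happy" if valence >= 0 else "unhappy"
    energy = "active" if arousal >= 0 else "inactive"
    return f"{mood}_{energy}"
```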

Step 1.3 Preprocess the collected training data to generate a normalized training data set. The preprocessing comprises correcting erroneous words, deleting irrelevant words, deleting ambiguous data, and synonym conversion.

Correcting erroneous words means correcting misspelled words, for example correcting "eta" to "eat". Deleting irrelevant words means deleting words that contribute nothing to sentiment analysis, such as "the" and "of", which carry no real meaning. Deleting ambiguous Weibo posts means removing posts whose single text belongs to more than one emotion category. Synonym conversion means replacing words of the same meaning with a single word.
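The word-level preprocessing steps might be sketched as below. The three lookup tables are hypothetical examples built around the "eta"/"eat" and "the"/"of" examples in the text, lower-casing and whitespace tokenisation are simplifying assumptions, and removal of ambiguous posts is assumed to happen separately at the post level.

```python
# All three tables are hypothetical examples, not the patent's actual lists.
SPELLING_FIXES = {"eta": "eat"}
STOP_WORDS = {"the", "of"}
SYNONYMS = {"glad": "happy", "joyful": "happy"}


def preprocess(text):
    """Correct spelling, drop irrelevant words, and unify synonyms, in that order."""
    tokens = text.lower().split()
    tokens = [SPELLING_FIXES.get(t, t) for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return tokens
```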

Step 1.4 Extract keywords from the training data using the TF-IDF algorithm. If the text contains emoticons and punctuation marks, commonly used emoticons and punctuation marks that express tone are converted into corresponding words.

The open-source word2vec tool is used to build word vectors, and the training data set is vectorized according to the vector space model. The vectorization covers not only words but also punctuation marks and emoticons.
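A rough sketch of converting emoticons and tone-bearing punctuation into words before vectorization is given here. Only the smiley-to-"happy" replacement comes from the text; the other entries of `SYMBOL_WORDS` and the function name are illustrative assumptions.

```python
# Hypothetical mapping; "happy" for a smiley follows the example in the text.
SYMBOL_WORDS = {":)": "happy", ":(": "sad", "!": "exclaim", "?": "question"}


def replace_symbols(text):
    """Replace known emoticons and punctuation with words, then tokenise."""
    # Replace longer symbols first so ":)" is not broken up by shorter keys.
    for symbol in sorted(SYMBOL_WORDS, key=len, reverse=True):
        text = text.replace(symbol, f" {SYMBOL_WORDS[symbol]} ")
    return text.split()
```

After this replacement the symbols are ordinary tokens, so they can be given word vectors in the same way as any other word.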

The vector space model is a classic text feature model, proposed by Salton et al. in the 1960s and successfully applied in the SMART text retrieval system.

Building word vectors: a word vector is a vector that represents a word; for example, "happy" can be represented by the vector (0, 1, 3, 4, 1, 1).

Word2vec is an efficient tool, open-sourced by Google in mid-2013, for representing words as real-valued vectors. We use this tool to represent each word as a vector.

Vectorization of the data set: keywords are extracted from each data item, here using the well-established TF-IDF algorithm to generate a set of keywords, and the keywords are then converted into word vectors. This set of word vectors represents the data item. For example, from the item "I want to go home" the three keywords "I", "go" and "home" can be extracted; if their word vectors are (1, 0, 1, 0, 1, 3), (0, 1, 2, 3, 0, 0) and (1, 1, 3, 2, 1, 6), then these three vectors represent the item.
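The keyword-then-vector procedure can be illustrated with a small pure-Python TF-IDF ranking followed by a vector lookup. This is a sketch under assumptions: the plain `tf * log(N / df)` weighting is one common TF-IDF variant, the lookup table `word_vectors` stands in for vectors that would really come from a trained word2vec model, and both function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each tokenised doc."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by score (descending), then alphabetically for a stable order.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_k])
    return result


def vectorize(keywords, word_vectors):
    """Represent one post as the list of vectors of its known keywords."""
    return [word_vectors[w] for w in keywords if w in word_vectors]
```

Words that occur in every document receive a score of zero under this weighting, which is why very common words tend not to be chosen as keywords.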

5-fold cross-validation: the data set is randomly divided into 5 equal parts, of which 4 are used as the training set and 1 as the test set. The machine learning algorithm is trained on the training set; once training is complete, the algorithm yields a decision function, which is applied to the held-out test set, and the classification accuracy and recall are computed. The process is repeated 5 times, so that each part serves once as the test set.
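The cross-validation and selection procedure might look roughly like the following sketch. Several details are assumptions not stated in the text: recall is averaged over the classes present in each test fold, the selection score is the plain mean of average accuracy and average recall, and the `train_fn(X_train, y_train)` interface (returning a prediction function) is a hypothetical convention. The data set needs at least five samples so that no fold is empty.

```python
import random


def five_fold_splits(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs for 5-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::5] for i in range(5)]
    for k in range(5):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


def accuracy_and_recall(y_true, y_pred):
    """Return accuracy and the recall averaged over the classes in y_true."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    recalls = []
    for label in set(y_true):
        relevant = [p for t, p in zip(y_true, y_pred) if t == label]
        recalls.append(relevant.count(label) / len(relevant))
    return correct / len(y_true), sum(recalls) / len(recalls)


def cross_validate(train_fn, X, y, seed=0):
    """Average accuracy and recall of `train_fn` over the five folds.

    `train_fn(X_train, y_train)` must return a function mapping one sample
    to a predicted label.
    """
    scores = []
    for train, test in five_fold_splits(len(X), seed):
        predict = train_fn([X[i] for i in train], [y[i] for i in train])
        preds = [predict(X[i]) for i in test]
        scores.append(accuracy_and_recall([y[i] for i in test], preds))
    mean_acc = sum(s[0] for s in scores) / 5
    mean_rec = sum(s[1] for s in scores) / 5
    return mean_acc, mean_rec


def select_best(candidates, X, y):
    """Return the candidate name whose mean of accuracy and recall is highest."""
    def score(name):
        acc, rec = cross_validate(candidates[name], X, y)
        return (acc + rec) / 2
    return max(candidates, key=score)
```

Here `candidates` would map an algorithm name to its training function, so that the four preset algorithms can be compared on identical folds.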

This method presets four machine learning algorithms, as follows:

1. Naive Bayes

The basic principle of naive Bayes is as follows: for a given data item awaiting classification, compute the probability of each category given that this data item has occurred. This probability is usually called the posterior probability; the data item is assigned to the category whose posterior probability is largest.

The formula is as follows:

$p(C_k \mid x) = \frac{p(C_k)\, p(x \mid C_k)}{p(x)}$

Formula description: the probability of event C_k is p(C_k); the probability of event x is p(x); the probability of event x given that C_k has occurred is p(x|C_k); and the probability of C_k given that x has occurred is p(C_k|x).

The program logic is as follows: C_k denotes a category and x the data item to be classified. For a fixed number of categories, p(C_k) is fixed; here it is 0.25 (1/4) for each of the four categories. For a given data item, p(x) is also fixed, so it suffices to find the category with the largest p(x|C_k) in order to find the largest p(C_k|x). p(x|C_k) is the probability that x occurs in class C_k and is estimated from the training set; for example, if C_k contains 100 items in the training set and x accounts for 10 of them, the probability is 0.1.
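The naive Bayes logic can be sketched over tokenised posts as follows. This is an illustration rather than the patented implementation: it uses the multinomial word-count form, estimates p(C_k) from the training labels instead of fixing it at 0.25, and adds Laplace (add-one) smoothing, none of which is specified in the text. The function name is hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Fit a multinomial naive Bayes model on tokenised documents.

    Laplace (add-one) smoothing is an addition of this sketch so that unseen
    words do not zero out a class; the patent text does not mention it.
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    vocab = {w for doc in docs for w in doc}

    def predict(doc):
        best_label, best_score = None, -math.inf
        for label, n_docs in class_counts.items():
            total = sum(word_counts[label].values())
            # log p(C_k) + sum of log p(word | C_k); p(x) is the same for
            # every class, so it can be dropped from the comparison.
            score = math.log(n_docs / len(labels))
            for w in doc:
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return predict
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.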

2. Logistic regression

Logistic regression resembles many other regression analyses, including multiple linear regression; these regression models all belong to the family of generalized linear models. Within this family, the regression analyses differ mainly in their dependent variables. Constructing a logistic regression involves the following key steps:

① Establish a prediction function, which gives the probability that a given event occurs.

② Construct the logistic function, i.e. the sigmoid function. Because the prediction function is an approximate probability function obtained from the training data, its value may fall below 0; the logistic function is therefore introduced, which maps any number from negative infinity to positive infinity into the interval (0, 1).

③ Obtain the regression parameters by gradient descent. In the training phase of the logistic regression classifier, the likelihood function is derived from the form of the logistic function constructed above; the parameters are usually estimated by maximum likelihood, with gradient descent used to find their optimal values.

The program logic is as follows: let the feature values of the data set be X1, X2, X3, ... with corresponding weights W1, W2, W3, ..., and let Z = W1×X1 + W2×X2 + W3×X3 + .... The sigmoid function then maps the result into the interval (0, 1): p = sigmoid(Z) = 1/(1 + exp(-Z)). Gradient descent on the training data is then used to find the maximum-likelihood value of each weight. Once the weights are known, the expression of the function is determined, the probability of each class can be computed, and new data can be classified.
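A minimal sketch of the logistic regression steps above, for the two-class case, is given below. The bias term `b`, the learning rate, the epoch count and the function names are assumptions added for illustration; handling four emotion classes would need an extension such as one-vs-rest, which the text does not describe.

```python
import math


def sigmoid(z):
    """Map any real z into the open interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # avoids overflow for large negative z
    return ez / (1.0 + ez)


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by gradient descent on the negative log-likelihood."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the loss with respect to Z
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(w, b, x):
    """Probability that sample x belongs to the class labelled 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

Labels are expected to be 0 or 1, and a sample would be assigned to class 1 when `predict_proba` exceeds 0.5.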

3. Support vector machine

The support vector machine is a supervised learning algorithm that is very widely used in statistical regression. A support vector machine can construct one or more hyperplanes in a very high-dimensional space for the training data, so that data that are inseparable in low dimensions become separable in high dimensions. In text classification, the support vector machine is one of the best classification algorithms.

The program logic is as follows: the main purpose of training a support vector machine is to find the equation of the hyperplane that separates the two classes. Let the equation be W^T X + b = 0, where W and X denote a matrix and a vector respectively; here X is a word vector. A slack factor and a penalty factor are introduced, and the method of Lagrange multipliers is used to find the optimal separating plane. Once the plane function is obtained, other vectors X can be classified.
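For reference, the standard soft-margin formulation that matches this description can be written out as below. The symbols $y_i \in \{-1, +1\}$ (class labels), $\xi_i$ (slack variables), $C$ (penalty factor) and $\alpha_i$ (Lagrange multipliers) are notation introduced here; the text names the slack and penalty factors but does not give the equations, so this is the textbook form rather than a formula taken from the patent.

```latex
% Primal problem: maximise the margin while penalising slack
\min_{W,\, b,\, \xi} \; \frac{1}{2}\lVert W \rVert^{2} + C \sum_{i} \xi_i
\quad \text{s.t.} \quad y_i \left( W^{T} X_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0

% Dual problem obtained with Lagrange multipliers \alpha_i
\max_{\alpha} \; \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, X_i^{T} X_j
\quad \text{s.t.} \quad 0 \le \alpha_i \le C, \qquad \sum_{i} \alpha_i y_i = 0

% Recovered hyperplane and decision rule for a new vector X
W = \sum_{i} \alpha_i \, y_i \, X_i, \qquad
f(X) = \operatorname{sign}\left( W^{T} X + b \right)
```

This formulation separates two classes; distinguishing the four emotion categories would require combining several such classifiers, for example one per category.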

4. K-nearest neighbor algorithm

The K-nearest neighbor algorithm is one of the most mature machine learning algorithms and also one of the simplest. Its basic idea is that, within the given data, if most of the K data points nearest to a sample in the feature vector space belong to one category, the sample is assigned to that category.

The program logic is as follows: the training vectors of the training set are projected into an N-dimensional space. For a new data vector X, the n points nearest to X are found (n playing the role of K); if category A is the most frequent among these n points, the new data item belongs to category A.
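The nearest-neighbor vote can be sketched in a few lines. Euclidean distance and the default `k=3` are assumptions, since the text does not name a distance measure or a value of K; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, x, k=3):
    """Label x by majority vote among its k nearest training vectors."""
    by_distance = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: math.dist(pair[0], x),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]
```

An odd `k` reduces the chance of tied votes when only two categories are close to the query point.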

Step 2.3 Use the SENNA deep learning toolkit to perform entity extraction once on each of the four emotion data sets.

The above describes the basic principle and main implementation of the present invention. The invention can extract Weibo content, apply deep learning to big data, improve the accuracy of sentiment analysis, and recognize Weibo emotion entities. It helps governments, organizations and institutions study the emotions of the general public, and is of considerable use in public opinion analysis, the handling of mass incidents, and event early warning.

Claims

1. An entity recognition method based on Weibo emotion, characterized in that it comprises the following steps:

Step 1. Training phase: select the optimal machine learning algorithm;

Step 1.1 Construct four emotion word dictionaries according to the Circumplex emotion model; the four emotion word dictionaries are mapped into a two-dimensional coordinate system, whose four quadrants are, respectively: happy and active, happy but inactive, unhappy but active, and unhappy and inactive;

Step 1.2 Using web API collection, obtain Weibo data from Weibo with the four classes of emotion words as keywords, to serve as training data;

Step 1.3 Preprocess the collected training data to generate a normalized training data set;

Step 1.4 Extract keywords from the training data and vectorize the training data set according to the vector space model;

Step 1.5 Using each of the preset machine learning algorithms, perform sentiment classification with 5-fold cross-validation on the vectorized training data set;

Step 1.6 Compute the accuracy and recall of each machine learning algorithm over the 5 cross-validation rounds, and select the algorithm with the highest average accuracy and recall as the optimal machine learning classification algorithm;

Step 2. Experimental phase: use the optimal machine learning classification algorithm obtained in step 1 to obtain the recognized emotion entities;

Step 2.1 Obtain a vectorized experimental data set by the same method as steps 1.1 to 1.4 of step 1;

Step 2.2 Classify the experimental data set with the optimal machine learning classification algorithm obtained in step 1, yielding four emotion data sets;

Step 2.3 Perform entity extraction once on each of the four emotion data sets to obtain the recognized emotion entities.

2. The entity recognition method based on Weibo emotion according to claim 1, characterized in that the preprocessing in step 1.3 comprises correcting erroneous words, deleting irrelevant words, deleting ambiguous Weibo posts, and synonym conversion; correcting erroneous words means correcting misspelled words; deleting irrelevant words means deleting words that contribute nothing to sentiment analysis; deleting ambiguous posts means removing posts whose single text belongs to more than one emotion category; synonym conversion means replacing words of the same meaning with a single word.

3. The entity recognition method based on Weibo emotion according to claim 1, characterized in that in step 1.4 the TF-IDF algorithm is used to extract keywords, and if the text contains emoticons and punctuation marks, commonly used emoticons and punctuation marks that express tone are converted into corresponding words.

4. The entity recognition method based on Weibo emotion according to claim 1, characterized in that in step 1.4 the open-source word2vec tool is used to build word vectors, and the training data set is vectorized according to the vector space model.

5. The entity recognition method based on Weibo emotion according to claim 1, characterized in that in step 2.3 the SENNA deep learning toolkit is used to perform entity extraction once on each of the four emotion data sets.

6. The entity recognition method based on Weibo emotion according to claim 1, characterized in that in step 1.5 the preset machine learning algorithms comprise four algorithms: naive Bayes, logistic regression, support vector machine and K-nearest neighbor.