CN107153658A


Info

Publication number: CN107153658A
Application number: CN201610123465.5A
Authority: CN
Inventors: 赵昕; 赵一昕; 李华康; 杨天若; 杨天楚
Original assignee: Changzhou City Bus Group Co ltd; Changzhou Pushi Information Technology Co ltd
Current assignee: Changzhou City Bus Group Co ltd; Changzhou Pushi Information Technology Co ltd
Priority date: 2016-03-03
Filing date: 2016-03-03
Publication date: 2017-09-12

Abstract

The invention discloses a hot word discovery method, specifically one based on a keyword weighting algorithm. The invention uses a Chinese word segmentation tool to perform preliminary segmentation of massive public opinion information and produce part-of-speech tags; then, combining an IDF table, a filter word table, and a part-of-speech weight table, it calculates the heat value of candidate words according to a weighted TF-IDF algorithm. In addition, the invention takes full account of the fact that public opinion titles in the self-media era state their themes clearly, and processes mainly the titles, which addresses the efficiency of hot word recognition over massive public opinion information. Finally, the IDF table is updated dynamically and incrementally, which keeps the inverse document frequency of words current and improves the accuracy of hot word identification.

Description

Translated from Chinese

A method for discovering public opinion hot words based on a keyword weighting algorithm

Technical field

The invention relates to a hot word discovery method, and in particular to a hot word discovery method based on a keyword weighting algorithm.

Technical background

With the spread and rapid development of the Internet, massive amounts of news data appear online every day. At the same time, the emergence of self-media such as microblogs, blogs, and forums has shifted the publishers of online information from professional news reporters to ordinary netizens from all walks of life, and the general public has changed from passive receivers of information into its disseminators. Internet language has become correspondingly richer, with new words such as "给力" (geili), "屌丝" (diaosi), and "躺枪" (tangqiang) emerging one after another. Under these circumstances, how to mine hot words from complex online information, and how to capture popular new terms and concepts so as to find hot topics effectively, has become both a focus and a difficulty in public opinion research.

Hot words are a lexical phenomenon that arose with the spread of the Internet. They usually reflect major events in society, or hot issues of public concern, within a certain period of time, and they form part of the Internet's hotspot information. Hot words are creative and sudden, and they cover the people and events that netizens or the media are currently paying attention to. For example, "青岛大虾" ("Qingdao prawns") originated from an incident during the National Day holiday in which prawns were exposed as being sold at the exorbitant price of "38 yuan each"; the term has since been used to satirize merchants who overcharge customers. Quickly identifying hot words therefore makes it possible to understand society and public sentiment quickly and accurately, and in turn to guide public opinion appropriately. In addition, for search services, identifying hot words effectively can increase website clicks and even profits. Put simply, hot word discovery is a text mining technique: it extracts the popular terms that appear in a given period of time from massive online information through preprocessing, feature extraction, and cluster analysis.

Hot word discovery mainly comprises four processes: corpus segmentation, noise word filtering, feature extraction, and hot word recognition.

The most basic and critical step in hot word discovery is corpus segmentation, that is, word segmentation. A notable difference between Chinese and English is that Chinese takes the character as its smallest unit and has no explicit boundaries between words; any adjacent characters may form a hot word. This makes Chinese processing considerably harder, so segmenting terms and determining word boundaries is crucial and strongly affects the accuracy of the subsequent noise word filtering and hot word recognition. Chinese word segmentation methods fall roughly into dictionary-matching methods and statistical methods. Dictionary-matching methods compare and match the text against a given segmentation dictionary and then resolve ambiguities. They are simple and efficient, but have difficulty recognizing words that are not registered in the dictionary. Statistical methods rely on statistics of characters and words and apply the co-occurrence information of adjacent characters to segmentation; they mainly include mutual information, the hidden Markov model (HMM), the conditional random field (CRF), and the maximum entropy model (ME). Compared with dictionary-based segmentation, these methods are slower but recognize unregistered (out-of-vocabulary) words better. In practice, most systems balance segmentation speed against accuracy and combine dictionary and statistical methods.

In hot word recognition, noise word filtering is also called stop word filtering. After the web text has been preprocessed, we obtain words tagged with their parts of speech. Many of these words carry no real meaning, and filtering targets two kinds: the first is frequently occurring function words such as modal particles, prepositions, and conjunctions, for example "的", "是", "了", and "吗"; the second is modifying adjectives, adverbs of degree, and frequently occurring numeral-quantifier combinations. Filtering significantly speeds up subsequent text processing and hot word recognition.

Text representation means expressing the content of a document in an accurate and simple form that a computer can process. Current text representation methods include the Boolean model, the vector space model, the probabilistic retrieval model, and the N-gram model. The most classic is the vector space model (VSM), which represents a text as a vector of feature items and their weights: each feature item is one dimension of the document's representation, and its weight reflects how important that feature item is to the document. In the vector space model, each document is represented in the following form:

v(D) = {w_1(d_1), w_2(d_2), ..., w_n(d_n)}

where D denotes the document, n denotes the total number of text feature items extracted during feature extraction, and w_j(d_j) denotes the weight of the j-th text feature item in document D.

Hot word recognition relies on the calculation of feature weights in the VSM. There are three ways to calculate the weights. The first is the binary method: a feature item is marked 1 if it appears in the document and 0 otherwise. The second uses the frequency with which the feature item appears in the document as its weight. Neither of these considers the importance of the feature item within the corpus, so the third, the classic TF-IDF method, is the more reasonable choice for feature weights. TF-IDF is a weighted statistical technique commonly used in information retrieval that reflects how important a feature item is to one document in a corpus. The weight of a feature item increases in proportion to the number of times it appears in the document and decreases in inverse relation to how frequently it appears across the corpus. It is defined as follows:

where tf_lk is the number of times feature item k appears in document l, df_k is the number of documents in the document set that contain feature item k, and N is the total number of documents in the document set. TF-IDF is currently the most widely studied and applied method.
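The formula referred to above did not survive in this text. With the symbols just defined, the classic TF-IDF weight is commonly written as below. This is the textbook form, offered as an assumption; the original may differ in details such as smoothing or normalization. The symbol w_{lk}, the weight of feature item k in document l, follows the notation used in step 10 of the detailed description.

```latex
w_{lk} = tf_{lk} \times \log\frac{N}{df_k}
```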

In summary, the existing hot word discovery methods described above have the following problems:

(1) They ignore the fact that Internet public opinion items state their themes clearly. To raise click-through rates and public attention, publishers of online news generally state the event's theme and viewpoint explicitly in the title of the page, so the keywords in the title carry high information value. Existing hot word discovery methods, however, do not consider the information value of the title when vectorizing a document; they simply concatenate the title with the body text for processing. As a result, feature extraction must handle a large volume of data, is inefficient, and is prone to extracting feature items inaccurately.

(2) The IDF table used for weight calculation is difficult to maintain. On the one hand, IDF values are hard to compute for new words and for words not registered in the table. On the other hand, existing IDF tables must be updated manually and have long update cycles, while public opinion news is published at a rate of tens of thousands of items per day; an IDF table that cannot be updated in real time produces ever larger deviations in the data.

(3) They ignore the useful information carried by part of speech. In TF-IDF feature weight calculation, each word has a part of speech; named entities carry more information than non-named entities, and unregistered feature items are more likely to be hot words than recognizable ones. Existing hot word discovery algorithms nevertheless ignore this information and give all candidate words the same weight.

To address these problems, the invention introduces a weighted TF-IDF calculation that improves on the traditional TF-IDF formula: it assigns different weights to different parts of speech and updates the IDF table incrementally, which improves the accuracy of hot word recognition. At the same time, given the sheer volume of public opinion information and the clarity of its themes, the invention improves text processing efficiency by treating the title as primary and the body text as secondary.

Summary of the invention

The invention addresses the problems and shortcomings of existing hot word discovery methods in the self-media era by providing a hot word discovery method based on a keyword-weighted TF-IDF algorithm. It targets the efficiency and accuracy of hot word discovery over massive public opinion information, so that hot words can be identified efficiently and accurately.

To achieve the above purpose, the invention provides the following technical solution:

A method for discovering public opinion hot words based on a keyword weighting algorithm, comprising:

a public opinion corpus, which stores the preprocessed mass public opinion information crawled from the Internet;

a filter lexicon, divided into a part-of-speech filter table and a word-meaning filter table, used to filter out of the segmentation results function words such as particles, prepositions, and conjunctions, modifying adjectives, adverbs of degree, and numeral-quantifier combinations, as well as words with no real meaning;

an IDF table, which stores the inverse document frequency of words or phrases and is updated dynamically;

a part-of-speech weight table, which stores the weights of different parts of speech. Weight levels range from 1 to 5 in increasing order.

A public opinion information preprocessing module: after the relevant public opinion web pages are collected, it filters out noise such as images, advertisements, and links, extracts the title and content of each news item, and stores them in the public opinion corpus as the basis for subsequent text processing.

A text segmentation module: it segments the text in the corpus using a combination of dictionary-based and statistical methods and tags each resulting word or phrase with its part of speech, so that new words and unregistered words are recognized.

A noise filtering module: it compares the part of speech and meaning of each item in the segmentation set against the filter lexicon; words and phrases that appear in the filter lexicon no longer take part in subsequent calculation as candidate hot words.

A weight calculation module: for each candidate hot word or phrase that passes the noise filtering module, it looks up the weight in the part-of-speech weight table and the corresponding inverse document frequency in the IDF table. From these two values it generates the word's heat value using the weighted TF-IDF calculation. The formula is as follows:

where f_title denotes the frequency with which feature item k appears in the news title, f_content denotes the frequency with which feature item k appears in the news body, W_nature denotes the weight of the part of speech of feature item k, Sum_l denotes the total number of tokens in the segmentation set of the news item, df_k denotes the number of news items in the whole news set that contain feature item k, and N denotes the total number of news items in the whole news set.
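The weighted formula itself is missing from this text. One reconstruction consistent with the symbols defined above, and with the statement in step 9 of the detailed description that a title occurrence counts 5 times, is sketched below. The exact arrangement of terms in the original is an assumption, and w_{lk} here denotes the heat contribution of feature item k in news item l.

```latex
w_{lk} = \frac{5\,f_{title} + f_{content}}{Sum_l} \times W_{nature} \times \log\frac{N}{df_k}
```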

A hot word extraction module: it maintains a candidate hot word list that stores each word and its heat value as a key-value pair. Candidate words are inserted into the list in turn after their heat has been calculated; if a word already exists, its heat value is updated. After all candidate hot words have been processed, sorting the list in descending order of heat yields the set of hot words for the period.

An IDF table update module: after each processing run, it updates the IDF table according to an incremental IDF formula, to overcome the inaccurate hot word recognition caused by long inverse document frequency update cycles under massive data. The incremental IDF formula is as follows:

where df_old denotes the number of news items containing feature item k among the N_old news items in the total news set before the IDF table is updated, and df_new denotes the number of news items containing feature item k among the N_new news items added in one processing run.
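The incremental formula is also missing from this text. A form consistent with the update rules given in step 12 of the detailed description (df_k = df_old + df_new and N = N_old + N_new) would be the following; the logarithmic form log(N/df_k) is an assumption carried over from the standard IDF definition.

```latex
idf_k = \log\frac{N_{old} + N_{new}}{df_{old} + df_{new}}
```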

The benefits of the invention are as follows. The invention uses a Chinese word segmentation tool to perform preliminary segmentation of massive public opinion information and produce part-of-speech tags; combining an IDF table, a filter word table, and a part-of-speech weight table, it calculates the heat value of candidate words according to a weighted TF-IDF algorithm. The calculation does not rely on word frequency alone but takes full account of the useful information carried by a word's part of speech and position, which gives hot word identification a reliable basis. In addition, the invention takes full account of the fact that public opinion titles in the self-media era state their themes clearly, and processes mainly the titles, which addresses the efficiency of hot word recognition over massive public opinion information. Finally, the IDF table is updated dynamically and incrementally, which keeps the inverse document frequency of words current and improves the accuracy of hot word recognition.

Description of the drawings

Figure 1: System flowchart

Detailed description

The invention discloses a hot word discovery method for mass public opinion information based on keyword-weighted TF-IDF, comprising:

a public opinion corpus, which stores the preprocessed mass public opinion information crawled from the Internet;

a weight calculation module: for each candidate hot word or phrase that passes the noise filtering module, it looks up the weight in the part-of-speech weight table and the corresponding inverse document frequency in the IDF table. From these two values it generates the word's heat value using the weighted TF-IDF calculation.

an IDF table update module: after each processing run, it updates the IDF table according to an incremental IDF formula, to overcome the inaccurate hot word recognition caused by long inverse document frequency update cycles under massive data.

The invention is described in detail below with reference to a preferred example.

As shown in Figure 1, the system comprises a public opinion corpus, a filter lexicon, an IDF table, and a part-of-speech weight table. Web news for a given period is crawled from the Internet and passed item by item through the public opinion preprocessing module for cleaning; the resulting relatively clean news data is stored in the public opinion corpus in a standard format and then segmented and part-of-speech tagged with a Chinese word segmentation tool. In the noise filtering module, words or phrases with no real meaning are removed from the segmentation set by reference to the filter lexicon. The result is passed to the weight calculation module, which computes a heat value for each word in turn from its part of speech and position using the weighted TF-IDF algorithm. The hot word extraction module then produces a hot word table sorted in descending order of heat value. Finally, each word in the IDF table is passed to the IDF update module for an incremental update of its inverse document frequency. The specific process is as follows:

Step 1: Build a filter lexicon. It can be built by reference to the existing stop word lists in open-source segmentation packages such as Jieba and ansj_seg. In this example the invention mainly processes public opinion news data. Considering the five elements of a news report, the five Ws (when, where, who, what, why), the invention builds a part-of-speech filter table to filter out function words such as particles, prepositions, and conjunctions, modifying adjectives, adverbs of degree, and numeral-quantifier combinations. In addition, to counter the influence of high-frequency words in news data that are unrelated to the reported content, such as "搜狐" (Sohu), "记者" (reporter), and "人民日报" (People's Daily), the invention builds a word-meaning filter table by combining the stop word lists of ansj_seg and the Chinese Academy of Sciences segmenter with irrelevant high-frequency words counted from actual news data.

Step 2: Build an IDF table. Historical news data for a period of time can be obtained from a public data platform or with a web crawler and processed to obtain the inverse document frequency of each word. The invention uses a web crawler to obtain the historical news set, computes the IDF value of each word after word segmentation and frequency counting, and builds the IDF table from these values. The table is stored in the format {term k, idf_k, df_k, N} and is updated adaptively by the IDF table update module described later. Here df_k denotes the number of news items in the whole news set that contain term k, N denotes the total number of news items in the whole news set, and idf_k denotes the inverse document frequency of term k, which can be calculated by the following formula:
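The formula for this step is missing from the text. A minimal sketch of building the table follows, assuming the standard form idf_k = log(N / df_k) and assuming each news item has already been segmented into a list of tokens. The function name `build_idf_table` and the dictionary layout are illustrative choices, not taken from the source.

```python
import math
from collections import Counter

def build_idf_table(tokenized_news):
    """Build {term: {"idf", "df", "N"}} from a list of token lists."""
    n_docs = len(tokenized_news)
    df = Counter()
    for tokens in tokenized_news:
        # Count each term at most once per news item (document frequency).
        df.update(set(tokens))
    return {
        # Assumed form idf = log(N / df); the original formula is not in the text.
        term: {"idf": math.log(n_docs / count), "df": count, "N": n_docs}
        for term, count in df.items()
    }

# Toy historical news set, already segmented (hypothetical data).
history = [["bus", "route", "change"], ["bus", "fare"], ["weather", "forecast"]]
idf_table = build_idf_table(history)
```

Storing df and N next to idf, as the {term k, idf_k, df_k, N} format requires, is what makes a later incremental update possible without rescanning the historical news.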

Step 3: Build a part-of-speech weight table. An existing part-of-speech tag set, such as the ICTPOS 3.0 tag set, can serve as a reference. The invention takes the tag set of the ansj_seg segmentation tool as its basis and, according to the characteristics of news reports, assigns different parts of speech different weights between 1 and 5. Nouns are the baseline part of speech, with a weight of 3; words that carry more information, such as place names, personal names, and unregistered new words, are given weights higher than the baseline.
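As an illustration, such a table might look like the sketch below. Only the noun baseline of 3 and the rule that names and new words rank above it come from the text. Every other number, the fallback weight, and the tag names other than nw (which follow the common ICTCLAS-style convention) are assumptions.

```python
# Hypothetical part-of-speech weight table keyed by ICTCLAS-style tags.
POS_WEIGHTS = {
    "n": 3,   # noun: baseline weight stated in the text
    "nr": 5,  # personal name (assumed value, above the baseline)
    "ns": 5,  # place name (assumed value, above the baseline)
    "nt": 4,  # organization name (assumed value)
    "nw": 5,  # unregistered new word (assumed value, above the baseline)
    "v": 2,   # verb (assumed value)
}
DEFAULT_WEIGHT = 1  # assumed fallback for tags not listed above

def pos_weight(tag):
    """Return the weight for a part-of-speech tag, or the fallback."""
    return POS_WEIGHTS.get(tag, DEFAULT_WEIGHT)
```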

Step 4: Build a public opinion corpus to store the cleaned news data obtained by the web crawler.

Step 5: Build a hot word table to store, in the form {word, heat value}, the set of hot words produced by hot word recognition.

Step 6: Crawl public opinion data. The invention uses one day as the time unit and periodically obtains the public opinion news for that period with a web crawler. The public opinion information preprocessing module cleans the crawled web pages, filtering out noise such as images, advertisements, and links, extracts the title and content of each news item, and stores them in the public opinion corpus as the basis for subsequent text processing.

Step 7: Read the news items in the public opinion corpus one by one and segment them. Existing open-source segmentation tools such as Jieba or Pangu, or a custom segmentation algorithm, can be used. The invention mainly uses the open-source ansj_seg tool, which combines dictionary-based and statistical methods, to segment each news item into a segmentation set that represents it, and tags each token with its part of speech; unregistered words are tagged nw. In this step the total number of tokens Sum_l in the segmentation set of news item l is also counted.

Step 8: Filter the segmentation set that represents the news item. The noise filtering module compares each word in the segmentation set, by part of speech and by meaning, against the part-of-speech filter table and the word-meaning filter table. Words whose part of speech appears in the filter lexicon (for example adjectives, a; time words, t; adverbs, d) and meaningless high-frequency words and phrases (for example "记者" (reporter), "新华网" (Xinhuanet), "图片" (picture)) have their weight set to zero and no longer take part in subsequent processing as feature items of the document vector. This step yields the feature vector set that represents the news item.
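A minimal sketch of this filtering step follows, assuming the segmenter returns (word, tag) pairs. The two small tables hold only the examples named in the text, whereas real tables would be built as in step 1. Dropping a token here has the same effect as setting its weight to zero. The names `POS_FILTER`, `MEANING_FILTER`, and `filter_tokens` are illustrative.

```python
# Hypothetical filter tables containing only the examples given in the text.
POS_FILTER = {"a", "t", "d"}                 # adjective, time word, adverb
MEANING_FILTER = {"记者", "新华网", "图片"}    # irrelevant high-frequency words

def filter_tokens(tagged_tokens):
    """Keep the (word, tag) pairs that pass both filter tables."""
    return [
        (word, tag)
        for word, tag in tagged_tokens
        if tag not in POS_FILTER and word not in MEANING_FILTER
    ]

# Toy segmentation result for one news item (hypothetical data).
tagged = [("记者", "n"), ("常州", "ns"), ("公交", "n"), ("非常", "d"), ("今天", "t")]
features = filter_tokens(tagged)
```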

Step 9: Calculate a weight for each feature item in the feature vector set. Take each feature item from the set in turn and process it as follows:

A. Calculate the inverse document frequency idf_k of feature item k. Look the feature item up in the IDF table. If it exists, take the corresponding inverse document frequency idf_k. If it does not exist, take the total document count N from any entry in the IDF table (all entries share the same total document count) and insert the feature item into the IDF table in the format {term k, idf_k, 0, N}. The inverse document frequency of this feature item can be calculated by the following formula:

idf_k = log(100 + N)

B. Obtain the part of speech of feature item k and look up its weight W_nature in the part-of-speech weight table. Parts of speech that carry more information, such as personal names, place names, organization names, and new words, have higher weights.

C. Calculate the weight of feature item k, that is, its TF-IDF value. Word frequency is counted according to the position of the feature item: if the feature item appears in the title, that occurrence is weighted 5 times. Combining this with the part-of-speech weight W_nature obtained in step B, the inverse document frequency idf_k obtained in step A, and the total number of tokens Sum_l in the segmentation set of news item l, the weight of feature item k is calculated by the following formula:

where f_title denotes the frequency with which feature item k appears in the news title, and f_content denotes the frequency with which feature item k appears in the news body.
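Steps A to C can be sketched as follows. The weighting formula is missing from the text, so the form used in `term_weight` (title count times 5 plus body count, divided by Sum_l, times the part-of-speech weight and the IDF) is an assumption assembled from the symbols the text defines. The function names and the dictionary layout of the IDF table are illustrative, f_title and f_content are treated as raw counts, and the table is assumed to be non-empty.

```python
import math

TITLE_BOOST = 5  # a title occurrence counts 5 times, as stated in step C

def lookup_idf(idf_table, term):
    """Step A: return idf for a term, inserting unseen terms with df = 0."""
    entry = idf_table.get(term)
    if entry is None:
        # All entries share the same N, so take it from any one of them.
        n_docs = next(iter(idf_table.values()))["N"]
        entry = {"idf": math.log(100 + n_docs), "df": 0, "N": n_docs}
        idf_table[term] = entry
    return entry["idf"]

def term_weight(f_title, f_content, sum_l, w_nature, idf_k):
    """Step C (assumed form): boosted term frequency x POS weight x idf."""
    tf = (TITLE_BOOST * f_title + f_content) / sum_l
    return tf * w_nature * idf_k

# Toy IDF table with one known term (hypothetical data).
idf_table = {"公交": {"idf": math.log(1000 / 10), "df": 10, "N": 1000}}
w_known = term_weight(1, 2, 50, 3, lookup_idf(idf_table, "公交"))
w_new = term_weight(1, 0, 50, 5, lookup_idf(idf_table, "新词"))
```

Because an unseen term receives idf_k = log(100 + N), which exceeds log(N / df_k) for any registered term, new words start with a high IDF, in line with the observation that unregistered items are more likely to be hot words.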

Step 10: Step 9 yields a weighted vector that represents news item l. Statistics show that about 10 words generally cover the main content of a news item, so the vector is sorted in descending order of weight and the top 10 feature items are added to the hot word table. If a feature item already exists in the hot word table, its heat value is updated by the following formula:

H_k = H_k + w_lk

where H_k denotes the existing heat value of feature item k in the hot word table.
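The accumulation in this step can be sketched as below, assuming the weights of one news item are available as a {term: weight} dictionary. The function names are illustrative, and `top_k` is exposed as a parameter only so that the toy example can use a smaller value than the 10 given in the text.

```python
import heapq

TOP_K = 10  # the text keeps the 10 highest-weighted feature items per news item

def merge_into_hot_table(hot_table, news_weights, top_k=TOP_K):
    """Add the top_k weighted terms of one news item: H_k = H_k + w_lk."""
    top = heapq.nlargest(top_k, news_weights.items(), key=lambda kv: kv[1])
    for term, weight in top:
        hot_table[term] = hot_table.get(term, 0.0) + weight
    return hot_table

def ranked_hot_words(hot_table):
    """Return (term, heat) pairs in descending order of heat."""
    return sorted(hot_table.items(), key=lambda kv: kv[1], reverse=True)

# Two toy news items (hypothetical weights), keeping the top 2 terms of each.
hot = {}
merge_into_hot_table(hot, {"bus": 0.5, "fare": 0.25, "route": 0.125}, top_k=2)
merge_into_hot_table(hot, {"bus": 0.25, "weather": 0.5}, top_k=2)
```

A term that ranks in the top items of many news items accumulates heat across them, so the final ordering reflects both per-item weight and how widely the term is reported.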

Step 11: Apply steps 7 to 10 to every news item in the public opinion corpus, then sort the hot word table in descending order of heat value to obtain the set of public opinion hot words for the period (one day in this example).

Step 12: Update the IDF table incrementally, so that the inverse document frequency retrieved for a term is always the latest value.

A. Count the total number of public opinion news items N_new in this processing run.

B. For every term k in the IDF table, count the number of news items df_new in this run's news set that contain the term.

C. Update the IDF table in the format {term k, idf_k, df_k, N}, where:

df_k = df_old + df_new

N = N_old + N_new

df_old and N_old are the values stored in the table for term k before the update.
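A minimal sketch of step 12 follows. The two additive updates are taken directly from the text; recomputing idf_k as log(N / df_k) afterwards is an assumption, since the incremental formula is not reproduced here. Only terms already in the table are updated, which relies on step 9 having inserted any new terms with df = 0 during the run. The function name and dictionary layout are illustrative.

```python
import math

def update_idf_table(idf_table, new_tokenized_news):
    """Apply df_k = df_old + df_new and N = N_old + N_new to every entry."""
    n_new = len(new_tokenized_news)
    doc_sets = [set(tokens) for tokens in new_tokenized_news]
    for term, entry in idf_table.items():
        # Number of news items in this run that contain the term.
        df_new = sum(1 for doc in doc_sets if term in doc)
        entry["df"] += df_new
        entry["N"] += n_new
        if entry["df"] > 0:
            # Assumed form idf = log(N / df); the formula is not in the text.
            entry["idf"] = math.log(entry["N"] / entry["df"])
    return idf_table

# Toy table: one established term and one term inserted with df = 0 in step 9.
table = {
    "bus": {"idf": math.log(100 / 10), "df": 10, "N": 100},
    "metro": {"idf": math.log(100 + 100), "df": 0, "N": 100},
}
update_idf_table(table, [["bus", "metro"], ["metro"], ["rain"], ["bus"]])
```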

Claims


1. A public opinion corpus, which stores the preprocessed mass public opinion information crawled from the Internet.

2. A filter lexicon, divided into a part-of-speech filter table and a word-meaning filter table, used to filter out of the segmentation results function words such as particles, prepositions, and conjunctions, modifying adjectives, adverbs of degree, and numeral-quantifier combinations, as well as words with no real meaning.

3. An IDF table, which stores the inverse document frequency of words or phrases and is updated dynamically.

4. A part-of-speech weight table, which stores the weights of different parts of speech; weight levels range from 1 to 5 in increasing order.

6. A text segmentation module, which segments the text in the corpus using a combination of dictionary-based and statistical methods and tags each resulting word or phrase with its part of speech, so that new words and unregistered words are recognized.

7. A noise filtering module, which compares the part of speech and meaning of each item in the segmentation set against the filter lexicon; words and phrases that appear in the filter lexicon no longer take part in subsequent calculation as candidate hot words.

8. A weight calculation module, which, for each candidate hot word or phrase that passes the noise filtering module, looks up the weight in the part-of-speech weight table and the corresponding inverse document frequency in the IDF table, and from these two values generates the word's heat value using the weighted TF-IDF calculation.

9. A hot word extraction module, which maintains a candidate hot word list that stores each word and its heat value as a key-value pair; candidate words are inserted into the list in turn after their heat has been calculated, and if a word already exists its heat value is updated; after all candidate hot words have been processed, sorting the list in descending order of heat yields the set of hot words for the period.

10. An IDF table update module, which after each processing run updates the IDF table according to an incremental IDF formula, to overcome the inaccurate hot word recognition caused by long inverse document frequency update cycles under massive data.