Summary of the invention
To address the problems that arise from the characteristics of online public opinion information described in the background above, and from the management of such information, the invention provides a method for analyzing online public opinion information based on text semantic relevance.
The technical solution adopted by the invention to solve this technical problem is a method for analyzing online public opinion information based on text semantic relevance. The method uses an online public opinion information analysis system comprising an online public opinion information acquisition module, a public opinion information extraction module, a public opinion information preprocessing module, a public opinion information mining module, a public opinion information analysis module and a public opinion information database, and comprises the following steps:
a. The online public opinion information acquisition module collects public opinion information of various kinds from web pages and stores it in the public opinion information database;
b. The public opinion information extraction module and the public opinion information preprocessing module perform preliminary filtering and segmentation on the public opinion information collected in step a, extract the content information contained in the text, and provide data services for public opinion information mining;
c. On the basis of step b, the public opinion information mining module applies an improved text clustering analysis method based on semantic similarity, generates category descriptors, and selects the text information contained in the clustering result; it uses a TFIDF word-frequency feature calculation method based on feature statistics to compute category features and obtain class feature words, selects nouns as candidate class feature words, sorts them by candidate feature word weight, takes the candidate feature words with larger weights as category keywords, and uses the semantic relations between category keywords to form the classification result; it identifies and establishes new online public opinion topics, and detects and tracks content related to existing public opinion topics;
d. Finally, the public opinion information analysis module performs OLAP multidimensional statistical analysis on the public opinion data mined in step c, and analyzes public opinion evaluation indicators such as the attention paid to public opinion topic content and the sentiment tendency of public opinion topics.
In step a, the public opinion information acquisition module collects online public opinion information sources. Unlike a general web crawler, it must not only crawl web pages but also format the page content and extract the topic and content of the public opinion; the resulting data is saved as txt or html files and stored in the public opinion information database. To avoid being blocked, the acquisition module combines three techniques: time-shared access, periodic change of IP address, and single sign-on through a simulated browser. The specific steps performed by the acquisition module are: starting from the URLs of predefined topic-related web pages, obtain the text information in each page, extract new URLs from the current page and put them into a queue, and continue until all public opinion information satisfying the conditions has been collected and the URL queue is empty. The collected web page text information is stored in the public opinion information database by field category, for the public opinion information extraction module to call.
In step b, the public opinion information extraction module removes irrelevant content from the web page, such as noise data including advertisements, navigation information, pictures and copyright notices, extracts the meta-information of the body part that is useful for public opinion analysis, and reconstructs the text so that information representative of the topic is gathered together. The public opinion information preprocessing module takes the collected public opinion information sources after they have passed through the extraction module and performs Chinese word segmentation, stop-word filtering, named entity recognition, part-of-speech tagging, syntactic parsing and feature word extraction, and builds a forward index and an inverted index. It builds a text feature semantic network graph in which the entities E contained in the text are the nodes and the semantic relation between two entities is a directed edge; the semantic relations between entities, combined with word frequency information, give the weight of a node, and the weight of a directed edge represents the importance of the entity relation in the text. The entities E comprise thing entities NE, event entities VE and event relation entities RE. The word frequency and document frequency information of the text is counted, and feature words are then extracted, choosing the vocabulary that reflects the characteristics of the text to represent it.
To carry out text analysis such as text mining and natural language processing on online public opinion information, word segmentation must be performed first. Drawing on research results in the field of Chinese word segmentation in China, the method uses the word segmentation, part-of-speech tagging and named entity recognition functions of the ICTCLAS Chinese lexical analysis system developed by the Institute of Computing Technology, Chinese Academy of Sciences. The text content of the public opinion information is segmented, and words with a length greater than two are extracted. After segmentation, stop words that are useless for machine understanding of the text are filtered out, and words of parts of speech such as nouns, verbs, noun-adjectives and verb-adjectives are retained to obtain a candidate feature word set; this effectively reduces the size of the index, increases retrieval efficiency and improves accuracy. A forward index and an inverted index are built over the segmented text documents to support interactive user queries. After segmentation, part-of-speech tagging and stop-word removal, the feature semantic network graph of the text is built, information such as the word frequency and document frequency of the text is counted, and weighting, feature extraction and related processing are then carried out.
In step c, the public opinion information mining module works on the text data set generated by the information extraction module after the text set has been preprocessed, including Chinese word segmentation, stop-word filtering and analysis of structured tag information. Using the text semantic feature description structure built from the text feature semantic network graph, it calculates the semantic similarity between texts with a similarity evaluation method, builds a similarity matrix, and generates the clustering result with the improved text clustering analysis method based on semantic similarity. Category descriptors are generated from the clustering result, and the text information contained in the clustering result is selected. The TFIDF word-frequency feature calculation method based on feature statistics is used to compute category features and obtain candidate class feature words; nouns are selected as candidate class feature words and sorted by candidate feature word weight, the weight values determine which candidate feature words become category keywords, and the semantic relations between category keywords are used to form the classification result. A knowledge base is built from the result, and the knowledge base can also be configured to support text mining functions such as public opinion topic discovery and public opinion sentiment classification.
In step d, the public opinion information analysis module performs OLAP multidimensional statistical analysis on the data mined in step c and stored in the public opinion information database. It analyzes public opinion evaluation indicators such as topic attention, content sensitivity, propagation and diffusion, and the influence of published public opinion, and so supports the relevant departments in grasping public opinion trends in time, publishing public opinion information at the right moment and making correct decisions.
Compared with the prior art, the invention has the following beneficial effects:
1. Current online public opinion information is massive, dynamic, incomplete and diverse in form, and existing public opinion analysis methods often ignore the correlations within the text content of public opinion information, which makes the analysis results inaccurate. The invention builds a text feature semantic network graph model of the public opinion text, introducing into the text description structure the semantic associations between words and their links to the surrounding context. Combined with the improved text clustering algorithm based on semantic similarity, it mines and analyzes the contextually and semantically related content in public opinion texts.
2. By building the text feature semantic network graph of a public opinion text, the contextual relations between words in the text are formed into a directed graph structure composed of feature items and weights. This retains the contextual structure of the words while strengthening the contextual semantics of words in the text, describes the semantic information and topic features implicit in the text well, and solves the problem of missing word-level semantic information in the text.
3. The improved text clustering algorithm based on semantic similarity is suited to clustering dynamic data and discovering public opinion hot topics in a large-scale network environment. By calculating the semantic similarity between texts and building a text semantic similarity matrix, it mines in depth the contextually and semantically related content in public opinion texts and detects and tracks new topic events in time. A topic within a class is represented by multiple centers, and the maximum of the similarities between a text and each center in the class is taken as the similarity between the text and the class; this effectively improves system efficiency, and the clustering effect becomes more pronounced as the number of texts increases.
Embodiment
The invention is further described below with reference to the drawings and specific embodiments. However, embodiments of the invention are not limited to these.
As shown in Figure 1, the method of the invention uses an online public opinion information analysis system comprising an online public opinion information acquisition module, a public opinion information extraction module, a public opinion information preprocessing module, a public opinion information mining module, a public opinion information analysis module and a public opinion information database. Its processing flow is:
(1) Public opinion information collection
Online public opinion information sources are collected. Unlike a general web crawler, the acquisition module must not only crawl web pages but also format the page content and extract the useful public opinion information, such as the topic and content of the public opinion; the resulting data is saved as txt or html files and written to the raw public opinion information database. The specific steps are as follows. According to a preset online public opinion information collection strategy, and starting from the URLs of a number of seed pages, requests following the HTTP protocol (using the GET method) are sent through the common ports; the remote server returns a document of HTML type according to the content of the request. The acquisition module first saves all the information in the returned document to a cache, then sends it to the database for storage, and obtains the text information in the web page. While obtaining the web page text, it continually extracts newly discovered hyperlink URLs from the current page and visits them, discarding hyperlink URLs that have already been visited, and repeats this cycle until all web page text matching the search strategy has been collected and the queue of URLs to visit is empty. The collected web page text information is stored in the database by field category, for the public opinion information extraction module to call.
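The collection loop described above can be sketched as a queue-driven crawl. This is only a minimal illustration, not the implementation of the invention: the names `crawl`, `fetch`, `is_relevant` and `max_pages` are hypothetical, `fetch` stands in for the HTTP GET request, and `is_relevant` stands in for the collection strategy.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, is_relevant, max_pages=100):
    """Visit pages from the seed URLs until the URL queue is empty.

    fetch(url) returns the HTML text or None (hypothetical stand-in for GET).
    is_relevant(html) is a hypothetical stand-in for the collection strategy.
    """
    queue = deque(seed_urls)
    visited = set()
    collected = {}
    while queue and len(collected) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # discard URLs that were already visited
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue
        if is_relevant(html):
            collected[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in visited:
                queue.append(link)
    return collected
```

Passing `fetch` as a parameter keeps the loop independent of the network layer, so the anti-blocking measures (time-shared access, changing IP address, simulated login) could be placed inside it without changing the traversal.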
To avoid being blocked, the online public opinion information acquisition module usually combines several techniques, such as time-shared access, periodic change of IP address, and single sign-on through a simulated browser. Many websites, such as forums, blogs and microblogs, can be accessed only after user login; for these, the strategy of simulating a browser is easier to implement. The Web Browser control provided by the Microsoft .NET development tool Visual Studio 2008 is used to call the API of Microsoft Internet Explorer, SSO single sign-on is simulated to submit the user name and password, and once the user login information has loaded the page jumps to the corresponding URL; the source file of the required web page is then obtained by submitting keywords for retrieval.
The collected web page text information has two parts: Web content information, and Web structure and Web usage record information. Web content information contains text content such as news headlines, body text and comments; Web structure and Web usage record information contains statistics such as click counts, page views and comment counts.
(2) Public opinion information extraction
The collected web page information contains noise data such as advertisements, navigation information, pictures and copyright notices, whereas what public opinion analysis really needs is the meta-information of the body part. This irrelevant content is removed and the meta-information of the body part that is useful for public opinion analysis is extracted, to serve the subsequent mining and analysis of the text. The specific flow is as follows:
(2-1) First, the Tidy tool is used to normalize the HTML markup of the web page text; an HTML parser tool is then used to build an HTML tree with the HTML tags as its nodes. This representation makes the HTML code easier to manage and operate on, and allows better structured mining of the code.
(2-2) Relevant information such as the title, keywords, body text, length, update time and URL is extracted from the collected public opinion information sources. The title can be taken from the information between the tags <TITLE> and </TITLE>; the keywords are contained in the META tags in the head of the HTML file and can be extracted from the META tag information; the time information can be extracted by pattern matching and web page analysis.
(2-3) The specific steps of body text extraction are: select suitable keywords, obtain the URL addresses of the relevant web pages, and obtain the HTML source code of each page by accessing the server at the URL address; delete the useless markup lines in the page source and retain the body content of the page; replace the paragraph markers in the HTML code (such as </p> and <br>) with special symbols (such as *[/p]* and *[/br]*), replace carriage returns and line feeds with a line separator, and store the content line by line to retain the format of the page content; extract the text between the HTML tags on each line; replace the special symbols (such as *[/p]* and *[/br]*) with carriage returns to keep the original paragraphs of the text; remove the HTML escape sequences (such as &quot; and &lt;) from the resulting string, and use regular expression matching to extract the final body text.
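The body text extraction in step (2-3) can be illustrated with the following simplified sketch. It assumes the page source has already been fetched, and the function name `extract_body_text` is hypothetical; a real page would need the noise removal of steps (2-1) and (2-2) first, and regular expressions alone are not a robust HTML parser.

```python
import html
import re

PARA_MARK = "*[/p]*"    # placeholder for paragraph ends, as in the text
BREAK_MARK = "*[/br]*"  # placeholder for line breaks, as in the text


def extract_body_text(source):
    """Strip HTML tags while keeping the original paragraph breaks."""
    # Remove script and style blocks, which never carry body text.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", source)
    # Replace paragraph and line-break markup with placeholders.
    text = re.sub(r"(?i)</p\s*>", PARA_MARK, text)
    text = re.sub(r"(?i)<br\s*/?>", BREAK_MARK, text)
    # Drop every remaining tag, keeping only the text between tags.
    text = re.sub(r"<[^>]+>", "", text)
    # Turn the placeholders back into line breaks.
    text = text.replace(PARA_MARK, "\n").replace(BREAK_MARK, "\n")
    # Decode HTML escape sequences such as &quot; and &lt; last, so that
    # a decoded "<" is never mistaken for the start of a tag.
    text = html.unescape(text)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```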
After extracting relevant information such as the title, keywords, body text, length, update time and URL from the collected public opinion information sources, the public opinion information extraction module also reconstructs the text information.
Text reconstruction analyzes the structural features of the forms in which public opinion information appears, such as online news, forum posts and microblog posts, and of their body text; the information that represents the topic forms a "gist block" and the remaining information forms a "content block", in order to improve the clustering effect.
For web page news, text reconstruction forms the "gist block" from the title and first paragraph of the news item, and the "content block" from the remaining descriptive text and the comments.
For forum posts, text reconstruction forms the "gist block" from the title and the main post. Replies and follow-up posts are cleaned by removing posts without Chinese character content and posts that use only stock evaluation words, and a number of the remaining posts are selected to form the "content block".
(3) Public opinion information preprocessing
After public opinion information extraction, preprocessing such as Chinese word segmentation, named entity recognition, part-of-speech tagging, syntactic parsing and feature word extraction is carried out, and the results are saved in the database. To carry out text analysis such as text mining and natural language processing on online public opinion information, word segmentation must be performed first. Drawing on research results in the field of Chinese word segmentation in China, the Chinese lexical analysis system ICTCLAS, developed by the Institute of Computing Technology, Chinese Academy of Sciences, is used to segment and part-of-speech tag the text; through Chinese word segmentation, words with a length greater than two are extracted. The functions of ICTCLAS include Chinese word segmentation, part-of-speech tagging and new word recognition; it performs named entity recognition with a role model method; and it supports user-defined dictionaries, giving both high segmentation precision and good segmentation results. The code is as follows:
After segmentation, stop words that are useless for machine understanding of the text are filtered out, and words of parts of speech such as nouns, verbs, noun-adjectives and verb-adjectives are retained to obtain a candidate feature word set. This avoids verbose text, effectively reduces the size of the index, increases retrieval efficiency and improves retrieval speed.
A forward index and an inverted index are built over the segmented text to support interactive user queries. For the forward index, the top N words by word frequency are selected to represent the text, expressed as a hash table: <file name, keyword group>. After the forward index has been built, each keyword is searched for in the texts, all file names containing that keyword are found and collected into a file name group, and the inverted index is obtained, expressed as a hash table: <keyword, file name group>.
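The two hash tables can be sketched in a few lines. This is an illustrative in-memory version only, assuming the texts are already segmented into word lists; the function names and the `top_n` parameter are hypothetical, and the embodiment itself relies on Lucene for indexing.

```python
from collections import Counter, defaultdict


def build_forward_index(docs, top_n=3):
    """Map each file name to its top-N words by frequency."""
    forward = {}
    for name, words in docs.items():
        counts = Counter(words)
        forward[name] = [word for word, _ in counts.most_common(top_n)]
    return forward


def build_inverted_index(forward):
    """Map each keyword to the file names whose keyword group contains it."""
    inverted = defaultdict(list)
    for name, keywords in forward.items():
        for keyword in keywords:
            inverted[keyword].append(name)
    return dict(inverted)
```

A query for a keyword then reduces to a single lookup in the inverted table, which is what makes interactive retrieval cheap.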
Index building and index retrieval are implemented on the open-source Apache project Lucene, which provides a complete query engine, index engine and text analysis engine; Hadoop is used to store and manage the massive index files.
The index is built as follows:
1. Create the index writer object IndexWriter. A vocabulary analyzer must be supplied when this object is created, and different analyzers use different dictionaries. If ThesaurusAnalyzer is selected, a synopsis can be extracted;
2. Create a Document object for each result set taken from the database;
3. Create a Field object for each data element in the result set and add it to the Document object;
4. Write the Document object.
The index is searched as follows: first create a query parser, which needs parameters such as the Field object name and the corresponding vocabulary analyzer; then obtain a query object from the query parser and the keywords; finally obtain the result set of the retrieval from the query object. The result set consists of Document objects.
After segmentation, part-of-speech tagging and stop-word removal, the feature semantic network graph of the text is built, information such as the word frequency and document frequency of the text is counted, and weighting, feature extraction and related processing are then carried out.
The text feature semantic network graph is a directed graph that expresses public opinion information through entities and their semantic relations. The entities E contained in the text (thing entities NE, event entities VE and event relation entities RE) are the nodes of the graph, and the semantic relation between two entities is a directed edge. The semantic relations between entities, combined with word frequency information, give the weight of a node, and the weight of a directed edge represents the importance of the entity relation in the text. The graph is built by introducing concept-based merging and simplification of the network node weights, so as to extract the core semantics of the text: the words represented by the network nodes are merged and their node weights added, then the directed edges are merged and their weights added, producing a text feature semantic network graph that describes the semantic information and topic features of the text. The concepts are defined as follows:
C1: A thing entity NE is defined as NE(id, concept, property, power), where id is the entity identifier, concept is the entity concept, property is the entity attribute and power is the weight.
C2: An event entity VE is defined as VE(id, concept, property, power, isN, subT, objT1, objT2). In addition to the data items of NE, isN indicates whether the event is negated, subT is the table header of the subject entity, and objT1 and objT2 are the table headers of object entities 1 and 2.
C3: An event relation entity RE is defined as RE(id, concept, property, power, isN, subT, objT). An RE can be described completely by a single subject-object entity pair.
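The three entity definitions map naturally onto record types. The sketch below is one possible representation, not the invention's data layout: the field names follow C1 to C3, but the field types are assumptions, and in particular subT, objT1, objT2 and objT are modeled here as optional identifiers of entries in the entity information table.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class NE:
    """Thing entity: NE(id, concept, property, power)."""
    id: int
    concept: str
    property: Dict[str, str] = field(default_factory=dict)  # assumed type
    power: float = 0.0                                       # weight


@dataclass
class VE(NE):
    """Event entity: adds negation, one subject and up to two objects."""
    isN: bool = False            # True if the event is negated
    subT: Optional[int] = None   # assumed: id of the subject entity
    objT1: Optional[int] = None  # assumed: id of object entity 1
    objT2: Optional[int] = None  # assumed: id of object entity 2


@dataclass
class RE(NE):
    """Event relation entity: one subject-object pair is sufficient."""
    isN: bool = False
    subT: Optional[int] = None
    objT: Optional[int] = None
```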
The analysis steps of the text feature semantic network graph model are as follows:
S1: When analyzing a text, first build the feature semantic network graph of each sentence, taking the sentence as the unit. Analyze sentence by sentence which NEs each sentence produces, and record each NE and its attribute information in the entity information table.
S2: After the NEs have been analyzed, analyze the VEs and register the concept, attribute, subject and object of each VE. VEs with the same subject and object are represented as the same VE; otherwise they are given different ids.
S3: Next, analyze the REs, taking care to distinguish them from NEs and VEs, and register the concept, attribute, subject and object of each RE in the entity information table.
S4: When the analysis is finished, the entity information table of the sentence is obtained. The entity information table describes the relations between entities and is used to construct the entity relation graph; the relations between NE and VE, between RE and NE or VE, and between an entity E and an attribute T are visualized with different kinds of lines.
S5: On the basis of the feature semantic network graph built for the first sentence, merge in the feature semantic network graphs of the following sentences, merging the nodes first and then the directed edges.
S6: When merging nodes, nodes whose words are identical, or whose semantic similarity meets the threshold condition, are merged and their node weights added; otherwise the node is retained.
S7: When merging directed edges, the directed edges that exist between the merged nodes are merged and their weights added.
S8: Update the weights of the edges adjacent to each newly merged node to the weight of that node, strengthening the semantic relations between nodes.
S9: After the feature semantic network graphs of all sentences have been merged, output the result; this completes the construction of the feature semantic network graph of the whole text.
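Steps S5 to S7 can be sketched as follows. This is a simplified illustration under stated assumptions: each sentence graph is a pair of dictionaries (node weights, directed edge weights), the hypothetical `canonical` function stands in for the "identical or sufficiently similar" test of S6 by mapping each word to a representative, and the edge weight update of S8 is left out.

```python
def merge_graphs(sentence_graphs, canonical=lambda word: word):
    """Merge per-sentence graphs into one text-level graph.

    sentence_graphs: iterable of (nodes, edges) where
        nodes maps word -> weight and
        edges maps (source word, target word) -> weight.
    canonical: hypothetical stand-in for the similarity test of S6.
    """
    nodes, edges = {}, {}
    for sentence_nodes, sentence_edges in sentence_graphs:
        # S6: merge nodes and add their weights.
        for word, weight in sentence_nodes.items():
            key = canonical(word)
            nodes[key] = nodes.get(key, 0.0) + weight
        # S7: merge directed edges between merged nodes and add weights.
        for (source, target), weight in sentence_edges.items():
            key = (canonical(source), canonical(target))
            edges[key] = edges.get(key, 0.0) + weight
    return nodes, edges
```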
The next step is to assign feature weights so that the text is represented accurately. Based on the characteristics of Chinese parts of speech and the elements of a complete event description (time, place, person and event content), and in combination with the Chinese part-of-speech tag set of the Chinese Academy of Sciences, the text feature weights are assigned as follows: the title has weight 3, subtitles and keywords have weight 2, the abstract has weight 1.5, and the first and last sentences of a paragraph have weight 1.3.
After preprocessing, the title, body text and replies of a public opinion text are given different labels. When weights are calculated, the label information of each keyword is read to assign the position weight of the word.
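The position weighting can be sketched as a weighted word count. The four weight values come from the text; the label names, the function name and the default weight of 1.0 for all other positions are assumptions made for this illustration.

```python
POSITION_WEIGHTS = {
    "title": 3.0,
    "subtitle": 2.0,
    "keyword": 2.0,
    "abstract": 1.5,
    "para_first": 1.3,  # first sentence of a paragraph
    "para_last": 1.3,   # last sentence of a paragraph
}
DEFAULT_WEIGHT = 1.0  # assumption: not specified in the text


def weighted_frequencies(tagged_words):
    """Sum the position weight of every occurrence of each word.

    tagged_words: iterable of (word, position label) pairs.
    """
    scores = {}
    for word, label in tagged_words:
        weight = POSITION_WEIGHTS.get(label, DEFAULT_WEIGHT)
        scores[word] = scores.get(word, 0.0) + weight
    return scores
```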
(4) Public opinion information mining
The public opinion information mining module works on the text data set generated by the information extraction module after the text set has been preprocessed, including Chinese word segmentation, stop-word filtering and analysis of structured tag information. Using the text semantic feature description structure built from the text feature semantic network graph, it calculates the semantic similarity between texts with a similarity evaluation method, builds a similarity matrix, and generates the clustering result with the improved text clustering analysis method based on semantic similarity. Category descriptors are generated from the clustering result, and the text information contained in the clustering result is selected. The TFIDF word-frequency feature calculation method based on feature statistics is used to compute category features and obtain candidate class feature words; nouns are selected as candidate class feature words and sorted by candidate feature word weight, the weight values determine which candidate feature words become category keywords, and the semantic relations between category keywords are used to form the classification result. A knowledge base is built from the result, and the knowledge base can also be configured to support text mining functions such as public opinion topic discovery and public opinion sentiment classification.
First, the similarity between texts, that is, the degree to which the topics they discuss are related, is defined and calculated. Sim(D1, D2) denotes the similarity between text D1 and text D2. The similarity takes values between 0 and 1 and increases with the degree of similarity between D1 and D2: the greater the similarity between two texts, the more closely their topics are related. The semantic similarity between texts is evaluated as follows:
Let the texts obtained after the public opinion information extraction and preprocessing of step b be D1(t11, t12, t13, ..., t1m) and D2(t21, t22, t23, ..., t2m). The similarity between every keyword t1i of text D1 and every keyword t2j of text D2 is calculated, forming the similarity matrix M(D1, D2) = [Sim_ij], 1 ≤ i, j ≤ m.
Sim_ij (1 ≤ i, j ≤ m) is the similarity between keyword t1i of text D1 and keyword t2j of text D2; M(D1, D2) is the similarity matrix between text D1 and text D2; i is the keyword number for text D1; m is the keyword number for text D2;
The word similarity formula is: S(T1, T2) = max S(y1i, y2j), taken over i = 1, 2, ..., n and j = 1, 2, ..., m; that is, the similarity of two words is the maximum similarity over all pairs of their senses (a word may have several senses).
The similarity matrix M is traversed to find the keyword pair with the largest similarity value Sim, and the corresponding row and column are deleted. The traversal of M then continues to find the next keyword pair with the largest similarity value, and this is repeated until M is a null matrix. Finally, the resulting sequence of maximum-similarity keyword pairs is used to obtain the semantic similarity of texts D1 and D2. The calculation formula is as follows:
where max is the maximum value of the similarity Sim; i is the keyword number for text D1; j is the keyword number for text D2.
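The matching procedure can be sketched as a greedy selection over the matrix. The pairing loop follows the description above; the final aggregation in `text_similarity` is an assumption, since the formula itself is not reproduced in this text: here the matched similarities are summed and divided by the larger keyword count, which is only one plausible choice. Both function names are hypothetical.

```python
def greedy_matches(matrix):
    """Repeatedly pick the largest remaining entry, then drop its row and column."""
    rows = set(range(len(matrix)))
    cols = set(range(len(matrix[0]))) if matrix else set()
    matches = []
    while rows and cols:
        best, i, j = max((matrix[r][c], r, c) for r in rows for c in cols)
        matches.append((i, j, best))
        rows.remove(i)
        cols.remove(j)
    return matches


def text_similarity(matrix):
    """Aggregate the matched keyword similarities into one text similarity.

    ASSUMPTION: average over the larger keyword count; the source formula
    is not available, so this normalization is illustrative only.
    """
    matches = greedy_matches(matrix)
    if not matches:
        return 0.0
    total = sum(score for _, _, score in matches)
    return total / max(len(matrix), len(matrix[0]))
```

With this normalization the result stays between 0 and 1 whenever the keyword similarities do, which agrees with the stated range of Sim(D1, D2).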
The improved text clustering analysis method based on semantic similarity is described as follows:
1. For all collected texts after preprocessing, the TFIDF weighting method is used to weight the features of all category keywords, and the m best feature keywords are extracted to form the original keyword-based feature vector Di*.
2. According to the knowledge base, the keywords in the original keyword-based feature vector Di* are preprocessed: the vocabulary in the knowledge base that matches each keyword is found and substituted for it, forming a new feature vector Di, Di = (T1, T2, ..., Ti), i = 1, 2, 3, ..., m.
3. The m-dimensional feature vectors Di of the n texts are formed, the text semantic similarity formula is used to calculate the semantic similarity between the collected texts, the similarity matrix M of the text set is formed, and the average similarity MA of all feature vectors is obtained. The calculation formula is as follows:
where n is the number of texts;
4. Three similarity thresholds are set: a duplication threshold of 0.9, a topic center threshold of 0.5, and a new topic threshold of 0.3;
5. Each text is compared with the central topic. If the similarity between the text and the initial center of the central topic is greater than the duplication threshold 0.9, the text is considered to have the same content within the same topic; if the similarity is less than the new topic threshold 0.3, a new class must be created for the text; if the similarity is in the range 0~0.5, the text is a core content text that discusses a different aspect of the same topic and is marked as the second center; and so on, forming a hierarchical clustering result with multiple centers.
6. For the multi-center topic representation, the maximum of the similarities between a text and each center in a class is taken as the similarity between the text and the class.
The improved text clustering algorithm based on semantic similarity is suited to clustering dynamic data and discovering public opinion hot topics in a large-scale network environment; it can detect new events in time and detect and track new public opinion topics. Representing a public opinion topic by multiple centers within a class effectively improves system efficiency, and the effect becomes more pronounced as the number of texts increases.
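Steps 4 to 6 can be sketched as a single-pass, multi-center clustering. The three threshold values come from the text; everything else is an assumption made for illustration: the function and field names are hypothetical, `sim` stands in for the text semantic similarity, a text scoring from 0.3 up to 0.5 is treated as an additional center, and a text scoring above 0.5 but not above 0.9 simply joins the class as an ordinary member, a case the text does not spell out.

```python
def cluster_texts(texts, sim, dup=0.9, center=0.5, new=0.3):
    """Assign texts to classes that may each hold several centers.

    sim(a, b) is a hypothetical stand-in for the semantic similarity.
    """
    clusters = []
    for text in texts:
        best, best_score = None, -1.0
        for cluster in clusters:
            # Step 6: similarity to a class = max over its centers.
            score = max(sim(text, c) for c in cluster["centers"])
            if score > best_score:
                best, best_score = cluster, score
        if best is None or best_score < new:
            # Below the new-topic threshold: open a new class.
            clusters.append({"centers": [text], "members": [text],
                             "duplicates": []})
            continue
        best["members"].append(text)
        if best_score > dup:
            # Same content within the same topic.
            best["duplicates"].append(text)
        elif best_score <= center:
            # ASSUMPTION: 0.3 to 0.5 marks a further center of the topic.
            best["centers"].append(text)
    return clusters
```

In the usage implied by the text, a later document that resembles only the second center is still attracted to the class, which a single-center representation would miss.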
(5) Public opinion information analysis
The public opinion information analysis module performs OLAP multidimensional statistical analysis on the data mined in step c and stored in the public opinion information database. It analyzes public opinion evaluation indicators such as the attention paid to public opinion topic content and the sentiment tendency of public opinion topics, and so supports the relevant departments in grasping public opinion trends in time, publishing public opinion information at the right moment and making correct decisions.
The public opinion topic produced by collection, processing and mining analysis is expressed as T = (T_1, T_2, ..., T_n), where T_i is a text of the public opinion topic. The attention of a public opinion topic text is expressed as T_i = α N_p + β N_r. The attention metric formula of the public opinion topic is: where α and β are weights, N_p is the number of clicks on the public opinion topic text and N_r is the number of comments; Np_i is the number of clicks on the i-th public opinion topic text and Nr_i is the number of comments on the i-th public opinion topic text. Because N_p > N_r, α is set to 0.02 and β to 0.98 on the basis of statistics.
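The per-text attention can be computed directly from the stated weights. In the sketch below, `text_attention` follows T_i = α N_p + β N_r; `topic_attention` is an assumption, because the topic-level formula is not reproduced in this text: it simply sums the per-text values over Np_i and Nr_i. Both function names are hypothetical.

```python
ALPHA = 0.02  # weight of clicks, from the text
BETA = 0.98   # weight of comments, from the text


def text_attention(clicks, comments, alpha=ALPHA, beta=BETA):
    """Attention of one topic text: alpha * N_p + beta * N_r."""
    return alpha * clicks + beta * comments


def topic_attention(texts, alpha=ALPHA, beta=BETA):
    """Attention of a topic over its texts.

    texts: iterable of (clicks, comments) pairs, i.e. (Np_i, Nr_i).
    ASSUMPTION: the per-text values are summed; the source formula is
    not available, so the aggregation is illustrative only.
    """
    return sum(text_attention(c, r, alpha, beta) for c, r in texts)
```

For example, a text with 1000 clicks and 50 comments scores 0.02 × 1000 + 0.98 × 50 = 69, so comments dominate the score even though clicks are far more numerous.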
The sentiment tendency of a public opinion topic is described on the basis of the clustering analysis data of the public opinion topic texts. A threshold is set first; a text shows polarity (positive or negative) only when its tendency metric is greater than the threshold. If the tendency metric of the text is positive, the text is a positive comment; otherwise it is a negative comment.
After collection, preprocessing, information extraction, mining and analysis, detailed data on the public opinion topics is obtained. This data is processed according to the established public opinion indicator evaluation system, and the results of the processing support decision making.