CN104881458B - Method and device for annotating web page topics - Google Patents

Method and device for annotating web page topics

Publication number
CN104881458B
Authority
CN
China
Prior art keywords
text
title
feature vector
webpage
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510266108.XA
Other languages
Chinese (zh)
Other versions
CN104881458A (en)
Inventor
李扬曦
杜翠兰
李睿
佟玲玲
翟羽佳
王晶
刘洋
秦韬
付戈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center
Priority to CN201510266108.XA
Publication of CN104881458A
Application granted
Publication of CN104881458B
Legal status: Active
Anticipated expiration


Abstract

The invention discloses a method and a device for annotating web page topics. The method includes: obtaining a theme feature vector of a web page based on the title and body text of the web page; classifying the theme feature vector with a classifier obtained by prior training; judging whether a type to which the theme feature vector belongs exists; if so, labeling the web page with the type to which the theme feature vector belongs; if not, marking the web page as a web page to be marked. Further, the method clusters multiple web pages to be marked, analyzes the type of each cluster set, and labels each web page to be marked with the type of the cluster set to which it belongs. By cascading a supervised classification method with an unsupervised clustering method, the invention automatically obtains topics from web pages and labels the web pages, which effectively improves the efficiency and accuracy of web page topic annotation.

Description

Method and device for annotating web page topics
Technical field
The present invention relates to the technical field of data processing, and in particular to a method and a device for annotating web page topics.
Background
Extracting and annotating web page topics by analyzing the content of internet web pages is an important foundation for applications such as internet data management and data mining. At present, web page topic annotation mostly uses keyword matching, which labels a web page by matching it against preset web page titles and a set of keywords. This direct matching is too simple; moreover, if the keywords in a web page title change, the method can no longer label the topic accurately, so the accuracy of web page annotation cannot be guaranteed. Another approach clusters the web pages and extracts keywords from the web pages gathered into one class as the label for that class. However, clustering algorithms are time-consuming, so this kind of algorithm is impractical when many web pages have to be labeled, and the labeling accuracy obtained with an unsupervised learning algorithm alone is not high.
Summary of the invention
The present invention provides a method and a device for annotating web page topics, to solve the problem of low accuracy of web page topic annotation in the prior art.
To address this technical problem, the present invention adopts the following technical solutions.
The present invention provides a method for annotating web page topics, including: obtaining a theme feature vector of a web page based on the title and body text of the web page; classifying the theme feature vector with a classifier obtained by prior training; judging whether a type to which the theme feature vector belongs exists; if so, labeling the web page with the type to which the theme feature vector belongs; if not, marking the web page as a web page to be marked; further, clustering multiple web pages to be marked; analyzing the type of each cluster set; and labeling each web page to be marked with the type of the cluster set to which it belongs.
Obtaining the theme feature vector of the web page based on its title and body text includes: extracting the title and the body text of the web page separately; constructing a title feature vector from the title; constructing a text feature vector from the body text; and splicing the title feature vector and the text feature vector into the theme feature vector.
Constructing the title feature vector from the title includes: segmenting the title into words using a pre-built title dictionary to obtain title tokens (segmented words); mapping the title tokens into the title dictionary; and weighting the title dictionary based on the weight values of the title tokens to construct the title feature vector of the web page.
Constructing the text feature vector from the body text includes: segmenting the body text using a pre-built text dictionary to obtain multiple text tokens, and recording the order in which each text token appears in the body text; mapping the text tokens into the text dictionary; and weighting the text dictionary based on the weight value and order of appearance of each text token to construct the text feature vector of the web page.
Classifying the theme feature vector with the classifier obtained by prior training includes: predefining multiple web page types; for each type, having the classifier score the theme feature vector of the web page once; comparing the score for each type with a preset annotation threshold; and determining each type whose score is greater than the annotation threshold to be a type to which the theme feature vector belongs. The theme feature vector may belong to one or more types.
Analyzing the type of a cluster set includes: extracting the title and body text of each web page to be marked in the cluster set; segmenting all titles using the pre-built title dictionary to obtain multiple title tokens; segmenting all body texts using the pre-built text dictionary to obtain multiple text tokens; and, among the title tokens and the text tokens, taking the token with the highest frequency of occurrence as the type of the cluster set.
The present invention also provides a device for annotating web page topics, including: an obtaining module, configured to obtain the theme feature vector of a web page based on the title and body text of the web page; a classification module, configured to classify the theme feature vector with a classifier obtained by prior training; a judgment module, configured to judge whether a type to which the theme feature vector belongs exists; a labeling module, configured to label the web page with the type to which the theme feature vector belongs when the judgment module determines that such a type exists; a marking module, configured to mark the web page as a web page to be marked when the judgment module determines that no such type exists; a clustering module, configured to cluster multiple web pages to be marked; and an analysis module, configured to analyze the type of each cluster set. The labeling module is further configured to label each web page to be marked with the type of the cluster set to which it belongs.
The obtaining module includes: an extraction unit, configured to extract the title and the body text of the web page separately; a first construction unit, configured to construct the title feature vector from the title; a second construction unit, configured to construct the text feature vector from the body text; and a splicing unit, configured to splice the title feature vector and the text feature vector into the theme feature vector.
The first construction unit is specifically configured to: segment the title using the pre-built title dictionary to obtain title tokens; map the title tokens into the title dictionary; and weight the title dictionary based on the weight values of the title tokens to construct the title feature vector of the web page. The second construction unit is specifically configured to: segment the body text using the pre-built text dictionary to obtain multiple text tokens, and record the order in which each text token appears in the body text; map the text tokens into the text dictionary; and weight the text dictionary based on the weight value and order of appearance of each text token to construct the text feature vector of the web page.
The classification module is specifically configured to: predefine multiple web page types; call the classifier so that, for each type, the classifier scores the theme feature vector of the web page once; compare the score for each type with the preset annotation threshold; and determine each type whose score is greater than the annotation threshold to be a type to which the theme feature vector belongs, where the theme feature vector may belong to one or more types. The analysis module is specifically configured to: extract the title and body text of each web page to be marked in the cluster set; segment all titles using the pre-built title dictionary to obtain multiple title tokens; segment all body texts using the pre-built text dictionary to obtain multiple text tokens; and, among the title tokens and the text tokens, take the token with the highest frequency of occurrence as the type of the cluster set.
The present invention has the following beneficial effect:
By cascading a supervised classification method with an unsupervised clustering method, the present invention automatically obtains topics from web pages and labels the web pages, which effectively improves the efficiency and accuracy of web page topic annotation.
Brief description of the drawings
Fig. 1 is a flowchart of the method for annotating web page topics according to an embodiment of the invention;
Fig. 2 is a flowchart of the method for annotating web page topics according to another embodiment of the invention;
Fig. 3 is a flowchart of the steps of constructing the web page title feature vector according to an embodiment of the invention;
Fig. 4 is a flowchart of the steps of constructing the web page text feature vector according to an embodiment of the invention;
Fig. 5 is a schematic diagram of splicing the title feature vector and the text feature vector according to an embodiment of the invention;
Fig. 6 is a flowchart of the steps of classifying the theme feature vector according to an embodiment of the invention;
Fig. 7 is a structural diagram of the device for annotating web page topics according to an embodiment of the invention;
Fig. 8 is a structural diagram of the obtaining module according to an embodiment of the invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.
This embodiment provides a method for annotating web page topics. Fig. 1 is a flowchart of the method according to an embodiment of the invention. The steps of this embodiment are executed for each web page.
Step S110: obtain the theme feature vector of the web page based on the title and body text of the web page.
Because the title and the body text of a web page differ in length and wording style, this embodiment extracts the title and the body text of the web page separately; constructs a title feature vector from the title; constructs a text feature vector from the body text; and splices the title feature vector and the text feature vector into the theme feature vector of the web page. Both the title feature vector and the text feature vector contain word vectors that reflect the topic of the web page.
Constructing the feature vectors separately with different dictionaries describes the web page content more accurately and thereby improves the accuracy of web page topic annotation.
Step S120: classify the theme feature vector with a classifier obtained by prior training.
The classifier classifies the theme feature vector and determines its type. Because the theme feature vector reflects the topic of the web page, determining the type of the theme feature vector amounts to determining the type of the web page. The types include, for example, news, economy, entertainment, and science and technology.
To improve the accuracy of web page classification, this embodiment uses a supervised classification method: the classifier is obtained by training on a classification annotation system and training data prepared in advance.
The classification annotation system is the set of predefined web page types, for example news, economy, entertainment, and science and technology. The training data consists of multiple web pages whose types have already been labeled according to the classification annotation system. Based on the classification annotation system and the training data, the classifier is trained with a support vector machine.
Step S130: judge whether a type to which the theme feature vector belongs exists. If so, execute step S140; if not, execute step S150.
The judgment is made from the classification result of the classifier. If a type to which the theme feature vector belongs exists, the classification result is that type; if no such type exists, the classification result is a null value.
Step S140: label the web page with the type to which the theme feature vector belongs.
Step S150: mark the web page as a web page to be marked.
A web page whose type the classifier can determine is labeled with the corresponding type. A web page whose type the classifier cannot determine is put into the set of web pages to be marked and handled by the subsequent method, which safeguards the accuracy of web page labeling.
Fig. 2 is a flowchart of the method for annotating web page topics according to another embodiment of the invention. This embodiment describes the processing performed on web pages to be marked.
Step S210: cluster multiple web pages to be marked.
At every preset time period, the number of web pages marked as web pages to be marked is determined. If the number is greater than a preset quantity threshold, the web pages to be marked are clustered; if the number is less than or equal to the quantity threshold, the number is determined again after another preset time period.
This embodiment uses an unsupervised clustering method. During clustering, a preset similarity algorithm is used, for example the k-means algorithm, to compute the pairwise similarity of the web pages to be marked, and any two web pages to be marked whose similarity is greater than a preset similarity threshold are placed in the same cluster set.
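The pairwise grouping described for step S210 can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity stands in for the unspecified similarity measure, a union-find structure merges any pair above the threshold into one cluster set, and the names `cosine_similarity` and `cluster_by_similarity` are hypothetical. It is not the k-means procedure itself.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_by_similarity(vectors, threshold):
    """Return lists of indices; pairs more similar than `threshold` share a list."""
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_similarity(vectors[i], vectors[j]) > threshold:
                parent[find(i)] = find(j)  # merge the two cluster sets

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical theme feature vectors of three web pages to be marked.
pending = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
print(cluster_by_similarity(pending, threshold=0.8))
```

Because merging is transitive, two pages can end up in the same cluster set through a chain of similar pages even if they are not directly similar; whether that is desired depends on the chosen clustering algorithm.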
Step S220: analyze the type of each cluster set.
The Canopy algorithm can be used to analyze the type of each cluster set.
In one embodiment, the following steps can be executed for each cluster set: extract the title and body text of every web page to be marked in the cluster set; segment all titles using the title dictionary to obtain multiple title tokens; segment all body texts using the text dictionary to obtain multiple text tokens; and, among the title tokens and the text tokens, take the token with the highest frequency of occurrence as the type of the cluster set. The most frequent token may be a title token or a text token.
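A minimal sketch of the frequency-based type selection above, assuming the titles and body texts of the cluster set have already been segmented into token lists. The function name `cluster_type` and the sample tokens are hypothetical.

```python
from collections import Counter


def cluster_type(title_tokens_per_page, text_tokens_per_page):
    """Return the most frequent token over all titles and body texts, or None."""
    counts = Counter()
    for tokens in title_tokens_per_page:
        counts.update(tokens)
    for tokens in text_tokens_per_page:
        counts.update(tokens)
    if not counts:
        return None
    # most_common(1) yields a one-element list of (token, count).
    return counts.most_common(1)[0][0]


# Hypothetical cluster set with two web pages to be marked.
titles = [["football", "league"], ["football", "transfer"]]
texts = [["match", "football", "goal"], ["goal", "club"]]
print(cluster_type(titles, texts))
```

Title and text tokens are counted together here, so the winning token can come from either source, as the embodiment allows.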
Step S230: label each web page to be marked with the type of the cluster set to which it belongs.
In other words, whatever the type of the cluster set is, the web pages to be marked in that cluster set are labeled with that type.
In one embodiment, the classifier is retrained at regular intervals using the clustering results, to improve classification precision. Further, after labeling is complete, the new types obtained by clustering and the web pages of those new types can be added to the classification annotation system and the training data, so that subsequent training also covers the new types and their web pages.
Determining the type of a web page by combining a classifier with clustering improves the accuracy and standardization of web page labeling.
For step S110:
Fig. 3 is a flowchart of the steps of constructing the web page title feature vector according to an embodiment of the invention.
Step S310: construct the title dictionary in advance.
Step 1: collect web page titles to form a title corpus.
Step 2: segment the title texts in the title corpus into words, and retain only the qualifying words in the segmentation result, for example words that carry actual meaning. A preset word segmentation algorithm can be used; a segmentation algorithm generally includes a dictionary, with which the title text is split into one or more words.
Step 3: calculate the IDF (Inverse Document Frequency) value of each retained word, and form the title dictionary from the words whose IDF value is greater than a preset first IDF threshold. The larger the IDF value of a word, the more representative the word; the smaller the IDF value, the less representative.
The IDF value of a word w is calculated as follows:
$$\mathrm{IDF}(w) = \log\frac{N}{n_d} \tag{1.1}$$
In formula (1.1), $N$ is the number of titles collected in the entire corpus, and $n_d$ is the number of titles in which the word w occurs. log denotes the logarithm; its base is 10 or e, chosen according to need.
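Steps 1 to 3 of building the title dictionary can be sketched as follows. The sketch assumes the titles are already segmented into word lists and that formula (1.1) has the standard form log(N / n_d), which is inferred from the symbol definitions above. The function names, the sample corpus, and the threshold value are hypothetical.

```python
import math


def idf_values(tokenized_titles, base=math.e):
    """Map each word to log(N / n_d): N titles in total, n_d titles containing it."""
    n_titles = len(tokenized_titles)
    doc_freq = {}
    for tokens in tokenized_titles:
        for word in set(tokens):  # count each title at most once per word
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(n_titles / n_d, base) for w, n_d in doc_freq.items()}


def build_dictionary(tokenized_titles, idf_threshold, base=math.e):
    """Keep the words whose IDF value is greater than the threshold, sorted."""
    idf = idf_values(tokenized_titles, base)
    return sorted(w for w, value in idf.items() if value > idf_threshold)


# Hypothetical title corpus of four segmented titles.
corpus = [
    ["stock", "market", "news"],
    ["stock", "rally"],
    ["film", "news"],
    ["stock", "film"],
]
print(build_dictionary(corpus, idf_threshold=0.5))
```

In this toy corpus "stock" occurs in three of the four titles, so its IDF value is low and it is dropped, which matches the remark that words with small IDF values are less representative.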
Step S320: segment the title using the title dictionary to obtain title tokens.
Using the words in the title dictionary, the title is segmented into one or more title tokens.
Step S330: map the title tokens into the title dictionary.
The title tokens are each mapped into the title dictionary. The title dictionary contains multiple words, and a mapping relationship is established between each title token and the identical word in the title dictionary.
After the mapping relationships are established, a vector whose length equals the length of the title dictionary is obtained: the dimension of the vector equals the number of words in the title dictionary, and each dimension corresponds to one word in the dictionary.
Step S340: weight the title dictionary based on the weight values of the title tokens to construct the title feature vector of the web page.
Weighting the title dictionary means weighting the vector described above, whose length equals the length of the title dictionary. Each word in the title dictionary that has a mapping relationship with a title token is weighted in the vector with its TFIDF (term frequency-inverse document frequency) value; the vector obtained after weighting is the title feature vector. TFIDF is a weighting technique commonly used in information retrieval and information mining.
In the weighting, the value of each dimension of the vector is the TFIDF value, in the title, of the word corresponding to that dimension. The TFIDF value of a word w is calculated as follows:
$$\mathrm{TFIDF}(w) = \mathrm{TF}(w)\cdot\mathrm{IDF}(w) = \frac{c_w}{c}\cdot\mathrm{IDF}(w) \tag{1.2}$$
In formula (1.2), the IDF value is calculated as in formula (1.1); the TF value is the frequency with which the word w occurs in the current title; $c_w$ is the number of times w occurs in the current title; and $c$ is the number of words (tokens) in the current title.
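Steps S330 and S340 can be sketched together as below. The sketch assumes TFIDF(w) = (c_w / c) × IDF(w), with c counting every token of the current title, and takes the dictionary as an ordered word list with precomputed IDF values. The function name and the sample values are hypothetical.

```python
def title_feature_vector(title_tokens, dictionary, idf):
    """One dimension per dictionary word, holding that word's TFIDF in the title.

    dictionary: ordered list of words; idf: mapping from word to IDF value.
    """
    total = len(title_tokens)  # c in formula (1.2)
    vector = [0.0] * len(dictionary)
    if total == 0:
        return vector
    index = {word: i for i, word in enumerate(dictionary)}
    for word in title_tokens:
        if word in index:  # only mapped tokens contribute
            # Each occurrence adds IDF / c, so c_w occurrences give (c_w / c) * IDF.
            vector[index[word]] += idf[word] / total
    return vector


# Hypothetical dictionary and IDF values.
dictionary = ["film", "market", "news"]
idf = {"film": 1.0, "market": 2.0, "news": 0.5}
print(title_feature_vector(["market", "news", "market", "today"], dictionary, idf))
```

Dimensions whose word does not occur in the title stay at zero, so the vector always has the length of the title dictionary.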
Fig. 4 is a flowchart of the steps of constructing the web page text feature vector according to an embodiment of the invention.
Step S410: construct the text dictionary in advance.
Body texts are collected to form a text corpus. The body texts in the corpus are segmented, and only the qualifying words in the segmentation result are retained, for example words that carry actual meaning. The IDF value of each retained word is calculated, and the words whose IDF value is greater than a preset second IDF threshold form the text dictionary. The text dictionary is built in the same way as the title dictionary, and the IDF value is calculated according to formula (1.1).
Step S420: segment the body text using the constructed text dictionary to obtain multiple text tokens, and record the order in which each text token appears in the body text.
Using the words in the text dictionary, the body text is segmented. Following the body text from front to back, the order of appearance of each token (word) is recorded: the first token to appear is recorded as 1, the second as 2, and so on; a token that appears again is not recorded again.
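Recording the order of appearance can be sketched as follows. The sketch assumes that sequence numbers are assigned to distinct tokens in order of first occurrence, so repeated tokens do not consume a number; the function name and the sample tokens are hypothetical.

```python
def appearance_order(tokens):
    """Map each distinct token to the 1-based position of its first appearance."""
    order = {}
    for token in tokens:
        if token not in order:  # a repeated token is not recorded again
            order[token] = len(order) + 1
    return order


print(appearance_order(["breaking", "storm", "hits", "storm", "coast"]))
```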
Step S430: map the text tokens into the text dictionary.
The body text of a web page tends to open with a brief passage that highlights the topic and catches the eye; that is, important words tend to appear near the beginning of the body text.
The text dictionary contains multiple words, and a mapping relationship is established between each text token and the identical word in the text dictionary.
After the mapping relationships are established, a vector whose length equals the length of the text dictionary is obtained: the dimension of the vector equals the number of words in the text dictionary, and each dimension corresponds to one word in the dictionary.
Step S440: weight the text dictionary based on the weight value and order of appearance of each text token to construct the text feature vector of the web page.
Weighting the text dictionary means weighting the vector described above, whose length equals the length of the text dictionary. Each word in the text dictionary that has a mapping relationship with a text token is weighted in the vector using its TFIDF value and the order of appearance of the mapped text token; the vector obtained after weighting is the text feature vector. Each dimension of the text feature vector corresponds to one word in the dictionary, and the value of each dimension is the weighted value $\mathrm{weight}_{zw}$ obtained from the order of appearance of the corresponding word in the body text and the TFIDF value of that word:
In formula (1.3), $\mathrm{weight}_{zw}(w)$ is the weighted value (dimension value) of the word w in the text feature vector, $\mathrm{rank}(w)$ is the sequence number with which w appears in the body text, $\sum_{w \in W} \mathrm{rank}(w)$ is the sum of the sequence numbers of all words, and $\mathrm{TFIDF}(w)$ is calculated as in formula (1.2), with the title-related quantities replaced by the corresponding body-text quantities. The text feature vector is obtained with this method. Formula (1.3) uses the same symbol w for a word as formula (1.2) only to make the calculation of TFIDF(w) in formula (1.3) easier to follow.
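The expression of formula (1.3) itself is not reproduced in this text, so the sketch below rests on an assumption: it uses the hypothetical form weight_zw(w) = (1 − rank(w) / Σ rank(w)) × TFIDF(w), chosen only because it combines the listed symbols and gives earlier words a larger weight. The actual formula may differ. The function name and the sample values are likewise hypothetical.

```python
def text_feature_vector(text_tokens, dictionary, idf):
    """Text feature vector under an ASSUMED form of formula (1.3).

    dictionary: ordered list of words; idf: mapping from word to IDF value.
    """
    index = {word: i for i, word in enumerate(dictionary)}
    kept = [t for t in text_tokens if t in index]  # tokens mapped to the dictionary
    vector = [0.0] * len(dictionary)
    if not kept:
        return vector

    rank = {}
    for token in kept:
        rank.setdefault(token, len(rank) + 1)  # order of first appearance
    rank_sum = sum(rank.values())
    total = len(kept)

    for word, r in rank.items():
        tfidf = kept.count(word) / total * idf[word]
        # ASSUMPTION: position factor decreasing in rank; not taken from the source.
        position_factor = 1.0 - r / rank_sum
        vector[index[word]] = position_factor * tfidf
    return vector


# Hypothetical dictionary and IDF values.
dictionary = ["coast", "storm"]
idf = {"coast": 2.0, "storm": 1.0}
print(text_feature_vector(["storm", "coast", "storm", "the"], dictionary, idf))
```

One consequence of this assumed form is that a body text with a single distinct mapped word gets a weight of zero, which suggests the original formula may normalize differently.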
In general, a title states the content and topic of a web page in a brief sentence, so the title is short and the body text is long. This embodiment takes into account that the title feature vector is usually shorter than the text feature vector while being more important than it, and therefore proposes splicing the title feature vector and the text feature vector in a weighted manner into a feature vector that expresses the topic of the web page, i.e. the theme feature vector. The splicing is shown in Fig. 5. This avoids an imbalance between the contributions of the title feature vector and the text feature vector during learning.
Before splicing, the dimension value TFIDF(w) of each word w in the title feature vector is weighted with the title weight $w_{bt}$, that is:
$$\mathrm{weight}_{bt}(w) = w_{bt} \cdot \mathrm{TFIDF}(w) \tag{1.4}$$
Before splicing, no weight is applied to the dimension values of the words in the text feature vector.
During splicing, the weighted title feature vector and the unweighted text feature vector are spliced. This embodiment splices them end to end, forming a vector whose length equals the sum of the lengths of the title feature vector and the text feature vector, with the weighted title feature vector placed before the unweighted text feature vector.
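The weighted end-to-end splicing can be sketched as follows, applying formula (1.4) to the title part only. The function name `splice` and the sample vectors are hypothetical.

```python
def splice(title_vector, text_vector, w_bt):
    """Scale the title feature vector by w_bt, then append the text feature vector."""
    weighted_title = [w_bt * value for value in title_vector]  # formula (1.4)
    return weighted_title + list(text_vector)  # title part first, text part second


# Hypothetical 2-dimensional title vector and 3-dimensional text vector.
theme_vector = splice([0.5, 0.0], [0.1, 0.2, 0.3], w_bt=2.0)
print(theme_vector)
```

The result has length 2 + 3 = 5, the sum of the two input lengths, with the weighted title dimensions in front.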
This embodiment obtains $w_{bt}$ by grid search; the selection range of $w_{bt}$ is given by formula (1.5). For each candidate $w_{bt}$, the classifier is cross-validated on the training data and the classification accuracy is calculated; the $w_{bt}$ with the highest accuracy is taken as the final value of $w_{bt}$.
In formula (1.5), $N_{bt}$ is the dimension of the title feature vector and $N_{zw}$ is the dimension of the text feature vector.
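The grid search over the title weight can be sketched as below. The candidate range of formula (1.5) is not reproduced in this text, so the candidate list here is a placeholder, and `fake_accuracy` is a stand-in for the real cross-validation of the classifier on the training data. All names are hypothetical.

```python
def grid_search_title_weight(candidates, cross_validated_accuracy):
    """Return (best w_bt, its accuracy) over the candidate title weights."""
    best_weight, best_accuracy = None, float("-inf")
    for w_bt in candidates:
        accuracy = cross_validated_accuracy(w_bt)
        if accuracy > best_accuracy:  # keep the first candidate with the highest accuracy
            best_weight, best_accuracy = w_bt, accuracy
    return best_weight, best_accuracy


def fake_accuracy(w_bt):
    """Stand-in scorer that peaks at w_bt = 3; NOT a real cross-validation."""
    return 1.0 - abs(w_bt - 3) / 10


best, acc = grid_search_title_weight([1, 2, 3, 4, 5], fake_accuracy)
print(best, acc)
```

In a real setting the scorer would rebuild the theme feature vectors with the candidate weight, train the classifier on each training fold, and return the mean accuracy on the held-out folds.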
For step S120, specifically:
Fig. 6 is a flowchart of the steps of classifying the theme feature vector according to an embodiment of the invention.
Step S610: for each type, the classifier scores the theme feature vector of the web page once.
For each type, the theme feature vector of the web page receives one score; that is, if there are multiple types, there are multiple scores. A score measures whether the web page fits the type corresponding to that score.
The classifier includes multiple classifier functions, one for each type. Substituting the theme feature vector into each classifier function yields the score for each type.
For example, let a = [a1, a2, a3] be the classifier and y = a1*x1 + a2*x2 + a3*x3 be the classifier function for the news type; classifier functions for other types can of course exist as well. Substituting the title feature vector into the news classifier function gives the value y, i.e. the score. A score greater than 0 indicates that the web page corresponding to the title feature vector is of the news type; otherwise it is not. Suppose a = [1, -2, 3]. Substituting the 3-dimensional title feature vector x = [1, 2, 3] into the news classifier function gives y = 1*1 + (-2)*2 + 3*3 = 6, so y > 0 and the web page corresponding to x = [1, 2, 3] is a news web page.
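The worked example can be checked with a few lines of code. The function name `score` is hypothetical; the coefficients and the feature vector are the ones given in the example.

```python
def score(weights, features):
    """Linear classifier function: y = a1*x1 + a2*x2 + ... + an*xn."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(a * x for a, x in zip(weights, features))


a = [1, -2, 3]  # coefficients of the news classifier function
x = [1, 2, 3]   # feature vector from the example
y = score(a, x)
print(y, "news" if y > 0 else "not news")
```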
Step S620: compare the score for each type with the preset annotation threshold.
Step S630: determine each type whose score is greater than the annotation threshold to be a type to which the theme feature vector belongs. The theme feature vector may belong to one or more types.
Specifically, the scores can be sorted in descending order. First judge whether the largest score is greater than the preset annotation threshold; if so, label the web page with the type corresponding to the largest score; if not, mark the web page as a web page to be marked. Then judge whether the second-largest score is greater than the preset annotation threshold; if so, label the web page with the type corresponding to the second-largest score; if not, mark the web page as a web page to be marked. Continue in this way until every score has been compared with the annotation threshold.
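Steps S620 and S630 can be sketched as follows. The sketch makes one interpretive assumption: a web page counts as a web page to be marked only when no score exceeds the annotation threshold, which is signalled by an empty result. The function name, type names, and score values are hypothetical.

```python
def assign_types(scores, threshold):
    """Return the types whose score exceeds the threshold, highest score first.

    scores: mapping from type name to score. An empty list means the page
    should be marked as a web page to be marked (ASSUMED reading of the text).
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [page_type for page_type, value in ranked if value > threshold]


# Hypothetical per-type scores for two web pages.
print(assign_types({"news": 1.4, "economy": 0.3, "entertainment": -0.8}, 0.0))
print(assign_types({"news": -0.2, "economy": -1.1}, 0.0))
```

Returning a list rather than a single type reflects the statement that a theme feature vector may belong to one or more types.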
The present invention also provides a kind of annotation equipments of Web page subject, as shown in fig. 7, for according to one embodiment of the inventionThe structure chart of the annotation equipment of Web page subject.
The device includes:
Module 710 is obtained, web-based title and text is used for, obtains the theme feature vector of webpage.
Categorization module 720, for carrying out classification processing to theme feature vector using the classifier that training obtains in advance.
Judgment module 730, for judging whether there is type belonging to theme feature vector.
Labeling module 740, for determining in judgment module there are in the case where type belonging to theme feature vector, by netPage is labeled as type belonging to theme feature vector.
Mark module 750, for determining to incite somebody to action there is no in the case where type belonging to theme feature vector in judgment moduleWeb Page Tags are webpage to be marked.
Cluster module 760, for carrying out clustering processing to multiple webpages to be marked.
Analysis module 770, for analyzing the type of each cluster set.
Labeling module 780 is also used to the type by webpage label to be marked for the cluster set belonging to it.
In one embodiment, obtain module 710 include: extraction unit 711, for extract respectively the title in webpage andText;First construction unit 712, for constructing title feature vector according to title;Second construction unit 713, for according to justText constructs text feature vector;Concatenation unit 714, for title feature vector sum text feature vector to be spliced the spy that is the themeLevy vector.As shown in Figure 8.
First construction unit 712 is used for: using the title dictionary constructed in advance, being carried out word segmentation processing to title, is markedTopic participle;Title participle is mapped in title dictionary;Based on the weighted value of title participle, place is weighted to title dictionaryReason, constructs the title feature vector of webpage.
Second construction unit 713 is used for: using the text dictionary constructed in advance, being carried out word segmentation processing to text, is obtained moreA text participle, and record the appearance sequence of each text participle in the body of the email;Multiple texts are respectively mapped to textIn dictionary;Weighted value and appearance sequence based on each text participle, are weighted processing to text dictionary, are constructing webpage justLiterary feature vector.
In another embodiment, the categorization module 720 is specifically configured to: predefine multiple webpage types; call the classifier so that it scores the theme feature vector of the webpage once for each type; compare the score for each type with a preset labeling threshold; and determine each type whose score exceeds the labeling threshold to be a type to which the theme feature vector belongs. The theme feature vector may belong to one or more types.
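The per-type scoring and threshold comparison can be sketched as below. The linear scorer (a dot product with a per-type weight vector) is only an assumed stand-in for the pre-trained classifier, and the type names, weights and threshold are hypothetical values chosen for the example.

```python
def score(theme_vector, type_weights):
    """Assumed scorer: dot product of the theme vector with a per-type weight vector."""
    return sum(x * w for x, w in zip(theme_vector, type_weights))


def classify(theme_vector, type_models, threshold):
    """Score the vector once per predefined type; keep every type above the threshold."""
    return [
        type_name
        for type_name, type_weights in type_models.items()
        if score(theme_vector, type_weights) > threshold
    ]


# Hypothetical predefined types and model weights.
type_models = {
    "sports": [0.5, 0.0, 0.1],
    "finance": [0.0, 1.0, 0.0],
    "tech": [0.2, 0.0, 0.3],
}
types = classify([1.0, 0.0, 2.0], type_models, threshold=0.6)
```

Because each type is compared with the threshold independently, the result may contain several types, or none, in which case the webpage would go on to the clustering stage.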
In another embodiment, the analysis module 770 is specifically configured to: extract the title and text of each webpage to be marked in the cluster set; perform word segmentation on all the titles using the pre-built title dictionary to obtain multiple title word segments; perform word segmentation on all the texts using the pre-built text dictionary to obtain multiple text word segments; and take the most frequent word segment among the title word segments and text word segments as the type of the cluster set.
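Choosing the most frequent word segment as the cluster type amounts to a frequency count, sketched here with the standard library. The input is assumed to be already segmented, and the function name `cluster_type` and the sample pages are hypothetical; how ties are broken is not stated in the passage, so the example avoids them.

```python
from collections import Counter


def cluster_type(pages):
    """pages: list of (title_segments, text_segments) for one cluster set.
    Returns the most frequent segment, or None for an empty cluster."""
    counts = Counter()
    for title_segments, text_segments in pages:
        counts.update(title_segments)
        counts.update(text_segments)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical cluster of two already-segmented webpages.
cluster_pages = [
    (["football", "final"], ["football", "goal", "team"]),
    (["league", "football"], ["team", "score"]),
]
type_of_cluster = cluster_type(cluster_pages)
```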
The functions of the device described in this embodiment have already been described in the method embodiments shown in Fig. 1 to Fig. 6. For details not covered in the description of this embodiment, refer to the related description in the foregoing embodiments, which is not repeated here.
Although preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will recognize that various improvements, additions, and substitutions are also possible; therefore, the scope of the present invention should not be limited to the above embodiments.

Claims (7)

CN201510266108.XA | 2015-05-22 | 2015-05-22 | A kind of mask method and device of Web page subject | Active | CN104881458B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201510266108.XA CN104881458B (en) | 2015-05-22 | 2015-05-22 | A kind of mask method and device of Web page subject

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201510266108.XA CN104881458B (en) | 2015-05-22 | 2015-05-22 | A kind of mask method and device of Web page subject

Publications (2)

Publication Number | Publication Date
CN104881458A (en) | 2015-09-02
CN104881458B (en) | 2019-05-28

Family

ID=53948951

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201510266108.XA Active CN104881458B (en) | A kind of mask method and device of Web page subject | 2015-05-22 | 2015-05-22

Country Status (1)

Country | Link
CN (1) | CN104881458B (en)

Also Published As

Publication number | Publication date
CN104881458A (en) | 2015-09-02

Legal Events

Code | Title
C06 | Publication
PB01 | Publication
EXSB | Decision made by SIPO to initiate substantive examination
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
GR01 | Patent grant
