CN110347931A - The detection method and device of the new chapters and sections of article - Google Patents

The detection method and device of the new chapters and sections of article

Info

Publication number
CN110347931A
Authority
CN
China
Prior art keywords
candidate word
article
sections
term vector
chapters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910649833.3A
Other languages
Chinese (zh)
Inventor
蔡兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910649833.3A
Publication of CN110347931A
Legal status: Pending


Abstract

The invention discloses a method and a device for detecting a new chapter of an article, belonging to the field of Internet technology. The method includes: determining a first theme term vector of the detected chapters of an article, the first theme term vector identifying the content of the detected chapters; determining a second theme term vector of a new chapter of the article, the second theme term vector identifying the content of the new chapter; calculating the similarity between the first theme term vector and the second theme term vector; and judging, according to how the similarity compares with a preset similarity threshold, whether the new chapter is a false chapter of the article. With this technical solution, online identification takes only milliseconds and does not affect chapter push speed, so a new chapter that is a valid chapter can be pushed promptly, which effectively preserves the push efficiency for new chapters.

Description

The detection method and device of the new chapters and sections of article
This application is a divisional application of the invention patent application No. 201310223253.0, filed on June 6, 2013 and entitled "Detection method and device for new chapters of an article".
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and a device for detecting a new chapter of an article.
Background art
With the development of Internet technology, more and more people carry out all kinds of activities over the Internet; for example, people can read serialized articles online.
In the prior art, the growing popularity of online literature has given rise to more and more article websites. According to incomplete statistics, the number of small and medium-sized article websites has reached the hundreds of thousands, and their quality varies widely. Such sites frequently steal content, or even fabricate false new chapters to cheat users into clicking, which harms the user experience. An article aggregation platform, after crawling the new-chapter data from these websites, reviews the new chapters manually so that false new chapters are identified and filtered out in time and users are offered higher-quality articles. This review is an important link in improving the quality of the aggregation platform and optimizing the users' reading experience.
In implementing the present invention, the inventor found that the prior art has at least the following problem: reviewing new chapters manually takes a long time, so new chapters cannot be pushed promptly.
Summary of the invention
To solve the problems in the prior art, embodiments of the present invention provide a method and a device for detecting a new chapter of an article. The technical solution is as follows:
In one aspect, a method for detecting a new chapter of an article is provided for use on an article aggregation platform. The method comprises:
splitting the text of a new chapter of an article to obtain multiple candidate words;
calculating the weight of each of the candidate words;
generating a second theme term vector according to the candidate words and the weight of each candidate word, the second theme term vector being the theme term vector of the new chapter;
calculating the similarity between the second theme term vector and a first theme term vector, the first theme term vector being the theme term vector of the detected chapters of the article;
when the similarity is less than a preset similarity threshold, determining that the new chapter is a false chapter of the article.
In another aspect, a device for detecting a new chapter of an article is provided. The device comprises:
a splitting unit, configured to split the text of a new chapter of an article to obtain multiple candidate words;
a calculation unit, configured to calculate the weight of each of the candidate words;
a generation unit, configured to generate a second theme term vector according to the candidate words and the weight of each candidate word, the second theme term vector being the theme term vector of the new chapter;
a computing module, configured to calculate the similarity between the second theme term vector and a first theme term vector, the first theme term vector being the theme term vector of the detected chapters of the article;
a judgment module, configured to determine, when the similarity is less than a preset similarity threshold, that the new chapter is a false chapter of the article.
In yet another aspect, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the above method for detecting a new chapter of an article.
In the method and device for detecting a new chapter of an article according to embodiments of the present invention, a first theme term vector of the detected chapters of the article is determined, the first theme term vector identifying the content of the detected chapters; a second theme term vector of the new chapter of the article is determined, the second theme term vector identifying the content of the new chapter; the similarity between the first theme term vector and the second theme term vector is calculated; and whether the new chapter is a false chapter of the article is judged according to how the similarity compares with a preset similarity threshold. With this technical solution, the whole detection flow for a new chapter requires no manual intervention and costs very little: manual review of new chapters is avoided, which effectively saves labor cost. Moreover, by automatically analysing the detected chapters and the new chapter of the article, the solution can accurately determine whether the new chapter is a false chapter. Online identification takes only milliseconds and does not affect chapter push speed, so a new chapter that is a valid chapter can be pushed promptly, which effectively preserves the push efficiency for new chapters.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for detecting a new chapter of an article according to one embodiment of the invention.
Fig. 2 is a flowchart of a method for detecting a new chapter of an article according to another embodiment of the invention.
Fig. 3 is a structural diagram of a device for detecting a new chapter of an article according to one embodiment of the invention.
Fig. 4 is a structural diagram of a device for detecting a new chapter of an article according to another embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, embodiments of the invention are described in further detail below with reference to the drawings.
Fig. 1 is a flowchart of a method for detecting a new chapter of an article according to one embodiment of the invention. As shown in Fig. 1, the method of this embodiment may specifically include the following steps:
100. Determine the first theme term vector of the detected chapters of the article;
The first theme term vector identifies the content of the detected chapters of the article. In this embodiment the detected chapters are the chapters of the article that have already been determined to be valid, i.e. chapters that the method of this embodiment has determined to be valid chapters. Note that for the first chapter of an article no detected chapter exists yet, so the method of this embodiment cannot be applied; the first chapter can instead be reviewed manually to decide whether it is a valid chapter.
For example, determining the first theme term vector of the detected chapters can be understood as training on the detected chapters to extract the first theme term vector.
101. Determine the second theme term vector of the new chapter of the article;
The second theme term vector identifies the content of the new chapter of the article.
In this embodiment, the specific implementation of step 101, "determining the second theme term vector of the new chapter of the article", can be the same as that of step 100, "determining the first theme term vector of the detected chapters of the article". For example, determining the second theme term vector of the new chapter can be understood as training on the new chapter to extract the second theme term vector. Preferably, in this embodiment the second theme term vector contains the same number of theme terms as the first theme term vector.
102. Calculate the similarity between the first theme term vector and the second theme term vector;
103. Judge, according to how the similarity compares with a preset similarity threshold, whether the new chapter is a false chapter of the article.
The method of this embodiment may be executed by a device for detecting a new chapter of an article. For example, the device may be deployed in an article aggregation platform.
In the method of this embodiment, a first theme term vector of the detected chapters of the article is determined, the first theme term vector identifying the content of the detected chapters; a second theme term vector of the new chapter is determined, the second theme term vector identifying the content of the new chapter; the similarity between the first theme term vector and the second theme term vector is calculated; and whether the new chapter is a false chapter of the article is judged according to how the similarity compares with a preset similarity threshold. With the technical solution of this embodiment, the whole detection flow for a new chapter requires no manual intervention and costs very little: manual review of new chapters is avoided, which effectively saves labor cost. Moreover, by automatically analysing the detected chapters and the new chapter of the article, the solution can accurately determine whether the new chapter is a false chapter. Online identification takes only milliseconds and does not affect chapter push speed, so a new chapter that is a valid chapter can be pushed promptly, which effectively preserves the push efficiency for new chapters.
Optionally, on the basis of the technical solution of the embodiment shown in Fig. 1, step 100, "determining the first theme term vector of the detected chapters of the article", may specifically include the following steps:
(1) split the text of the detected chapters of the article to obtain multiple candidate words;
(2) calculate the weight of each of the candidate words;
(3) generate the first theme term vector according to the candidate words and the weight of each candidate word.
For example, step (2), "calculating the weight of each of the candidate words", may specifically include: calculating, for each candidate word, its length, its frequency of occurrence in the article, the entropy of its left adjacent character set and the entropy of its right adjacent character set; and calculating the weight of each candidate word from these four quantities. The left adjacent character set of a word is the set of characters that appear immediately to the left of the word in a passage; the right adjacent character set is the set of characters that appear immediately to its right. Take the sentence "sees their appearance, thinks that they especially feel bad, also wish for them good fortune": for the candidate word "they", the left adjacent character set = {see, obtain, be} and the right adjacent character set = {, special, wish}. For how the left and right adjacent character sets are determined in detail, refer to the related art; details are not repeated here.
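As a rough illustration of how the left and right adjacent character sets described above might be collected, the following Python sketch scans a text for every occurrence of a candidate word and counts the characters immediately before and after it. The function name `adjacent_characters` and the sample string are hypothetical and do not come from the patent.

```python
from collections import Counter


def adjacent_characters(text, word):
    """Count the characters immediately left and right of each occurrence of `word`."""
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        if start > 0:
            left[text[start - 1]] += 1   # character just before this occurrence
        if end < len(text):
            right[text[end]] += 1        # character just after this occurrence
        start = text.find(word, start + 1)
    return left, right


# Hypothetical sample: "ab" occurs three times in this string.
left, right = adjacent_characters("xabyzabwab", "ab")
```

Counts rather than plain sets are kept because the entropy used for the weight depends on how often each neighbouring character occurs.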
Still optionally, the weight of each candidate word is calculated from its length, its frequency of occurrence in the article, the entropy of its left adjacent character set and the entropy of its right adjacent character set using a formula (not reproduced in this text),
where W is the weight of the candidate word, TF is the frequency with which the candidate word occurs in the article, Ha is the entropy of the left adjacent character set, Hb is the entropy of the right adjacent character set, and L is the length of the candidate word.
Still optionally, step (3), "generating the first theme term vector according to the candidate words and the weight of each candidate word", may specifically include: taking M candidate words from the candidate words in descending order of weight to generate the first theme term vector. The number M of theme terms contained in the second theme term vector and in the first theme term vector can be chosen according to the actual situation; for example, the Top10 candidate words with the highest weights may be taken, or the Top100 or Top200.
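The selection in step (3) can be sketched in a few lines of Python. This is a minimal illustration that assumes the weights have already been computed and are held in a dictionary; the function name `top_m_candidates` and the sample weights are hypothetical.

```python
def top_m_candidates(weights, m):
    """Return the m candidate words with the highest weights as (word, weight) pairs."""
    ranked = sorted(weights.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:m]


# Hypothetical weights for four candidate words.
weights = {"ab": 0.9, "bc": 0.2, "cd": 0.5, "abc": 0.7}
first_theme_term_vector = top_m_candidates(weights, m=2)
```

Keeping the weight next to each word preserves the information a later similarity calculation would need.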
For example, the length of a candidate word is between 2 and 5 Chinese characters. The string "abcd", for instance, can be split into the candidate words "ab", "bc", "cd", "abc", "bcd" and "abcd". For each candidate word, its frequency of occurrence in the article, its length, the entropy of its left adjacent character set and the entropy of its right adjacent character set are counted; the larger the entropy, the more important the candidate word. Finally, the weight of each candidate word is calculated with the weight formula (not reproduced in this text), the candidate words are sorted by weight from high to low, and, for example, the TOP500 candidate words with the highest weights form the first theme term vector of the article. The entropy is computed as H = -Σ p·log p, where p is the probability of each character within the character set. For example, if the left adjacent character set of a candidate word is {a, a, b, c}, then p(a) = 1/2 and p(b) = p(c) = 1/4, so its entropy is H = -(1/2·log(1/2) + 1/4·log(1/4) + 1/4·log(1/4)). Clearly, the larger the entropy, the more independent the candidate string, and the more likely it is to be a theme term of the article. Table 1, for example, lists the 10 candidate words with the highest weights computed for a certain article; they are mainly character names, organizations and the like from the article, and are clearly distinctive. In practice, these 10 highest-weight candidate words can be used as the first theme term vector of the article.
Table 1
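The splitting rule and the entropy calculation described above can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: the function names are hypothetical, the entropy is taken in the usual Shannon form (summing -p·log p over the distinct characters), and the natural logarithm is an assumption because the text does not state the base.

```python
import math
from collections import Counter


def split_candidates(text, min_len=2, max_len=5):
    """Enumerate every substring of length min_len..max_len as a candidate word."""
    return [
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    ]


def entropy(characters):
    """Shannon entropy of a collection of adjacent characters (natural log assumed)."""
    counts = Counter(characters)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


candidates = split_candidates("abcd")
h_left = entropy(["a", "a", "b", "c"])  # the {a, a, b, c} example from the text
```

With these inputs, `candidates` reproduces the six candidate words listed for "abcd", and `h_left` equals 1.5·ln 2, since p(a) = 1/2 and p(b) = p(c) = 1/4.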
Still optionally, on the basis of the technical solution of the above embodiment, after step (2), "calculating the weight of each of the candidate words", and before step (3), "generating the first theme term vector according to the candidate words and the weight of each candidate word", the method may further include the following steps:
(a) count the document frequency of each of the candidate words;
The document frequency in this embodiment is the number of articles, among the N articles in the article pool, in which the candidate word occurs. For example, if the article pool holds 100 articles and the word x occurs in the candidate theme term vectors of 20 of them, its document frequency is DF = 20. The larger the document frequency DF of a theme term, the less distinctive the word is, and hence the less important it is to any particular article. Conversely, if a theme term has document frequency DF = 1, i.e. it occurs in the theme term vector of only one article, the word is very likely exclusive to that article and highly distinctive.
(c) update the weight of each of the candidate words according to its document frequency and the number N of articles in the article pool.
For example, the weight of each candidate word can be updated with the following formula:
W = W * log(N / DF), where W is the weight of the candidate word, DF is its document frequency and N is the number of articles in the article pool.
After step (c), the candidate words can be re-sorted by their updated weights, and the top M (for example, the TOP200) are selected as the final first theme term vector of each article.
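The update W = W * log(N / DF) and the subsequent re-sorting can be illustrated with a short Python sketch. The function name `update_weights`, the sample weights and the use of the natural logarithm are assumptions made for illustration; the text does not state the base of the logarithm.

```python
import math


def update_weights(weights, document_frequency, n_articles):
    """Apply W = W * log(N / DF) to every candidate word."""
    return {
        word: weight * math.log(n_articles / document_frequency[word])
        for word, weight in weights.items()
    }


# Hypothetical pool of N = 100 articles: "x" occurs in 20 of them, "y" in only 1.
weights = {"x": 2.0, "y": 2.0}
document_frequency = {"x": 20, "y": 1}
updated = update_weights(weights, document_frequency, n_articles=100)

# Re-sort by updated weight, highest first.
ranked = sorted(updated, key=updated.get, reverse=True)
```

Both words start with the same weight, but "y" ends up ranked first because it appears in only one article of the pool, which matches the reasoning about distinctiveness given above.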
Note that the above embodiments all explain how the first theme term vector is determined. The second theme term vector is determined in the same way; refer to the description of the above embodiments for details, which are not repeated here.
Still optionally, on the basis of the technical solution of the embodiment shown in Fig. 1, step 102, "calculating the similarity between the first theme term vector and the second theme term vector", may specifically include calculating the similarity with the following formula (not reproduced in this text):
where D denotes the first theme term vector and D_i denotes the i-th theme term in the first theme term vector; Q denotes the second theme term vector and Q_i denotes the i-th theme term in the second theme term vector; M denotes the number of theme terms that each of the first and second theme term vectors contains; and sim(D, Q) denotes the similarity between the first theme term vector and the second theme term vector. The value of sim(D, Q) lies between 0 and 1, and a larger value indicates that the two vectors are more similar.
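Because the similarity formula itself is not legible in this text, the sketch below uses cosine similarity purely as an assumed stand-in: it is one common measure that stays between 0 and 1 when all weights are non-negative, but it is not confirmed to be the formula of the patent. The dictionary representation (theme term mapped to weight) and the sample terms are likewise hypothetical.

```python
import math


def cosine_similarity(d, q):
    """Cosine similarity of two theme term vectors given as {term: weight} dicts."""
    shared = set(d) & set(q)
    dot = sum(d[term] * q[term] for term in shared)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector is treated as sharing nothing
    return dot / (norm_d * norm_q)


# Hypothetical vectors: one identical to the detected chapters, one sharing no terms.
detected = {"hero": 3.0, "sect": 4.0}
same_story = {"hero": 3.0, "sect": 4.0}
unrelated = {"bank": 1.0, "loan": 2.0}

sim_same = cosine_similarity(detected, same_story)
sim_unrelated = cosine_similarity(detected, unrelated)
```

Under this assumption, a chapter that shares its theme terms with the detected chapters scores close to 1, while a chapter with no terms in common scores 0.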
Still optionally, on the basis of the technical solution of the embodiment shown in Fig. 1, step 103, "judging, according to how the similarity compares with a preset similarity threshold, whether the new chapter is a false chapter of the article", may specifically include: when the similarity is greater than or equal to the preset similarity threshold, determining that the new chapter is a valid chapter of the article; when the similarity is less than the preset similarity threshold, determining that the new chapter is a false chapter of the article.
Still optionally, on the basis of the technical solution of the above embodiment, after the new chapter is determined to be a false chapter of the article, the method may further include filtering out the new chapter. That is, the false new chapter is not shown to users of the article aggregation platform, which improves the article quality of the platform and the user experience.
All the optional technical solutions above can be combined in any manner to form alternative embodiments of the invention, which are not described one by one here.
In the method of the above embodiment, the whole detection flow for a new chapter requires no manual intervention and costs very little: manual review of new chapters is avoided, which effectively saves labor cost. Moreover, by automatically analysing the detected chapters and the new chapter of the article, the technical solution of the embodiment can accurately determine whether the new chapter is a false chapter. Online identification takes only milliseconds and does not affect chapter push speed, so a new chapter that is a valid chapter can be pushed promptly, which effectively preserves the push efficiency for new chapters.
Fig. 2 is a flowchart of a method for detecting a new chapter of an article according to another embodiment of the invention. Building on Fig. 1 and its alternative embodiments, this embodiment describes the technical solution of the invention in further detail. As shown in Fig. 2, the method of this embodiment may specifically include the following steps:
200. Split the text of the detected chapters of the article to obtain multiple candidate words;
201. Calculate, for each of the candidate words, its length, its frequency of occurrence in the article, the entropy of its left adjacent character set and the entropy of its right adjacent character set;
202. Calculate the weight of each candidate word from its length, its frequency of occurrence in the article, the entropy of its left adjacent character set and the entropy of its right adjacent character set;
The corresponding techniques of the above embodiment can be used here; details are not repeated.
203. Count the document frequency of each of the candidate words;
204. Update the weight of each of the candidate words according to its document frequency and the number N of articles in the article pool, using the following formula;
The weight of each candidate word is updated according to the formula W = W * log(N / DF), where W is the weight of the candidate word and DF is its document frequency. The W on the left of the equals sign is the updated weight of the candidate word; the W on the right is the weight calculated in step 202, i.e. the weight before the update.
205. From the candidate words, take the Top200 in descending order of weight to generate the first theme term vector;
206. Determine the second theme term vector of the new chapter of the article;
The second theme term vector of the new chapter is determined in the same way as the first theme term vector in steps 200-205; refer to those steps for details, which are not repeated here. Note that the first theme term vector and the second theme term vector contain the same number of theme terms.
207. Calculate the similarity between the first theme term vector and the second theme term vector using the following formula (not reproduced in this text):
where D denotes the first theme term vector and D_i denotes the i-th theme term in the first theme term vector; Q denotes the second theme term vector and Q_i denotes the i-th theme term in the second theme term vector; M denotes the number of theme terms that each of the first and second theme term vectors contains; and sim(D, Q) denotes the similarity between the first theme term vector and the second theme term vector. The value of sim(D, Q) lies between 0 and 1, and a larger value indicates that the two vectors are more similar.
208. Judge whether the similarity is greater than or equal to the preset similarity threshold T; if it is, execute step 209; otherwise execute step 210;
209. Determine that the new chapter is a valid chapter of the article;
210. Determine that the new chapter is a false chapter of the article, and execute step 211;
211. Filter out the new chapter of the article.
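The decision in steps 208 to 211 can be sketched as follows, assuming the similarity of each new chapter has already been computed. The function names, the chapter labels and the threshold value T = 0.1 are hypothetical and chosen only for illustration; the text does not give a concrete threshold.

```python
def classify_new_chapter(similarity, threshold):
    """Steps 208-210: a chapter is valid when similarity >= T, otherwise false."""
    return "valid" if similarity >= threshold else "false"


def filter_false_chapters(chapters, threshold):
    """Step 211: keep only the chapters classified as valid."""
    return [
        name
        for name, similarity in chapters
        if classify_new_chapter(similarity, threshold) == "valid"
    ]


# Hypothetical (chapter, similarity) pairs and an illustrative threshold T = 0.1.
chapters = [("chapter_12", 0.35), ("chapter_13", 0.02)]
kept = filter_false_chapters(chapters, threshold=0.1)
```

The comparison uses "greater than or equal to", so a chapter whose similarity equals T exactly is treated as valid, as in step 208.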
For example, Table 2 below shows some information for the article named novel_tiancaixiangshi. The second column is an article name, the third column is a chapter drawn from various articles, and the first column is the similarity between columns 2 and 3. The first row represents a detected chapter from this article, and the eighth row represents a false chapter. Only the chapter in the first row, which belongs to this article, has a similarity greater than 0.3 with the article vector of column 2; the chapters from other articles and the false chapter all have similarities below 0.05, so valid chapters can be distinguished from false chapters very accurately.
Table 2
In the method of this embodiment, the whole detection flow for a new chapter requires no manual intervention and costs very little: manual review of new chapters is avoided, which effectively saves labor cost. Moreover, by automatically analysing the detected chapters and the new chapter of the article, the technical solution of the embodiment can accurately determine whether the new chapter is a false chapter. Online identification takes only milliseconds and does not affect chapter push speed, so a new chapter that is a valid chapter can be pushed promptly, which effectively preserves the push efficiency for new chapters.
Fig. 3 is the structural schematic diagram of the detection device for the new chapters and sections of article that one embodiment of the invention provides.As shown in figure 3,The detection device of the new chapters and sections of the article of the present embodiment includes: the first determining module 10, the second determining module 11,12 and of computing moduleJudgment module 13.
Wherein the first determining module 10 is used to determine the first theme term vector for having detected chapters and sections of article;First themeTerm vector is used to identify the content for having detected chapters and sections of article;Second determining module 11 is used to determine the second of the new chapters and sections of articleTheme term vector;The second theme term vector is used to identify the content of the new chapters and sections of article;Computing module 12 is true with first respectivelyCover half block 10 and the connection of the second determining module 11, computing module 12 are used to calculate the first descriptor that the first determining module 10 determinesThe similarity for the second theme term vector that the second determining module of vector sum 11 determines;Judgment module 13 is connect with computing module 12,The size relation of similarity and default similarity threshold that judgment module 13 is used to be calculated according to computing module 12, judgement are newChapters and sections whether be article false chapters and sections.
The device of this embodiment uses the above modules to detect a new chapter of an article by the same mechanism as the related method embodiments described above; for details, refer to the description of those embodiments, which is not repeated here.
By using the above modules, the device of this embodiment determines a first topic word vector of the detected chapters of the article, the first topic word vector identifying the content of the detected chapters; determines a second topic word vector of the new chapter of the article, the second topic word vector identifying the content of the new chapter; calculates the similarity between the first topic word vector and the second topic word vector; and judges, according to the magnitude relationship between the similarity and a preset similarity threshold, whether the new chapter is a fake chapter of the article. With the technical solution of this embodiment, the entire detection flow requires no manual intervention and costs very little; because new chapters no longer need to be reviewed manually, labor costs are effectively reduced. Moreover, by intelligently analyzing the detected chapters and the new chapter of the article, the technical solution of this embodiment can accurately determine whether the new chapter is a fake chapter. Online recognition takes only milliseconds and does not affect the chapter push speed at all, so a new chapter that is a valid chapter can be pushed promptly, which effectively guarantees the push efficiency for new chapters of the article.
Fig. 4 is a schematic structural diagram of a device for detecting a new chapter of an article according to another embodiment of the present invention. As shown in Fig. 4, the device of this embodiment builds on the embodiment shown in Fig. 3 and further includes the following technical solutions.
As shown in Fig. 4, the first determining module 10 in the device of this embodiment includes a splitting unit 101, a computing unit 102 and a generation unit 103.
The splitting unit 101 is configured to split the text of the detected chapters of the article to obtain multiple candidate words. The computing unit 102 is connected to the splitting unit 101 and is configured to calculate the weight of each of the candidate words obtained by the splitting unit 101. The generation unit 103 is connected to the splitting unit 101 and the computing unit 102, and is configured to generate the first topic word vector from the candidate words obtained by the splitting unit 101 and the weight of each candidate word calculated by the computing unit 102.
Further optionally, in the device of this embodiment, the computing unit 102 is specifically configured to calculate, for each of the candidate words obtained by the splitting unit 101, its length, the frequency with which it occurs in the article, the entropy of its left-adjacent character set and the entropy of its right-adjacent character set, and then to calculate the weight of each candidate word from these four quantities.
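The four per-candidate quantities named above can be sketched as follows. This is a minimal illustration, not the implementation described in the patent: it assumes that candidate words are simply all character n-grams up to a fixed length, that "frequency" means the raw occurrence count, and that entropy is Shannon entropy in nats. The function names `candidate_features` and `entropy` and the parameter `max_len` are hypothetical, and the step that combines the four quantities into a weight is deliberately left out.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in nats) of a Counter of neighbouring characters."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def candidate_features(text, max_len=4):
    """Return {candidate: (length, frequency, left_entropy, right_entropy)}.

    Assumption: every character n-gram of length 1..max_len is a candidate word.
    """
    counts = Counter()
    left = defaultdict(Counter)   # candidate -> counts of the preceding character
    right = defaultdict(Counter)  # candidate -> counts of the following character
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            counts[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + n < len(text):
                right[gram][text[i + n]] += 1
    return {
        gram: (len(gram), freq, entropy(left[gram]), entropy(right[gram]))
        for gram, freq in counts.items()
    }
```

A candidate that is always followed by the same character gets a right entropy of zero, which suggests it is a fragment of a longer word; a candidate followed by many different characters gets a high right entropy.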
Further optionally, in the device of this embodiment, the computing unit 102 specifically calculates the weight of each of the candidate words obtained by the splitting unit 101 using the following formula:
[Formula not reproduced in the source text.]
where W is the weight of the candidate word, TF is the frequency with which the candidate word occurs in the article, Ha is the entropy of the left-adjacent character set, Hb is the entropy of the right-adjacent character set, and L is the length of the candidate word.
Further optionally, in the device of this embodiment, the generation unit 103 is specifically configured to take M candidate words from the candidate words obtained by the splitting unit 101, in descending order of the weights calculated by the computing unit 102, and to generate the first topic word vector from them.
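Selecting the M highest-weighted candidate words can be sketched as below. The function name `build_topic_word_vector`, the dictionary representation of the weights, and the alphabetical tie-break are assumptions made for illustration only.

```python
def build_topic_word_vector(weights, m):
    """Return the m highest-weighted candidates as (word, weight) pairs.

    weights: dict mapping candidate word -> weight.
    Assumption: ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:m]
```

For example, with weights `{"sword": 3.0, "the": 0.5, "dragon": 2.0}` and `m = 2`, the vector keeps "sword" and "dragon" and drops "the".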
Further optionally, in the device of this embodiment, the first determining module 10 further includes a statistics unit 104 and an updating unit 105.
The statistics unit 104 is connected to the splitting unit 101. After the computing unit 102 has calculated the weight of each candidate word, and before the generation unit 103 generates the first topic word vector from the candidate words and their weights, the statistics unit 104 counts the document frequency of each of the candidate words obtained by the splitting unit 101. The document frequency of a candidate word is the number of articles, among the N articles included in the article pool, in which the candidate word appears. The updating unit 105 is connected to the statistics unit 104 and the computing unit 102, and is configured to update the weight of each candidate word according to the document frequency counted by the statistics unit 104, the number N of articles included in the article pool, and the weight calculated by the computing unit 102.
In this case the generation unit 103 is correspondingly connected to the updating unit 105, and is configured to generate the first topic word vector from the candidate words obtained by the splitting unit 101 and the updated weight of each candidate word produced by the updating unit 105.
For example, the updating unit 105 specifically calculates the weight of each candidate word using the following formula:
W = W * log(N / DF), where W is the weight of the candidate word, N is the number of articles included in the article pool, and DF is the document frequency of the candidate word. The W on the left-hand side of the equals sign is the updated weight of the candidate word; the W on the right-hand side is the weight calculated in step 202, that is, the weight before the update.
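The update W = W * log(N / DF) can be sketched in a few lines. The text does not state the base of the logarithm, so the natural logarithm is assumed here; the function name `update_weights` and the handling of a candidate word with no recorded document frequency are likewise assumptions.

```python
import math


def update_weights(weights, doc_freq, n_articles):
    """Apply W = W * log(N / DF) to every candidate word.

    weights:    dict candidate word -> weight before the update
    doc_freq:   dict candidate word -> number of pool articles containing it
    n_articles: N, the number of articles in the article pool
    """
    updated = {}
    for word, w in weights.items():
        # Assumption: a word with no recorded document frequency is treated
        # as appearing in one article, which avoids division by zero.
        df = max(doc_freq.get(word, 0), 1)
        updated[word] = w * math.log(n_articles / df)
    return updated
```

A candidate word that appears in every article of the pool has N / DF = 1, so its updated weight becomes zero, while a word that appears in few articles keeps most of its weight or gains some.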
Further optionally, in the device of this embodiment, the computing module 12 may specifically be connected to the generation unit 103, and calculates the similarity between the first topic word vector generated by the generation unit 103 and the second topic word vector using the following formula:
[Formula not reproduced in the source text.]
where D denotes the first topic word vector and D_i denotes the i-th topic word in the first topic word vector; Q denotes the second topic word vector and Q_i denotes the i-th topic word in the second topic word vector; M denotes the number of topic words contained in each of the first and second topic word vectors; and Sim(D, Q) denotes the similarity between the first topic word vector and the second topic word vector.
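As an illustration of comparing two topic word vectors, the sketch below uses cosine similarity. This is an assumption: the similarity formula is not reproduced in this text, and cosine similarity is only a common choice for weighted term vectors, so it may differ from the formula the patent uses. The dictionary representation and the function name `cosine_similarity` are also hypothetical.

```python
import math


def cosine_similarity(d, q):
    """Cosine similarity between two topic word vectors.

    d, q: dicts mapping topic word -> weight. A word missing from one
    vector is treated as having weight zero in that vector.
    """
    shared = set(d) & set(q)
    dot = sum(d[w] * q[w] for w in shared)
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty or all-zero vector is similar to nothing
    return dot / (norm_d * norm_q)
```

Under this choice, two chapters that share no topic words score 0, and two chapters with proportional weights over the same topic words score 1.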
Specifically, the second determining module 11, like the first determining module 10 described above, also includes a splitting unit 101, a computing unit 102, a generation unit 103, a statistics unit 104 and an updating unit 105, and uses them to determine the second topic word vector in the same way that the first topic word vector is determined; for details, refer to the description of the above embodiment, which is not repeated here.
Further optionally, in the device of this embodiment, the judgment module 13 is specifically configured to judge the magnitude relationship between the similarity calculated by the computing module 12 and the preset similarity threshold: when the similarity is greater than or equal to the preset similarity threshold, the new chapter is determined to be a valid chapter of the article; when the similarity is less than the preset similarity threshold, the new chapter is determined to be a fake chapter of the article.
Further optionally, the device of this embodiment also includes a filtering module 14. The filtering module 14 is connected to the judgment module 13 and is configured to filter out the new chapter of the article after the judgment module 13 determines that the new chapter is a fake chapter of the article.
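The threshold judgment and the subsequent filtering can be sketched as follows. The function names and the representation of new chapters as (chapter id, similarity) pairs are assumptions for illustration; the threshold value is whatever the operator presets, and none is given in the text.

```python
def is_fake_chapter(similarity, threshold):
    """A new chapter is fake when its similarity is below the preset threshold."""
    return similarity < threshold


def chapters_to_push(new_chapters, threshold):
    """Filter out fake chapters and keep the valid ones in their original order.

    new_chapters: list of (chapter_id, similarity) pairs, where similarity is
    the score of the new chapter against the detected chapters.
    """
    return [cid for cid, sim in new_chapters
            if not is_fake_chapter(sim, threshold)]
```

A similarity exactly equal to the threshold counts as valid, matching the "greater than or equal to" condition in the text.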
All of the optional technical solutions in the device of this embodiment may be combined in any feasible manner to form optional embodiments of the present invention, which are not described one by one here.
The device of this embodiment uses the above modules to detect a new chapter of an article by the same mechanism as the related method embodiments described above; for details, refer to the description of those embodiments, which is not repeated here.
By using the above modules, the device of this embodiment carries out the entire detection flow for a new chapter of an article without manual intervention and at very low cost; because new chapters no longer need to be reviewed manually, labor costs are effectively reduced. Moreover, by intelligently analyzing the detected chapters and the new chapter of the article, the technical solution of this embodiment can accurately determine whether the new chapter is a fake chapter. Online recognition takes only milliseconds and does not affect the chapter push speed at all, so a new chapter that is a valid chapter can be pushed promptly, which effectively guarantees the push efficiency for new chapters of the article.
An embodiment of the present invention may also provide an article aggregation platform on which the device for detecting a new chapter of an article of the embodiment shown in Fig. 3 or Fig. 4 is provided. The device may specifically detect new chapters using the method of the embodiment shown in Fig. 1 or Fig. 2; for details, refer to the description of the related embodiments above, which is not repeated here.
It should be noted that, when the device provided by the above embodiments detects a new chapter of an article, the division into the functional modules described above is only an example. In practical applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the device embodiments and the method embodiments for detecting a new chapter of an article belong to the same concept; for the specific implementation process, see the method embodiments, which is not repeated here.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (15)

Application CN201910649833.3A; priority date 2013-06-06; filing date 2013-06-06; title: The detection method and device of the new chapters and sections of article; status: Pending; publication: CN110347931A (en)

Priority Applications (1)

Application Number; Publication; Priority Date; Filing Date; Title
CN201910649833.3A; CN110347931A (en); 2013-06-06; 2013-06-06; The detection method and device of the new chapters and sections of article

Applications Claiming Priority (2)

Application Number; Publication; Priority Date; Filing Date; Title
CN201310223253.0A; CN104239285A (en); 2013-06-06; 2013-06-06; The detection method and device of the new chapter of the article
CN201910649833.3A; CN110347931A (en); 2013-06-06; 2013-06-06; The detection method and device of the new chapters and sections of article

Related Parent Applications (1)

Application Number; Relation; Publication; Priority Date; Filing Date; Title
CN201310223253.0A; Division; CN104239285A (en); 2013-06-06; 2013-06-06; The detection method and device of the new chapter of the article

Publications (1)

Publication Number; Publication Date
CN110347931A (en); 2019-10-18

Family

ID=52227382

Family Applications (2)

Application Number; Status; Publication; Priority Date; Filing Date; Title
CN201910649833.3A; Pending; CN110347931A (en); 2013-06-06; 2013-06-06; The detection method and device of the new chapters and sections of article
CN201310223253.0A; Pending; CN104239285A (en); 2013-06-06; 2013-06-06; The detection method and device of the new chapter of the article

Family Applications After (1)

Application Number; Status; Publication; Priority Date; Filing Date; Title
CN201310223253.0A; Pending; CN104239285A (en); 2013-06-06; 2013-06-06; The detection method and device of the new chapter of the article

Country Status (1)

Country; Link
CN (2); CN110347931A (en)

Also Published As

Publication number; Publication date
CN104239285A (en); 2014-12-24

Legal Events

Code; Title
PB01; Publication
SE01; Entry into force of request for substantive examination
