This application is a divisional application of Chinese invention patent application No. 201010259001.X, filed on August 20, 2010 and entitled "A method and device for parsing Internet web page content".
Summary of the invention
In view of this, the present invention provides a method and a device for generating a web page template, capable of generating the best-suited template for parsing web pages.
An embodiment of the present invention provides a method for generating a web page template, comprising the following steps:
Obtaining a predetermined number of web pages whose addresses lie in the same directory;
Splitting each of the web pages into several blocks and calculating a feature value for each block;
Counting the occurrences of the calculated feature values;
Saving the feature values whose number of occurrences exceeds a preset threshold into a feature value library, as the feature values of the template part.
In some optional embodiments, when the web pages are split into blocks, Document Object Model (DOM) nodes are used as the split points.
In some optional embodiments, when the web pages are split into blocks, the content of each block is no shorter than 20 bytes.
In some optional embodiments, the feature value of each block is calculated by applying a hash operation to the content of the block.
An embodiment of the present invention further provides a device for generating a web page template, comprising:
An acquisition module, configured to obtain a predetermined number of web pages whose addresses lie in the same directory;
A computing module, configured to split each of the web pages into several blocks and calculate a feature value for each block;
A statistics module, configured to count the occurrences of the calculated feature values;
A generation module, configured to save the feature values whose number of occurrences exceeds a preset threshold into a feature value library, as the feature values of the template part.
In some optional embodiments, the computing module is specifically configured to use Document Object Model (DOM) nodes as the split points when splitting the web pages into blocks.
In some optional embodiments, the computing module is specifically configured to split the web pages into blocks whose content is each no shorter than 20 bytes.
In some optional embodiments, the computing module is specifically configured such that:
The feature value of each block is calculated by applying a hash operation to the content of the block.
The present invention provides a method for generating a web page template. A predetermined number of web pages whose addresses lie in the same directory are obtained; then, based on the statistics of the feature values of the blocks of these pages, the feature values whose number of occurrences exceeds a preset threshold are taken as the feature values of the template part, so that the best-suited template is generated for parsing web pages. The present invention overcomes the shortcomings of existing methods: the generated template better meets user needs, and when it is used in web page parsing, only the real content of a page is parsed, which reduces interference from junk information, improves the accuracy and precision of web page analysis, and greatly improves the effectiveness of web page analysis.
Brief description of the drawings
Fig. 1 is a flow chart of the method for parsing Internet web page content provided in an embodiment of the present invention;
Fig. 2 is a flow chart of the method for generating a web page template provided in an embodiment of the present invention;
Fig. 3 is a detailed flow chart of generating a new template in an embodiment of the present invention;
Fig. 4 is a schematic diagram of a device for parsing Internet web page content in an embodiment of the present invention.
Detailed description of the embodiments
In view of the defects of the prior art, the present invention provides a method for parsing Internet web page content. For each website, and even for the different channels and pages of each website, web pages can be analyzed and processed in a targeted way: the method automatically determines whether a web page is generated from a template and automatically generates the template corresponding to the page, so that the best-suited template is used to parse the page. The present invention overcomes the shortcomings of existing methods: only the real content of a web page is parsed, which reduces interference from junk information, improves the accuracy and precision of web page analysis, and greatly improves the effectiveness of web page analysis.
With reference to Fig. 1, a method for parsing Internet web page content provided by an embodiment of the present invention comprises the following steps:
S11, determining whether the web page to be parsed is generated from a template; if the page is not generated from a template, going to step S12; otherwise, going to step S13;
S12, parsing the web page in a default manner;
S13, querying whether a template matching the web page to be parsed already exists in the web page template library;
If a template matching the web page to be parsed already exists in the web page template library, performing step S15, in which the template corresponding to the page is used to parse its content; otherwise, performing step S14;
S14, generating a web page template corresponding to the web page to be parsed, and adding the generated template to the web page template library;
S15, using the template corresponding to the web page to be parsed to parse the content of the page;
For a new web page, the template generated for it is used to parse the page.
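The control flow of steps S11 to S15 can be sketched as a small dispatcher. This is only an illustrative sketch, not the implementation of the embodiment: the function name `parse_page`, the dictionary used as the template library, and all of the callables passed in are hypothetical placeholders for the operations described in the steps above.

```python
from typing import Callable, Dict, Set


def parse_page(
    url: str,
    html: str,
    library: Dict[str, Set[str]],
    is_template_generated: Callable[[str], bool],
    directory_of: Callable[[str], str],
    generate_template: Callable[[str], Set[str]],
    parse_default: Callable[[str], str],
    parse_with_template: Callable[[str, Set[str]], str],
) -> str:
    """Hypothetical dispatcher following steps S11 to S15."""
    # S11: decide from the URL whether the page is generated from a template.
    if not is_template_generated(url):
        # S12: pages not generated from a template are parsed in the default manner.
        return parse_default(html)

    # S13: query the template library, keyed by the directory string of the URL.
    key = directory_of(url)
    template = library.get(key)

    if template is None:
        # S14: generate a template for this directory and add it to the library.
        template = generate_template(key)
        library[key] = template

    # S15: parse the page with the matching (or newly generated) template.
    return parse_with_template(html, template)
```

Passing the individual operations in as parameters keeps the sketch independent of how URL judgment, page splitting and hashing are actually realized.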
In step S11, the web page template library is built in advance and is initialized before the first query.
Whether the web page to be parsed is generated from a template is determined by examining its Uniform Resource Locator (URL), which specifically includes:
Judging according to the rules by which the URL is generated; or
Judging by identifying whether the URL contains a directory marker.
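The second criterion, checking the URL for a directory marker, might be approximated as follows. The text does not fix an exact rule, so this sketch assumes that a URL carries a directory marker when its path has at least one directory segment in front of a final file name; the function name `has_directory_marker` is hypothetical.

```python
from urllib.parse import urlsplit


def has_directory_marker(url: str) -> bool:
    """Return True if the URL path has a directory part before its last segment.

    Assumption: a bare host such as "http://news.sina.com.cn" (empty path)
    or a site root ("/") is treated as belonging to no directory.
    """
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    # At least one directory segment plus one page segment is required.
    return len(segments) >= 2
```

Under this assumption, a channel front page is handed to the default parser, while an article URL that sits under dated directories is treated as generated from a template.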
In step S13, querying whether a template matching the web page already exists in the template library specifically includes:
Obtaining the character string that indicates the directory in the URL of the web page;
Using this character string to query the template library.
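A minimal sketch of this lookup, assuming the template library is held as an in-memory dictionary that maps a directory string to a set of feature values. The names `directory_of` and `find_template` are hypothetical, and the directory string is taken to be the part of the URL before its last "/".

```python
from typing import Dict, Optional, Set


def directory_of(url: str) -> str:
    """Return the part of the URL before its last '/'.

    Intended only for URLs already judged to lie in a directory.
    """
    return url.rsplit("/", 1)[0]


def find_template(library: Dict[str, Set[str]], url: str) -> Optional[Set[str]]:
    """Look up the template for the directory of `url`; None if absent."""
    return library.get(directory_of(url))
```

A `None` result corresponds to the case in which no matching template exists yet and one has to be generated.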
In step S15, the template corresponding to the web page to be parsed is used to parse the content of the page, specifically as follows:
The web page is split into blocks, and the feature value of each block is calculated;
Each feature value is looked up in the template corresponding to the page;
If the feature value already exists in the template, the block corresponding to that feature value does not need to be parsed;
If the feature value does not exist in the template, the block corresponding to that feature value is parsed in the default manner.
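The filtering described above can be sketched as follows, assuming the page has already been split into blocks and that the feature value is an MD5 digest of the block content (MD5 is the example hash named later in the description). The names `feature_value`, `parse_with_template` and the `parse_default` callback are hypothetical.

```python
import hashlib
from typing import Callable, Iterable, List, Set


def feature_value(block: str) -> str:
    """Feature value of a block: here, the MD5 hex digest of its UTF-8 bytes."""
    return hashlib.md5(block.encode("utf-8")).hexdigest()


def parse_with_template(
    blocks: Iterable[str],
    template: Set[str],
    parse_default: Callable[[str], str],
) -> List[str]:
    """Parse only the blocks whose feature value is not in the template."""
    results = []
    for block in blocks:
        if feature_value(block) in template:
            # Block belongs to the template part: skip it.
            continue
        # Block is page-specific content: parse it in the default manner.
        results.append(parse_default(block))
    return results
```

Because the template is a set of digests, each membership test is a constant-time lookup on average, regardless of how many feature values the template holds.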
The page-splitting method used when generating a web page template is the same as the page-splitting method used when parsing web page content with a template.
In step S14, generating the web page template corresponding to the web page to be parsed specifically includes:
(a) obtaining other web pages in the same directory as the address of the web page to be parsed, the number of pages selected reaching the required predetermined threshold;
(b) splitting each selected page in this directory into blocks, each block yielding one feature value, so that each page corresponds to multiple feature values;
(c) counting all feature values of all pages in this directory, obtaining the feature values whose frequency of occurrence is higher than the threshold, and saving them in the template library.
In step S14, adding the generated web page template to the web page template library includes:
Obtaining the character string that indicates the directory in the URL of the web page;
Adding this character string, together with all feature values in this directory whose frequency of occurrence is higher than the preset threshold, to the template library as a key-value pair.
With reference to Fig. 2, an embodiment of the present invention further provides a method for generating a web page template, comprising the following steps:
S21, obtaining a predetermined number of web pages whose addresses lie in the same directory;
S22, splitting each of the web pages into several blocks and calculating a feature value for each block;
When the web pages are split into blocks, Document Object Model (DOM) nodes are used as the split points.
When the web pages are split into blocks, the content of each block is no shorter than 20 bytes.
The feature value of each block is calculated by applying a hash operation to the content of the block.
S23, counting the occurrences of the calculated feature values;
S24, saving the feature values whose number of occurrences exceeds a preset threshold into a feature value library, as the feature values of the template part.
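Steps S22 to S24 can be sketched as follows, assuming the pages of one directory have already been fetched and split into blocks. Two points are assumptions of this sketch rather than statements of the method: MD5 is used as the hash operation, and each feature value is counted at most once per page. The names `feature_value` and `generate_template` are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Iterable, List, Set


def feature_value(block: str) -> str:
    """Feature value of a block: here, the MD5 hex digest of its UTF-8 bytes."""
    return hashlib.md5(block.encode("utf-8")).hexdigest()


def generate_template(pages: Iterable[List[str]], threshold: int) -> Set[str]:
    """Return the feature values occurring in more than `threshold` pages.

    `pages` holds, for each page in the directory, its list of blocks.
    """
    counts: Counter = Counter()
    for blocks in pages:
        # Assumption: a value repeated within one page counts only once.
        counts.update({feature_value(b) for b in blocks})
    # S24: keep only the values whose count exceeds the threshold.
    return {value for value, n in counts.items() if n > threshold}
```

Blocks such as navigation bars that recur across the sampled pages end up in the returned set, while page-specific blocks do not.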
To make the principles, features and advantages of the present invention clearer, they are described below with reference to a specific embodiment.
In this embodiment, suppose the web page to be analyzed is http://news.sina.com.cn; this URL and the corresponding original page are fed into the system for processing. Assume that the number of templates in the common template library is initially 0 (that is, at the very start no template has been generated). First, the system determines from the Uniform Resource Locator (URL) whether the page is generated from a template. A URL (short for Uniform/Universal Resource Locator), also called a web page address, is the standard address of a resource on the Internet. According to the rules by which the URL is generated, it can be determined that this URL is the news channel page of sina.com.cn, so it is not generated from a template. In this case, processing can fall back to the method that uses no template. Alternatively, another principle also shows that the page is not generated from a template: because this URL contains no "/" after the host name, i.e. no directory marker, the URL is considered not to belong to any directory, and hence not to be generated from a template. Here too the system returns directly and parses the page in the ordinary manner.
For the web page http://news.sina.com.cn/h/2010-07-15/141820685517.shtml, on the other hand, the rules by which the URL is generated make it easy to determine that its directory is "http://news.sina.com.cn/h/2010-07-15", i.e. the part before the last "/". This character string is used to query the template library. Because no template has yet been generated in the common template library, the string has no corresponding template; in this case the template generation module is called to generate a new template:
As shown in Fig. 3, in this embodiment the specific flow for generating a new template is as follows:
S31, obtaining other web pages in the same directory, such as http://news.sina.com.cn/h/2010-07-15/075320682851.shtml; the number of pages must exceed the minimum number of pages required to generate a template, and if this fails the flow returns.
S32, splitting all of the obtained pages in this directory into blocks, each block yielding one feature value (MD5 value), so that each page corresponds to multiple feature values (MD5 values).
S33, counting all feature values of all pages in this directory to obtain the feature values whose frequency of occurrence is higher than the threshold.
S34, adding the directory string, together with the feature values from S33 whose frequency of occurrence is higher than the threshold, to the existing template library. The parsing template corresponding to the web page to be parsed is thus generated.
In step S31, from a known URL such as
http://news.sina.com.cn/h/2010-07-15/075320682851.shtml, it can be learned that its directory is http://news.sina.com.cn/h/2010-07-15; by traversing this directory, the other web pages in it can be obtained.
In step S32, regarding the splitting of web pages and the generation of block feature values: web page code generally follows the HTML standard and corresponds to a DOM model, which is made up of a number of content nodes.
When a page is split into blocks, natural nodes are used as split points; in general the natural cuts should be made at tags such as tr, td and div. The length of the content of each block is generally kept no shorter than 20 bytes.
In the actual splitting, scanning can start from the first character of the page and look for the designated nodes (for example td, tr and div). When such a node is encountered, its position is set as the start of a block. The next position is then found in the same way. If the distance between the two adjacent positions is greater than the set minimum length (20 here), the part between the two positions is taken as one block, and a fingerprint is generated for that block. The end position of this block is at the same time the start position of the next block. If the distance between the adjacent positions is less than the minimum length, the scan continues to the next node (the node in between is treated as invalid) until a node is found whose distance from the node at the start of the block exceeds the minimum length (or the end of the page is reached).
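The scanning procedure above might be sketched as follows. Several details are assumptions of this sketch: boundaries are located with a regular expression that matches only opening tr, td and div tags instead of walking a real DOM tree, lengths are measured in UTF-8 bytes, a block is accepted when it is at least 20 bytes long, and any text before the first boundary is ignored. The names `BOUNDARY`, `MIN_LENGTH` and `split_blocks` are hypothetical.

```python
import re
from typing import List

# Assumed boundary nodes: opening <tr>, <td> and <div> tags.
BOUNDARY = re.compile(r"<(?:tr|td|div)\b", re.IGNORECASE)
MIN_LENGTH = 20  # minimum block length in bytes


def split_blocks(html: str, min_length: int = MIN_LENGTH) -> List[str]:
    """Split a page at boundary nodes, merging pieces shorter than min_length."""
    positions = [m.start() for m in BOUNDARY.finditer(html)]
    if not positions:
        return []

    blocks = []
    start = positions[0]  # the first boundary node opens the first block
    for pos in positions[1:]:
        if len(html[start:pos].encode("utf-8")) >= min_length:
            # Long enough: close the block here; its end starts the next block.
            blocks.append(html[start:pos])
            start = pos
        # Otherwise the node at `pos` is ignored and scanning continues.

    # The end of the page closes the last block, whatever its length.
    blocks.append(html[start:])
    return blocks
```

The same splitter would need to be used both when a template is generated and when a page is parsed with it; otherwise the block boundaries, and hence the feature values, would not line up.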
As for generating the feature value itself: to ensure that different blocks have different feature values, a relatively reliable hash (digest) algorithm is generally chosen, for example the MD5 algorithm.
In step S33, the number of web pages in the directory is counted first, and then the feature values of all blocks of the pages in the directory are counted. If the number of occurrences of a feature value exceeds the preset threshold, the block corresponding to that feature value appears in many pages, so its content is of little value and is likely to be advertising information, navigation information and the like. All feature values whose frequency of occurrence exceeds the threshold are stored in the template library.
If another web page in the same directory is encountered later, such as:
http://news.sina.com.cn/h/2010-07-15/075320682851.shtml,
then, similarly, the directory of this URL is obtained:
http://news.sina.com.cn/h/2010-07-15,
and this character string is used to query the template library. Because the template corresponding to this string exists, it can be found in the template library. The content of the page
http://news.sina.com.cn/h/2010-07-15/075320682851.shtml is then split into blocks, and an MD5 value is generated for each block. Each MD5 value is looked up in the template corresponding to the above string, i.e. in its sequence of feature values. If the MD5 value exists in the template, the block is a valueless block and is not parsed; if the MD5 value cannot be found, the block is a meaningful part of the page, and it is parsed in the default manner.
With reference to Fig. 4, an embodiment of the present invention further provides a device 40 for parsing Internet web page content, comprising the following modules:
A judging module 41, configured to determine whether the web page to be parsed is generated from a template;
A storage module 42, configured to store the web page template library;
A first query module 43, configured to query whether a template corresponding to the web page to be parsed already exists in the web page template library;
A second query module 44, configured to query whether a given feature value exists in the template corresponding to the web page to be parsed;
A generation module 45, configured to generate the template corresponding to the web page to be parsed;
A first parsing module 46, configured to parse the web page to be parsed in the default manner;
A second parsing module 47, configured to parse a given block of the web page to be parsed in the default manner;
A presetting module 48, configured to set the specific parsing manner of the first parsing module 46 and the second parsing module 47.
The workflow of this device is essentially the same as that of the foregoing method and is not repeated here.
An embodiment of the present invention further provides a device for parsing Internet web page content, comprising the following modules:
A judging module, configured to determine whether the web page to be parsed is generated from a template;
A query module, configured to query, if the page is generated from a template, whether a template matching the web page to be parsed already exists in the web page template library;
A generation module, configured to generate, if no template matching the web page to be parsed exists in the web page template library, a web page template corresponding to the page, and to add the generated template to the web page template library;
A parsing module, configured to: if a template matching the web page to be parsed already exists in the web page template library, use the template corresponding to the page to parse its content; and if no template matching the web page to be parsed exists in the web page template library, use the template generated by the generation module to parse the page.
An embodiment of the present invention further provides a device for generating a web page template, comprising:
An acquisition module, configured to obtain a predetermined number of web pages whose addresses lie in the same directory.
A computing module, configured to split each of the web pages into several blocks and calculate a feature value for each block.
A statistics module, configured to count the occurrences of the calculated feature values.
A generation module, configured to save the feature values whose number of occurrences exceeds a preset threshold into a feature value library, as the feature values of the template part.
The computing module is specifically configured to use Document Object Model (DOM) nodes as the split points when splitting the web pages into blocks.
The computing module is specifically configured to split the web pages into blocks whose content is each no shorter than 20 bytes.
The computing module is specifically configured to calculate the feature value of each block by applying a hash operation to the content of the block.
In summary, the present invention provides a method for parsing Internet web page content. When the web page to be parsed is generated from a template: if a template matching the page already exists in the web page template library, the template corresponding to the page is used to parse its content; otherwise, a web page template corresponding to the page is generated, added to the web page template library, and used to parse the page. According to the present invention, for each website, and even for the different channels and pages of each website, web pages can be analyzed and processed in a targeted way: the method automatically determines whether a web page is generated from a template and automatically generates the template corresponding to the page, so that the best-suited template is used to parse the page. The present invention overcomes the shortcomings of existing methods: only the real content of a web page is parsed, which reduces interference from junk information, improves the accuracy and precision of web page analysis, and greatly improves the effectiveness of web page analysis.
The disclosed embodiments enable those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments without departing from the scope and spirit of the present invention. The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.