Summary of the invention
The object of the invention is to address the above problem by providing a Web data extraction method based on visually customized extraction templates, which offers a visual, user-friendly mode of interaction.
To achieve this object, the invention adopts the following technical scheme:
A Web data extraction method based on visually customized extraction templates comprises the following steps:
A. Preprocessing of the template page.
B. Visual customization of the extraction template.
C. Configuration of the batch extraction frequency.
D. Batch extraction of pages.
Preprocessing of the template page means converting and displaying the source code of the template page: the HTML source code of the template page is analyzed, its DOM tree structure is parsed, and the page is converted into XML form and shown in the user interface;
Visual customization of the extraction template means that the user interface provides a drag-to-select function with which the user establishes, on the template page, the correspondence between the attribute labels and data values on the page and the attributes in the domain model, thereby building the extraction template;
Configuration of the batch extraction frequency means that batch extraction is run on the crawled HTML pages once every fixed interval (for example, every 8 hours);
Batch extraction of pages means that the corresponding extraction template is used to extract data in batches from the large number of crawled HTML pages; the semi-structured data in them is converted into structured data and saved to a local database.
The conversion and display of the template page source code in step A specifically comprises the following steps:
A1. Analyze the HTML source code of the supplied template page and convert it into a page file that conforms to the XML standard.
A2. Parse the complete Document Object Model (DOM) structure of the page and display it in the user interface.
A3. Add the necessary JavaScript control code to the converted page, without destroying the original structure of the page, so that the page can be marked.
A4. Display the XML-form page produced by the above steps in the user interface, where the user uses it for visual customization of the template.
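Step A1 can be illustrated with a minimal sketch that converts tolerant HTML into well-formed XML using only the Python standard library. The class name HtmlToXmlBuilder, the function name html_to_xml, and the choice to upper-case tag names are assumptions made for illustration; the invention does not prescribe a particular parser, and a production converter would need to handle more malformed input than this one does.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HtmlToXmlBuilder(HTMLParser):
    """Builds an ElementTree from HTML, closing unclosed elements on the way."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = None
        self.stack = []  # currently open elements, outermost first

    def handle_starttag(self, tag, attrs):
        attrib = {name: (value or "") for name, value in attrs}
        if self.stack:
            node = ET.SubElement(self.stack[-1], tag.upper(), attrib)
        elif self.root is None:
            node = self.root = ET.Element(tag.upper(), attrib)
        else:
            return  # ignore markup after the root element has closed
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        name = tag.upper()
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].tag == name:
                del self.stack[i:]  # also closes unclosed descendants
                break

    def handle_data(self, data):
        if not self.stack:
            return
        parent = self.stack[-1]
        if len(parent):  # text that follows a child element is that child's tail
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html: str) -> str:
    """Return the page as a well-formed XML string."""
    builder = HtmlToXmlBuilder()
    builder.feed(html)
    builder.close()
    if builder.root is None:
        raise ValueError("no element found in the HTML input")
    return ET.tostring(builder.root, encoding="unicode")


xml_text = html_to_xml("<html><body><p>Hi<br>there</body></html>")
```

The unclosed p element and the void br element in the sample input are both closed in the output, so the result can be parsed by any XML parser and addressed with XPath.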
The visual customization of the extraction template in step B specifically comprises the following steps:
B1. After opening the template page, the user drags the mouse to select a data item to be extracted; the program determines the XPath of the selected data item and records it.
B2. If the data item also has a corresponding page label in the page, the user selects that page label by dragging as well; the program records the XPath of the page label and its text content, and combines them with the XPath of the selected data item into one extraction rule. If the data item has no corresponding page label, nothing further needs to be selected.
B3. According to the domain model, the user selects an attribute label for the extraction rule formed in steps B1 and B2. This label belongs to the domain model established in advance and matches the semantics of the data item to which the rule applies; it identifies the semantics of that data item, which in essence maps the page data item to a column in the data table.
B4. Repeat steps B1 to B3 until all data to be extracted have been marked, then save the resulting set of extraction rules as a page extraction template.
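The extraction rule assembled in steps B1 to B3 can be pictured as a small record, and the template of step B4 as a list of such records. The sketch below is one possible representation; the field names, the ExtractionRule class, and the use of JSON as the storage format are assumptions for illustration and are not specified by the invention.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class ExtractionRule:
    attribute: str                     # attribute label chosen from the domain model (B3)
    data_xpath: str                    # XPath of the selected data item (B1)
    label_text: Optional[str] = None   # text of the page label, if any (B2)
    label_xpath: Optional[str] = None  # XPath of the page label, if any (B2)


def template_to_json(rules: List[ExtractionRule]) -> str:
    """Serialize a set of extraction rules as one extraction template (B4)."""
    return json.dumps([asdict(rule) for rule in rules], ensure_ascii=False, indent=2)


def template_from_json(text: str) -> List[ExtractionRule]:
    """Load an extraction template saved by template_to_json."""
    return [ExtractionRule(**item) for item in json.loads(text)]


# Hypothetical two-rule template: one rule without a page label, one with.
template = [
    ExtractionRule("position_title", "/HTML/BODY[1]/H1[1]"),
    ExtractionRule("education", "/HTML/BODY[1]/TABLE[1]/TR[1]/TD[2]",
                   label_text="Education",
                   label_xpath="/HTML/BODY[1]/TABLE[1]/TR[1]/TD[1]"),
]
saved = template_to_json(template)
```

Keeping the page label text alongside its XPath is what later allows extraction to detect that a label is missing or has moved.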
The batch extraction of pages in step D specifically comprises the following steps, labelled C1 to C3:
C1. Convert the current page to be extracted into a standard XML file.
C2. Extract the required data items using the extraction rules recorded in the extraction template, which are in essence XPath paths.
C3. Save each extracted data item into the database table column that corresponds to the data label of its extraction rule.
Step C2 can be further subdivided into the following steps:
C2-1. Select an extraction rule that has not yet been used.
C2-2. If this extraction rule records no page label information, read the text content directly at the XPath of the data item, mark the rule as used, and go to step C2-8. If the rule does record page label information, go to step C2-3.
C2-3. Extract the text at the XPath recorded for this page label. If the extraction succeeds, go to step C2-4. If it fails, the data item corresponding to this page label may be missing from the current page or may have moved; go to step C2-7.
C2-4. Compare the extracted text with the page label text recorded in the extraction rule. If they match, extract the data at the data item XPath recorded in the rule, mark the rule as used, and go to step C2-8. If they do not match, the data item corresponding to this page label may be missing from the current page or may have moved; go to step C2-5.
C2-5. Check whether the extracted text matches the page label of some extraction rule that has not yet been used. If such a rule exists, the text may itself be a page label; go to step C2-6. Otherwise go to step C2-7.
C2-6. From the page label XPath and data item XPath recorded in that rule, calculate the XPath the data item would have if the text is a page label, and extract the data there. If the extracted data is not empty, mark that rule as used. Go to step C2-7.
C2-7. Starting from the XPath of the original page label, carry out an expanded search of the page for this page label. If it is not found, the data item corresponding to this label is probably missing from the current page. If it is found, calculate the XPath of the corresponding data item from the page label XPath and data item XPath recorded in the rule, and extract the data. Finally mark the original extraction rule as used and go to step C2-8.
C2-8. Repeat the above steps until all extraction rules have been used.
Step C2-3 guards against the situation in which the attributes of semi-structured data in a Web page appear in a non-fixed order or are missing. The expanded search ensures that no data is lost.
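The flow of steps C2-1 to C2-8 can be sketched as follows. This is an illustration under several assumptions that the text does not fix: paths use only the simple /NAME[i] form, the expanded search scans the whole page for an element whose own text equals the label, and the data item is assumed to keep the same sibling offset from its label as in the template. The names Rule, text_at, find_label, relocate and extract are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

STEP = re.compile(r"([A-Za-z0-9]+)(?:\[(\d+)\])?")


@dataclass
class Rule:
    attribute: str
    data_xpath: str
    label_text: Optional[str] = None
    label_xpath: Optional[str] = None


def text_at(root, xpath):
    """Text at an absolute path such as /HTML/BODY[1]/P[2], or None if absent."""
    steps = [STEP.fullmatch(s) for s in xpath.strip("/").split("/")]
    if None in steps or steps[0].group(1) != root.tag:
        return None
    node = root
    for step in steps[1:]:
        same = [child for child in node if child.tag == step.group(1)]
        index = int(step.group(2) or 1)
        if index > len(same):
            return None
        node = same[index - 1]
    return "".join(node.itertext()).strip() or None


def find_label(root, text):
    """Expanded search: path of an element whose own text equals the label."""
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        counts = {}
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            child_path = f"{path}/{child.tag}[{counts[child.tag]}]"
            if (child.text or "").strip() == text:
                return child_path
            stack.append((child, child_path))
    return None


def relocate(rule, new_label_xpath):
    """Data path when the rule's label sits at new_label_xpath (same offset assumed)."""
    last = lambda path: int(re.search(r"\[(\d+)\]$", path).group(1))
    offset = last(rule.data_xpath) - last(rule.label_xpath)
    head = new_label_xpath[: new_label_xpath.rindex("[")]
    return f"{head}[{last(new_label_xpath) + offset}]"


def extract(root, rules):
    result, used = {}, set()
    for i, rule in enumerate(rules):                  # C2-1, repeated per C2-8
        if i in used:
            continue
        if rule.label_text is None:                   # C2-2: no page label
            result[rule.attribute] = text_at(root, rule.data_xpath)
            used.add(i)
            continue
        seen = text_at(root, rule.label_xpath)        # C2-3
        if seen == rule.label_text:                   # C2-4: label is in place
            result[rule.attribute] = text_at(root, rule.data_xpath)
            used.add(i)
            continue
        if seen is not None:                          # C2-5: another rule's label?
            for j, other in enumerate(rules):
                if j != i and j not in used and other.label_text == seen:
                    value = text_at(root, relocate(other, rule.label_xpath))  # C2-6
                    if value:
                        result[other.attribute] = value
                        used.add(j)
                    break
        found = find_label(root, rule.label_text)     # C2-7: expanded search
        result[rule.attribute] = text_at(root, relocate(rule, found)) if found else None
        used.add(i)
    return result


ROW = "/HTML/BODY[1]/TABLE[1]/TR[1]"
RULES = [
    Rule("title", "/HTML/BODY[1]/H1[1]"),
    Rule("language", ROW + "/TD[2]", "Language", ROW + "/TD[1]"),
    Rule("education", ROW + "/TD[4]", "Education", ROW + "/TD[3]"),
]
# A page in which the "Language" cells are absent, so "Education" has shifted left.
PAGE = ET.fromstring(
    "<HTML><BODY><H1>Engineer</H1><TABLE><TR>"
    "<TD>Education</TD><TD>College</TD></TR></TABLE></BODY></HTML>"
)
record = extract(PAGE, RULES)
```

In this run the language rule finds "Education" where it expected "Language", extracts "College" through the education rule, and records the language value as missing after the expanded search fails.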
Beneficial effects of the invention:
1. For each data source, the invention uses a visual, user-driven customization method to design a parameterized, configurable wrapper that offers visual, user-friendly interaction; extraction from large numbers of crawled Web pages is then carried out automatically according to the wrapper.
2. The content and structure of Web pages change frequently, which causes previously generated extraction rules to fail. The invention addresses how to improve the adaptability of Web data extraction, so that extraction adjusts automatically to changes in the target page and the corresponding extraction rules are updated.
3. The data extraction method of the invention is widely applicable and precise, adapts to changes in Web pages, and can greatly improve extraction efficiency.
Embodiment
The invention is further described below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, a Web data extraction method based on visually customized extraction templates comprises the following steps:
A. Preprocessing of the template page;
B. Visual customization of the extraction template;
C. Setting of the batch extraction frequency;
D. Batch extraction of pages.
Step A, preprocessing of the template page, is the conversion and display of the template page source code: a program held in memory analyzes the HTML source code of the template page, parses its DOM tree structure, converts it into XML form, and shows it in the user interface on the display.
Step B, visual customization of the extraction template, means that the user interface provides a drag-to-select function with which the user establishes, on the template page, the correspondence between the attribute labels and data values on the page and the attributes in the domain model, thereby building the extraction template.
Step C, configuration of the batch extraction frequency, sets batch extraction of the crawled HTML pages to run once every 8 hours.
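The 8-hour frequency setting of step C amounts to a simple due-time check. The sketch below shows one way to express it; the function name is_due and the idea of storing the time of the last run are assumptions for illustration, and a real deployment might equally rely on an operating-system scheduler.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_due(last_run: Optional[datetime], now: datetime,
           interval_hours: float = 8.0) -> bool:
    """True when batch extraction has never run or the interval has elapsed."""
    if last_run is None:
        return True
    return now - last_run >= timedelta(hours=interval_hours)


# Example: the last batch ran at midnight, so the next is due from 08:00 onward.
midnight = datetime(2024, 1, 1, 0, 0)
due_at_eight = is_due(midnight, datetime(2024, 1, 1, 8, 0))
```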
As shown in Fig. 2, the conversion and display of the template page source code in step A specifically comprises the following steps:
A1. Analyze the HTML source code of the supplied template page and convert it into a page file that conforms to the XML standard;
A2. Parse the complete DOM structure of the page and display it in the user interface;
A3. Add the necessary JavaScript control code to the converted page, without destroying the original structure of the page, so that the page can be marked;
A4. Display the XML-form page produced by the above steps in the user interface, where the user uses it for visual customization of the template.
As shown in Fig. 3, the visual customization of the extraction template in step B specifically comprises the following steps:
B1. After opening the template page shown on the display, the user drags the mouse to select a data item to be extracted; the program determines the XPath of the selected data item and records it;
B2. If the data item also has a corresponding page label in the page, the user selects that page label by dragging as well; the program records the XPath of the page label and its text content, and combines them with the XPath of the selected data item into one extraction rule; if the data item has no corresponding page label, nothing further needs to be selected;
B3. According to the domain model, the user selects an attribute label for the extraction rule formed in steps B1 and B2; this label belongs to the domain model established in advance and matches the semantics of the data item to which the rule applies; it identifies the semantics of that data item, which in essence maps the page data item to a column in the data table;
B4. Repeat steps B1 to B3 until all data to be extracted have been marked, then save the resulting set of extraction rules as a page extraction template.
As shown in Fig. 4, the batch extraction of pages in step D specifically comprises the following steps, labelled C1 to C3:
C1. Convert the current page to be extracted into a standard XML file;
C2. Extract the required data items using the extraction rules recorded in the extraction template, which are in essence XPath paths;
C3. Save each extracted data item into the database table column that corresponds to the data label of its extraction rule.
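Step C3 can be sketched with the sqlite3 module from the Python standard library. The table name, the column names, and the function save_record are hypothetical; the sketch assumes that the data labels are trusted identifiers taken from the domain model, since they are placed directly into the SQL text as column names.

```python
import sqlite3


def save_record(conn, table, record):
    """Store one extracted page as a row; each data label becomes a column."""
    columns = list(record)
    column_sql = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    # Assumption: table and column names come from the domain model, not from the page.
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_sql})')
    conn.execute(
        f'INSERT INTO "{table}" ({column_sql}) VALUES ({placeholders})',
        [record[name] for name in columns],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
# A missing data item is stored as NULL rather than dropped.
save_record(conn, "jobs", {"position_title": "Engineer", "education": None})
rows = conn.execute("SELECT position_title, education FROM jobs").fetchall()
```

Values are passed as bound parameters, so text extracted from a page is never interpreted as SQL.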
As shown in Fig. 5, step C2 specifically comprises the following steps:
C2-1. Select an extraction rule that has not yet been used;
C2-2. If this extraction rule records no page label information, read the text content directly at the XPath of the data item, mark the rule as used, and go to step C2-8; if the rule does record page label information, go to step C2-3;
C2-3. Extract the text at the XPath recorded for this page label; if the extraction succeeds, go to step C2-4; if it fails, the data item corresponding to this page label is missing from the current page or has moved, so go to step C2-7;
C2-4. Compare the extracted text with the page label text recorded in the extraction rule; if they match, extract the data at the data item XPath recorded in the rule, mark the rule as used, and go to step C2-8; if they do not match, the data item corresponding to this page label is missing from the current page or has moved, so go to step C2-5;
C2-5. Check whether the extracted text matches the page label of some extraction rule that has not yet been used; if such a rule exists, treat the text as a page label and go to step C2-6; otherwise go to step C2-7;
C2-6. From the page label XPath and data item XPath recorded in that rule, calculate the XPath the data item would have if the text is a page label, and extract the data there; if the extracted data is not empty, mark that rule as used; go to step C2-7;
C2-7. Starting from the XPath of the original page label, carry out an expanded search of the page for this page label; if it is not found, the data item corresponding to this label is missing from the current page; if it is found, calculate the XPath of the corresponding data item from the page label XPath and data item XPath recorded in the rule, and extract the data; finally mark the original extraction rule as used and go to step C2-8;
C2-8. Repeat the above steps until all extraction rules have been used.
Step C2-3 handles the situation in which the attributes of semi-structured data in a Web page appear in a non-fixed order or are missing; the expanded search ensures that no data is lost.
In another embodiment of the invention, a certain website is chosen as the data source. Its detail page serves as the page template used to customize the extraction template; a screenshot of the main data area of the page is shown in Fig. 6.
Suppose the data to be extracted are the parts the user has manually marked, enclosed in rectangular frames in the figure.
The following 9 extraction rules are then obtained:
1. Data label: position title
Page label: none
Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[3]/TD[2]
2. Data label: recruiting company
Page label: none
Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[1]/TBODY[1]/TR[2]/TD[1]/TABLE[1]/TBODY[1]/TR[1]/TD[1]/STRONG[1]
3. Data label: date issued
Page label: date issued
Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[1]
Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[2]
4. Data label: work place
Page label: work place
Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[3]
Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[4]
5. Data label: number of recruits
Page label: number of recruits
Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[5]
Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[6]
6. Data label: working experience
Page label: length of service
Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[1]
Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[2]
7. Data label: language requirement
Page label: language requirement
Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[3]
Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[4]
8. Data label: educational background
Page label: educational requirement
Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[5]
Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[6]
9. Data label: salary level
Page label: salary range
Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[5]
Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[6]
With the extraction template formed by these 9 extraction rules, similar pages from this website can be extracted in batches.
Suppose a page from the same website (Fig. 7) is to be extracted:
This page lacks two of the data items to be extracted: language requirement and salary level. Analysis of the page code shows that extraction rules 1 to 6 remain valid and can be applied directly. When rule 7, "language requirement", is applied, the text found at the recorded page label XPath in the current page is the education label, which does not match the "language requirement" recorded in the rule. The education page label is, however, recorded in extraction rule 8, so the data item that follows it, "junior college", is extracted. An expanded search of the page for the page label "language requirement" is then carried out; since the page contains no such label, the search finds nothing. Thus, although the structure of the page being extracted differs from that of the page used to create the template, the data on the page are still correctly identified and extracted.
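The recalculation used for rule 8 in this example can be worked through numerically. The sketch assumes that a data item keeps the same sibling offset from its page label as in the template, which the text does not state explicitly; the function name relocate is hypothetical. The paths are those recorded for rules 7 and 8 above.

```python
import re


def last_index(path: str) -> int:
    """Positional index of the final step, e.g. 5 for .../TD[5]."""
    match = re.search(r"\[(\d+)\]$", path)
    if match is None:
        raise ValueError(f"path does not end in an index: {path}")
    return int(match.group(1))


def relocate(label_xpath: str, data_xpath: str, found_label_xpath: str) -> str:
    """Data item path when the label is found at found_label_xpath instead."""
    offset = last_index(data_xpath) - last_index(label_xpath)
    head = found_label_xpath[: found_label_xpath.rindex("[")]
    return f"{head}[{last_index(found_label_xpath) + offset}]"


ROW = "/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]"
# Rule 8 recorded its label at TD[5] and its data at TD[6]; in the new page the
# education label appears where rule 7 expected its own label, at TD[3].
new_data_xpath = relocate(ROW + "/TD[5]", ROW + "/TD[6]", ROW + "/TD[3]")
```

Under this assumption the education value is read from TD[4] of the same row, the cell immediately after the relocated label.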
Although specific embodiments of the invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the invention. Those skilled in the art should understand that modifications or variations which can be made on the basis of the technical scheme of the invention without creative effort still fall within the scope of protection of the invention.