CN102360368A

Info

Publication number: CN102360368A
Application number: CN2011103017759A
Authority: CN
Inventors: 李庆忠; 闫中敏; 彭朝晖; 蔡益清
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2011-10-09
Filing date: 2011-10-09
Publication date: 2012-02-22
Anticipated expiration: 2031-10-09
Also published as: CN102360368B

Abstract

The invention discloses a Web data extraction method based on visual customization of extraction templates, comprising the following steps: A. template page preprocessing; B. visual customization of the extraction template; C. setting the page batch extraction frequency; D. page batch extraction. Step A, template page preprocessing, is the conversion and display of the template page's source code. Step B, visual customization of the extraction template, provides a drag-to-select function in the user interface, through which the user sets the correspondence between the attribute labels and data values on the template page and the attributes in the domain model, thereby building the extraction template. Step C sets the batch extraction frequency so that the HTML pages obtained by crawling are batch-extracted once every 8 hours. Step D, page batch extraction, applies the corresponding extraction template to the large number of HTML pages obtained by crawling, converts the semi-structured data in them into structured data, and saves it to a local database.

Description

Web data extraction method based on visual customization of extraction templates

Technical field

The present invention relates to the extraction of data from Web pages and belongs to the field of computer applications. In particular, it relates to a Web data extraction method based on visual customization of extraction templates.

Background technology

With the rapid development of Internet technology, the number of websites and web pages on the Web is growing explosively, making the Web a huge, widely distributed data source. Text, tables, and multimedia files such as pictures and videos are the main forms of Web information. Web data extraction means extracting semantically consistent, structured numerical knowledge from Web data according to certain rules and building a numerical knowledge base that meets users' data query and data analysis needs. Much work in the data extraction field has aimed at automatically converting input Web pages into structured data. Web data extraction is mainly used to generate structured data that is convenient for subsequent analysis and mining, so it is of crucial importance to many Web data analysis and mining applications.

A Web data extraction task can be defined formally in terms of its input and output. The input can be unstructured data, such as free text, or the semi-structured documents that are ubiquitous on the Web.

Because of the above technical requirements, Web page data extraction currently still has the following shortcomings:

1. Because data on the Web is heterogeneous and lacks structure, Web data applications oriented to analysis and mining, such as market intelligence analysis, must spend considerable cost handling Web data sources in different formats.

2. The output of a Web data extraction task can be a relational table with many records or a data object with a complex structure. In some extraction tasks an attribute may be missing, or one attribute in a record may have several values. In addition, when the semi-structured data in a Web page has a non-unique attribute order or spelling errors, the extraction task becomes more complex and difficult.

Summary of the invention

The object of the invention is to solve the above problems by providing a Web data extraction method based on visual customization of extraction templates, which offers visual, user-friendly interaction.

To achieve this object, the invention adopts the following technical solution:

A Web data extraction method based on visual customization of extraction templates comprises the following steps:

A. Template page preprocessing.

B. Visual customization of the extraction template.

C. Setting the page batch extraction frequency.

D. Page batch extraction.

Template page preprocessing is the conversion and display of the template page's source code: the HTML source code of the template page is analyzed, its DOM tree structure is parsed, and the page is converted into XML format and shown in the user interface.

Visual customization of the extraction template means providing a drag-to-select function in the user interface, through which the user sets the correspondence between the attribute labels and data values on the template page and the attributes in the domain model, thereby building the extraction template.

Setting the page batch extraction frequency means that the HTML pages obtained by crawling are batch-extracted once per set interval (for example, every 8 hours).

Page batch extraction means applying the corresponding extraction template to the large number of HTML pages obtained by crawling, converting the semi-structured data in them into structured data, and saving it to a local database.

The conversion and display of the template page source code in step A specifically comprises the following steps:

A1. Analyze the HTML source code of the supplied template page and convert it into a page file that conforms to the XML standard.

A2. Parse the complete Document Object Model (DOM) structure of the page and show it in the user interface.

A3. Without destroying the original structure of the converted page, add the JavaScript control code needed to support marking elements on the page.

A4. Show the XML-format page produced by the above steps in the user interface, so that the user can carry out visual customization of the template on it.
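Step A1 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: it uses the standard-library `html.parser` and `xml.etree.ElementTree` modules, upper-cases tag names to match the XPath style used later in this document, closes void elements such as `<br>` itself, and does not implement every implicit-closing rule of HTML. The names `HtmlToXml` and `html_to_xml` are hypothetical. The JavaScript injection of step A3 is not shown.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HtmlToXml(HTMLParser):
    """Builds a well-formed element tree from (possibly sloppy) HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of currently open (lower-case) tags

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag.upper(), {k: (v or "") for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag.upper())  # close void elements at once
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag, or end tag of a void element
        # Close every element left open inside the one being closed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top.upper())
            if top == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # ignore text outside the root element
            self.builder.data(data)


def html_to_xml(html: str) -> ET.Element:
    """Parse HTML text and return the root of an XML-compatible tree."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    while parser.open_tags:  # close anything the page left open
        parser.builder.end(parser.open_tags.pop().upper())
    return parser.builder.close()


root = html_to_xml("<html><body><div><p>Title<br>line</div>"
                   "<img src=a.png></body></html>")
xml_text = ET.tostring(root, encoding="unicode")
```

The resulting `xml_text` can be parsed by any XML tool, which is what the later XPath-based steps rely on.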

The visual customization of the extraction template in step B specifically comprises the following steps:

B1. After opening the template page, the user drags with the mouse to select a data item to be extracted. The program analyzes the XPath of the selected data item and records it.

B2. If the data item also has a corresponding page label in the page, the user drags to select that label as well. The program records the XPath and the text content of the label and combines them with the XPath of the selected data item to form one extraction rule. If the data item has no corresponding label, nothing needs to be selected.

B3. Based on the domain model, the user selects an attribute label for the extraction rule formed in steps B1 and B2. This label belongs to the previously built domain model and matches the semantics of the data item the rule extracts. The attribute label states the meaning of that data item; in essence, it completes the mapping from the page data item to a column in the data table.

B4. Repeat steps B1 to B3 until all data to be extracted has been marked, then save the resulting set of extraction rules as a page extraction template.
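Step B1 records the XPath of the element the user selected. One way to derive such a positional path is sketched below in Python. It assumes the page has already been converted to XML and loaded with the standard `xml.etree.ElementTree` module; `positional_xpath` is a hypothetical helper name, and the patent does not prescribe how the path is computed.

```python
import xml.etree.ElementTree as ET


def positional_xpath(root: ET.Element, target: ET.Element) -> str:
    """Return an absolute path such as /HTML/BODY[1]/DIV[2]/TD[2] for target."""
    # ElementTree nodes have no parent pointer, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parents[node]
        # 1-based position among siblings that share the same tag name.
        same_tag = [c for c in parent if c.tag == node.tag]
        index = next(i for i, c in enumerate(same_tag, 1) if c is node)
        steps.append(f"{node.tag}[{index}]")
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


doc = ET.fromstring(
    "<HTML><BODY><DIV/><DIV><TABLE><TR>"
    "<TD>Release date</TD><TD>2011-10-09</TD>"
    "</TR></TABLE></DIV></BODY></HTML>"
)
selected = doc.find("BODY/DIV[2]/TABLE/TR/TD[2]")  # stands in for the user's selection
path = positional_xpath(doc, selected)
```

In a real interface the selected element would come from the user's drag selection rather than from a `find` call.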

The page batch extraction in step C specifically comprises the following steps:

C1. Convert the current page to be extracted into a standard XML file.

C2. Extract the required data items using the extraction rules recorded in the extraction template, which are in essence XPath paths.

C3. Save each extracted data item in the database table column that corresponds to the data label of its extraction rule.
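Step C3 can be illustrated with a short Python sketch that writes one extracted record into a database table. It assumes SQLite (via the standard `sqlite3` module) as the local database and a table whose column names equal the data labels of the extraction rules; the table name, column names, and the `save_record` helper are hypothetical.

```python
import sqlite3


def save_record(conn: sqlite3.Connection, table: str, record: dict) -> None:
    """Insert one extracted record; keys of `record` are column names."""
    # Identifiers cannot be bound as SQL parameters, so restrict them
    # to plain identifiers before building the statement.
    if not all(name.isidentifier() for name in [table, *record]):
        raise ValueError("table and column names must be plain identifiers")
    columns = ", ".join(record)
    marks = ", ".join("?" for _ in record)
    conn.execute(
        f"INSERT INTO {table} ({columns}) VALUES ({marks})",
        list(record.values()),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_posting (job_title TEXT, release_date TEXT, salary TEXT)"
)
# A page that lacks the salary item simply leaves that column NULL.
save_record(conn, "job_posting",
            {"job_title": "Engineer", "release_date": "2011-10-09"})
```

Leaving absent attributes as NULL keeps one row per page even when a page omits some data items.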

Step C2 can be further subdivided into the following steps:

C2-1. Select an extraction rule that has not yet been used.

C2-2. If the rule records no page label information, read the text content directly at the XPath of the data item, mark the rule as used, and go to step C2-8. If the rule does record page label information, go to step C2-3.

C2-3. Extract the text at the XPath of the page label. If extraction succeeds, go to step C2-4. If extraction fails, the data item corresponding to this page label may be absent from or displaced in the current page; go to step C2-7.

C2-4. Compare the extracted text with the page label text recorded in the rule. If they match, extract the data at the data item XPath recorded in the rule, mark the rule as used, and go to step C2-8. If they do not match, the data item corresponding to this page label may be absent from or displaced in the current page; go to step C2-5.

C2-5. Check whether the extracted text matches the page label of some other unused extraction rule. If such a rule exists, the text may itself be a page label; go to step C2-6. Otherwise go to step C2-7.

C2-6. From the page label XPath and data item XPath recorded in the matching rule, compute the XPath the data item would have if this text is the page label, and extract the data there. If the extracted data is non-empty, mark the matching rule as used. Go to step C2-7.

C2-7. Starting from the XPath of the original page label, perform an expanded search of the page for that label. If it is not found, the data item for this label is probably absent from the current page. If it is found, compute the XPath of its data item from the page label XPath and data item XPath recorded in the rule, and extract the data. Finally, mark the original rule as used and go to step C2-8.

C2-8. Repeat the above steps until all extraction rules have been used.

Step C2-3 guards against cases where the semi-structured data in a Web page has a non-unique attribute order or spelling errors. The expanded search ensures that no data is lost.
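The control flow of steps C2-1 to C2-8 can be sketched in Python as below. Several points are assumptions made for illustration and are not stated in the patent: a label and its data item are taken to be same-tag siblings whose index offset stays constant, the expanded search of C2-7 is simplified to a scan of the whole page in document order, and rules are identified by a unique attribute name. `Rule`, `select`, `data_near`, and `extract` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    attribute: str                     # domain-model attribute (table column)
    data_xpath: str
    label_text: Optional[str] = None   # None: the rule has no page label
    label_xpath: Optional[str] = None


def select(root, xpath):
    """Element at an absolute positional XPath, or None if it does not exist."""
    head, _, rest = xpath.lstrip("/").partition("/")
    if head.split("[")[0] != root.tag:
        return None
    return root.find(rest) if rest else root


def text_of(el):
    return "".join(el.itertext()).strip() if el is not None else ""


def last_index(xpath):
    return int(re.search(r"\[(\d+)\]$", xpath).group(1))


def data_near(root, label_el, rule):
    """Data element for a label found at a new position (assumed sibling offset)."""
    parents = {c: p for p in root.iter() for c in p}
    parent = parents.get(label_el)
    if parent is None:
        return None
    offset = last_index(rule.data_xpath) - last_index(rule.label_xpath)
    same = [c for c in parent if c.tag == label_el.tag]
    pos = next(i for i, c in enumerate(same) if c is label_el) + offset
    return same[pos] if 0 <= pos < len(same) else None


def extract(root, rules):
    record, used = {}, set()
    for rule in rules:                                  # C2-1, C2-8
        if rule.attribute in used:
            continue
        if rule.label_text is None:                     # C2-2
            value = text_of(select(root, rule.data_xpath))
        else:
            label_el = select(root, rule.label_xpath)   # C2-3
            found = text_of(label_el)
            if found == rule.label_text:                # C2-4
                value = text_of(select(root, rule.data_xpath))
            else:
                other = next((r for r in rules if found
                              and r.attribute not in used
                              and r.label_text == found), None)      # C2-5
                if other is not None:                   # C2-6
                    shifted = text_of(data_near(root, label_el, other))
                    if shifted:
                        record[other.attribute] = shifted
                        used.add(other.attribute)
                moved = next((el for el in root.iter()  # C2-7 (simplified)
                              if text_of(el) == rule.label_text), None)
                value = text_of(data_near(root, moved, rule)) if moved is not None else ""
        if value:
            record[rule.attribute] = value
        used.add(rule.attribute)
    return record


BASE = "/HTML/BODY[1]/TABLE[1]"
RULES = [
    Rule("job_title", BASE + "/TR[1]/TD[1]"),
    Rule("experience", BASE + "/TR[2]/TD[2]", "Experience", BASE + "/TR[2]/TD[1]"),
    Rule("language", BASE + "/TR[2]/TD[4]", "Language", BASE + "/TR[2]/TD[3]"),
    Rule("education", BASE + "/TR[2]/TD[6]", "Education", BASE + "/TR[2]/TD[5]"),
]
# This page has no "Language" cells, so "Education" has moved two cells left.
PAGE = ET.fromstring(
    "<HTML><BODY><TABLE><TR><TD>Engineer</TD></TR>"
    "<TR><TD>Experience</TD><TD>2 years</TD>"
    "<TD>Education</TD><TD>College</TD></TR></TABLE></BODY></HTML>"
)
result = extract(PAGE, RULES)
```

In this sketch a missing data item is simply left out of the returned record, while a displaced one is still picked up through the label match.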

Beneficial effects of the invention:

1. For each data source, the invention uses a visual, user-driven customization method to design a parameterized, configurable wrapper with visual, user-friendly interaction. Extraction is then carried out automatically, according to the wrapper, on the Web pages collected at scale.

2. Because the content and structure of Web pages change frequently, extraction rules that have already been generated can stop working. The invention addresses how to improve the adaptive capability of Web data extraction, so that extraction adjusts automatically to changes in the target pages and the corresponding extraction rules are updated.

3. The data extraction method of the invention is widely applicable and highly accurate, adapts to changes in Web pages, and can greatly improve extraction efficiency.

Description of drawings

Fig. 1 is a flow chart of the Web data extraction method based on visual customization of extraction templates;

Fig. 2 is the template page preprocessing flow;

Fig. 3 is the flow of visual customization of the page extraction template;

Fig. 4 is the overall page extraction flow;

Fig. 5 is the detailed flow of the extraction process;

Fig. 6 is a schematic of a detail page of a website used as the page template;

Fig. 7 is a schematic of the extraction process applied to a web page of the website.

Embodiment

The invention is further described below with reference to the drawings and embodiments.

As shown in Fig. 1, a Web data extraction method based on visual customization of extraction templates comprises the following steps:

A. Template page preprocessing;

B. Visual customization of the extraction template;

C. Setting the page batch extraction frequency;

D. Page batch extraction.

Step A, template page preprocessing, is the conversion and display of the template page's source code: the stored program analyzes the HTML source code of the template page, parses its DOM tree structure, converts it into XML format, and shows it in the user interface on the display.

Step B, visual customization of the extraction template, means providing a drag-to-select function in the user interface, through which the user sets the correspondence between the attribute labels and data values on the template page and the attributes in the domain model, thereby building the extraction template.

Step C, setting the page batch extraction frequency, means that the HTML pages obtained by crawling are batch-extracted once every 8 hours.

Step D, page batch extraction, means applying the corresponding extraction template to the large number of HTML pages obtained by crawling, converting the semi-structured data in them into structured data, and saving it to a local database.
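The fixed extraction interval of step C could be driven by a loop like the following Python sketch. The 8-hour interval comes from the text; everything else is an assumption for illustration: `run_periodically` is a hypothetical helper, `job` stands for the batch extraction of step D, and the `sleep` and `max_runs` parameters exist only so the loop can be exercised without waiting.

```python
import time
from datetime import timedelta


def run_periodically(job, interval=timedelta(hours=8), max_runs=None,
                     sleep=time.sleep):
    """Call job(), wait one interval, and repeat; return the number of runs."""
    runs = 0
    while max_runs is None or runs < max_runs:
        job()
        runs += 1
        # Do not wait after the final run when a run limit is set.
        if max_runs is None or runs < max_runs:
            sleep(interval.total_seconds())
    return runs


# Dry run: record the waits instead of actually sleeping.
waits = []
completed = run_periodically(lambda: None, max_runs=3, sleep=waits.append)
```

A production deployment would more likely hand this job to an operating-system scheduler; the loop only makes the interval explicit.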

As shown in Fig. 2, the conversion and display of the template page source code in step A specifically comprises the following steps:

A1. Analyze the HTML source code of the supplied template page and convert it into a page file that conforms to the XML standard;

A2. Parse the complete DOM structure of the page and show it in the user interface;

A3. Without destroying the original structure of the converted page, add the JavaScript control code needed to support marking elements on the page;

A4. Show the XML-format page produced by the above steps in the user interface, so that the user can carry out visual customization of the template on it.

As shown in Fig. 3, the visual customization of the extraction template in step B specifically comprises the following steps:

B1. After opening the template page shown on the display, the user drags with the mouse to select a data item to be extracted; the program analyzes the XPath of the selected data item and records it;

B2. If the data item also has a corresponding page label in the page, the user drags to select that label as well; the program records the XPath and the text content of the label and combines them with the XPath of the selected data item to form one extraction rule; if the data item has no corresponding label, nothing needs to be selected;

B3. Based on the domain model, the user selects an attribute label for the extraction rule formed in steps B1 and B2; this label belongs to the previously built domain model and matches the semantics of the data item the rule extracts; the attribute label states the meaning of that data item, and in essence it completes the mapping from the page data item to a column in the data table;

B4. Repeat steps B1 to B3 until all data to be extracted has been marked, then save the resulting set of extraction rules as a page extraction template.
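The extraction rule built in steps B1 to B3 and the template saved in step B4 can be modelled with a small data structure, sketched here in Python. The patent does not specify a storage format; JSON, the field names, and the `ExtractionRule`, `save_template`, and `load_template` names are all assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ExtractionRule:
    attribute: str                     # attribute label from the domain model (B3)
    data_xpath: str                    # XPath of the selected data item (B1)
    label_text: Optional[str] = None   # text of the page label, if any (B2)
    label_xpath: Optional[str] = None  # XPath of the page label, if any (B2)


def save_template(rules: List[ExtractionRule]) -> str:
    """Serialize a set of rules as one extraction template (B4)."""
    return json.dumps([asdict(r) for r in rules], ensure_ascii=False, indent=2)


def load_template(text: str) -> List[ExtractionRule]:
    return [ExtractionRule(**item) for item in json.loads(text)]


template = save_template([
    # A rule with a page label next to its data item.
    ExtractionRule("release_date",
                   "/HTML/BODY[1]/TABLE[1]/TR[1]/TD[2]",
                   "Release date",
                   "/HTML/BODY[1]/TABLE[1]/TR[1]/TD[1]"),
    # A rule without a page label: only the data item XPath is recorded.
    ExtractionRule("job_title", "/HTML/BODY[1]/TABLE[1]/TR[3]/TD[2]"),
])
```

Keeping the label text alongside both XPaths is what later lets the extractor notice that a label has moved or disappeared.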

As shown in Fig. 4, the page batch extraction in step C specifically comprises the following steps:

C1. Convert the current page to be extracted into a standard XML file;

C2. Extract the required data items using the extraction rules recorded in the extraction template, which are in essence XPath paths;

As shown in Fig. 5, step C2 specifically comprises the following steps:

C2-1. Select an extraction rule that has not yet been used;

C2-2. If the rule records no page label information, read the text content directly at the XPath of the data item, mark the rule as used, and go to step C2-8; if the rule does record page label information, go to step C2-3;

C2-3. Extract the text at the XPath of the page label; if extraction succeeds, go to step C2-4; if extraction fails, the data item corresponding to this page label is absent from or displaced in the current page, so go to step C2-7;

C2-4. Compare the extracted text with the page label text recorded in the rule; if they match, extract the data at the data item XPath recorded in the rule, mark the rule as used, and go to step C2-8; if they do not match, the data item corresponding to this page label is absent from or displaced in the current page, so go to step C2-5;

C2-5. Check whether the extracted text matches the page label of some other unused extraction rule; if such a rule exists, treat the text as a page label and go to step C2-6; otherwise go to step C2-7;

C2-6. From the page label XPath and data item XPath recorded in the matching rule, compute the XPath the data item would have if this text is the page label, and extract the data there; if the extracted data is non-empty, mark the matching rule as used; go to step C2-7;

C2-7. Starting from the XPath of the original page label, perform an expanded search of the page for that label; if it is not found, the data item for this label is absent from the current page; if it is found, compute the XPath of its data item from the page label XPath and data item XPath recorded in the rule, and extract the data; finally, mark the original rule as used and go to step C2-8;

C2-8. Repeat the above steps until all extraction rules have been used.

Step C2-3 handles cases where the semi-structured data in a Web page has a non-unique attribute order or spelling errors; the expanded search ensures that no data is lost.

In another embodiment of the invention, a certain website is chosen as the data source. A detail page of the site is used as the page template for customizing the extraction template; a screenshot of the page's main data area is shown in Fig. 6.

Suppose the data to be extracted, marked manually by the user, is the part enclosed in rectangular boxes in the figure.

We then obtain the following 9 extraction rules:

1. Data label: position title;
   Page label: (empty);
   Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[3]/TD[2]

2. Data label: recruitment company;
   Page label: (empty);
   Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[1]/TBODY[1]/TR[2]/TD[1]/TABLE[1]/TBODY[1]/TR[1]/TD[1]/STRONG[1]

3. Data label: date issued;
   Page label: date issued;
   Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[1]
   Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[2]

4. Data label: work place;
   Page label: work place;
   Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[3]
   Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[4]

5. Data label: the number of recruits;
   Page label: the number of recruits;
   Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[5]
   Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[6]

6. Data label: working experience;
   Page label: length of service;
   Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[1]
   Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[2]

7. Data label: language requirement;
   Page label: language requirement;
   Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[3]
   Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[4]

8. Data label: educational background;
   Page label: educational requirement;
   Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[5]
   Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[6]

9. Data label: level of salary;
   Page label: salary scope;

Using the extraction template made up of these 9 extraction rules, similar pages from this website can be batch-extracted.

Suppose we extract a page from the same website (Fig. 7):

Two of the data items to be extracted are missing from this page: language requirement and level of salary. Analysis of the page code shows that extraction rules 1 to 6 remain valid and can be applied directly. When rule 7, "language requirement", is applied, the text at the page label's XPath in the current page is "educational background", which does not match the "language requirement" recorded in the rule. The page label "educational background" does exist in extraction rule 8, however, so the data item that follows it, "junior college", is extracted. An expanded search for the page label "language requirement" is then carried out over the page; because the page contains no such label, the search finds nothing. Thus, although the structure of the extracted page differs from that of the page used to create the template, the data on the page is still correctly identified and extracted.
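The relocation in this example, where the education label turns up at the position recorded for the language label, can be worked through with a few lines of Python. The sketch assumes that a label and its data item share a parent and keep the same index offset (TD[5] to TD[6] in rule 8); this is an illustrative assumption, since the patent does not state how the new XPath is computed, and `relocate_data_xpath` is a hypothetical name.

```python
import re

_LAST_STEP = re.compile(r"^(.*)\[(\d+)\]$")


def _split(xpath):
    """Split '/A/B[2]/TD[5]' into ('/A/B[2]/TD', 5)."""
    match = _LAST_STEP.match(xpath)
    if match is None:
        raise ValueError(f"no positional index at the end of {xpath!r}")
    return match.group(1), int(match.group(2))


def relocate_data_xpath(rule_label_xpath, rule_data_xpath, found_label_xpath):
    """Data item XPath for a label that was found at a different position."""
    label_prefix, label_index = _split(rule_label_xpath)
    data_prefix, data_index = _split(rule_data_xpath)
    if label_prefix != data_prefix:
        raise ValueError("label and data item are not same-tag siblings")
    found_prefix, found_index = _split(found_label_xpath)
    # Keep the recorded label-to-data offset at the new position.
    return f"{found_prefix}[{found_index + data_index - label_index}]"


ROW = "/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]"
new_data_xpath = relocate_data_xpath(
    ROW + "/TD[5]",   # rule 8: page label XPath
    ROW + "/TD[6]",   # rule 8: data item XPath
    ROW + "/TD[3]",   # where the education label actually appears (rule 7's label slot)
)
```

Under this assumption the education value is read from TR[2]/TD[4], the cell directly after the relocated label.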

Although specific embodiments of the invention have been described above with reference to the drawings, they do not limit the scope of protection of the invention. Those skilled in the art should understand that various modifications or variations that can be made on the basis of the technical solution of the invention without creative effort still fall within its scope of protection.

Claims

1. A Web data extraction method based on visual customization of extraction templates, characterized in that it comprises the following steps:

A. Template page preprocessing;

B. Visual customization of the extraction template;

C. Setting the page batch extraction frequency;

D. Page batch extraction.

2. The Web data extraction method based on visual customization of extraction templates according to claim 1, characterized in that step A, template page preprocessing, is the conversion and display of the template page's source code: the stored program analyzes the HTML source code of the template page, parses its DOM tree structure, converts it into XML format, and shows it in the user interface on the display.

3. The Web data extraction method based on visual customization of extraction templates according to claim 1, characterized in that step B, visual customization of the extraction template, means providing a drag-to-select function in the user interface, through which the user sets the correspondence between the attribute labels and data values on the template page and the attributes in the domain model, thereby building the extraction template.

4. The Web data extraction method based on visual customization of extraction templates according to claim 1, characterized in that step C, setting the page batch extraction frequency, means that the HTML pages obtained by crawling are batch-extracted once every 8 hours.

5. The Web data extraction method based on visual customization of extraction templates according to claim 1, characterized in that step D, page batch extraction, means applying the corresponding extraction template to the large number of HTML pages obtained by crawling, converting the semi-structured data in them into structured data, and saving it to a local database.

6. The Web data extraction method based on visual customization of extraction templates according to claim 1 or 2, characterized in that the conversion and display of the template page source code in step A specifically comprises the following steps:

7. The Web data extraction method based on visual customization of extraction templates according to claim 1 or 3, characterized in that the visual customization of the extraction template in step B specifically comprises the following steps:

8. The Web data extraction method based on visual customization of extraction templates according to claim 1 or 4, characterized in that the page batch extraction in step C specifically comprises the following steps:

C3. Save each extracted data item in the database table column that corresponds to the data label of its extraction rule.

9. The Web data extraction method based on visual customization of extraction templates according to claim 8, characterized in that step C2 specifically comprises the following steps:

C2-1. Select an extraction rule that has not yet been used;

C2-8. Repeat the above steps until all extraction rules have been used.

10. The Web data extraction method based on visual customization of extraction templates according to claim 9, characterized in that step C2-3 handles cases where the semi-structured data in a Web page has a non-unique attribute order or spelling errors, and the expanded search ensures that no data is lost.