Web data extraction method based on visual customization of extraction template
Info

Publication number
CN102360368A
Authority
CN
China
Prior art keywords
page
data
template
extraction
data item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011103017759A
Other languages
Chinese (zh)
Other versions
CN102360368B (en)
Inventor
李庆忠
闫中敏
彭朝晖
蔡益清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN201110301775.9A
Publication of CN102360368A
Application granted
Publication of CN102360368B
Legal status: Expired - Fee Related
Anticipated expiration

Abstract

Translated from Chinese

The invention discloses a Web data extraction method based on visual customization of extraction templates, comprising the following steps: A. template page preprocessing; B. visual customization of the extraction template; C. setting the page batch extraction frequency; D. page batch extraction. Step A, template page preprocessing, is the conversion and display of the template page source code. Step B, visual customization of the extraction template, means that the user interface provides a drag-and-select function with which the user sets the correspondence between the attribute labels and data values on the template page and the attributes in the domain model, thereby building the extraction template. Step C sets the page batch extraction frequency so that the crawled HTML pages are batch extracted once every 8 hours. Step D, page batch extraction, means that the corresponding extraction template is used to batch extract a large number of crawled HTML pages, converting the semi-structured data in them into structured data that is saved to a local database.

Figure 201110301775

Description

Web data extraction method based on visual customization of extraction templates
Technical field
The present invention relates to the extraction of Web pages and belongs to the field of computer applications. In particular, it relates to a Web data extraction method based on visual customization of extraction templates.
Background art
With the rapid development of Internet technology, the number of websites and web pages on the Web is growing explosively, making the Web a huge and widely distributed data source. Text, tables, and multimedia files such as pictures and videos are the main forms of Web information. Web data extraction means extracting semantically consistent, structured numerical knowledge from Web data according to certain rules and building a repository of numerical knowledge elements, in order to satisfy users' data query and data analysis needs. Much work has been done in the data extraction field on automatically converting input Web pages into structured data. Web data extraction is mainly used to generate structured data that is convenient for subsequent analysis and mining, and it plays a crucial role in many Web data analysis and mining applications.
A Web data extraction task can be formally defined in terms of its input and output. The input can be unstructured data, such as free text, or the semi-structured documents that are ubiquitous on the Web.
Given the above technical requirements, current Web page data extraction still has the following shortcomings:
1. Because data on the Web is heterogeneous and lacks structure, applications that analyze and mine Web data, such as market intelligence analysis, must spend considerable effort handling Web data sources in different formats.
2. The output of a Web data extraction task can be a relational table with many records or a data object with a complex structure. In some tasks, an attribute may be missing, or an attribute may have multiple values within one record. In addition, when the semi-structured data in a Web page has a non-unique attribute order or contains misspellings, the extraction task becomes more complex and difficult.
Summary of the invention
The object of the invention is to solve the above problems by providing a Web data extraction method based on visual customization of extraction templates, which has the advantage of visual, user-friendly interaction.
To achieve this object, the invention adopts the following technical solution:
A Web data extraction method based on visual customization of extraction templates comprises the following steps:
A. Template page preprocessing.
B. Visual customization of the extraction template.
C. Setting the page batch extraction frequency.
D. Page batch extraction.
Template page preprocessing is the conversion and display of the template page source code: the HTML source code of the template page is analyzed, its DOM tree structure is parsed, and the page is converted into XML format and displayed in the user interface;
Visual customization of the extraction template means that the user interface provides a drag-and-select function with which the user sets the correspondence between the attribute labels and data values on the template page and the attributes in the domain model, thereby building the extraction template;
Setting the page batch extraction frequency means that the crawled HTML pages are batch extracted once at a fixed interval (for example, every 8 hours);
Page batch extraction means that the corresponding extraction template is used to batch extract a large number of crawled HTML pages, converting the semi-structured data in them into structured data that is saved to a local database.
The conversion and display of the template page source code in step A specifically comprises the following steps:
A1. Analyze the HTML source code of the supplied template page and convert it into a page file that conforms to the XML standard.
A2. Parse the complete Document Object Model (DOM) structure of the page and display it in the user interface.
A3. Add the necessary JS control code to the converted page, without destroying the original page structure, in order to support page annotation.
A4. Display the XML-format page produced by the above steps in the user interface for the user to perform visual customization of the template.
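Step A1 can be illustrated with a minimal sketch, assuming Python and only its standard library. The names `HtmlToXml` and `html_to_xml` are hypothetical and do not come from the patent. The sketch closes elements that the page author left open and closes void tags such as `<br>` immediately, so the result is well-formed XML with upper-case tag names, matching the XPath notation used in this description. It assumes the page has a single root element and ignores comments and doctype declarations.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag.
VOID_TAGS = {"AREA", "BASE", "BR", "COL", "EMBED", "HR", "IMG",
             "INPUT", "LINK", "META", "SOURCE", "TRACK", "WBR"}


class HtmlToXml(HTMLParser):
    """Builds an ElementTree from loosely written HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of element names that are still open

    def handle_starttag(self, tag, attrs):
        tag = tag.upper()
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)  # void elements are closed at once
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        tag = tag.upper()
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close elements that were left open, then the requested tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # text outside the root element is dropped
            self.builder.data(data)


def html_to_xml(html: str) -> ET.Element:
    """Convert an HTML string into a well-formed XML element tree."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    while parser.open_tags:  # close whatever is still open at end of input
        parser.builder.end(parser.open_tags.pop())
    return parser.builder.close()
```

Calling `ET.tostring(html_to_xml(page_source))` then yields a serialization that any XML parser accepts. A production converter would also need the HTML rules for implicitly closed elements such as `<p>` and `<li>`, which this sketch leaves out.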
The visual customization of the extraction template in step B specifically comprises the following steps:
B1. After opening the template page, the user drags the mouse to select a data item to extract. The program analyzes and records the XPath of the selected data item.
B2. If the data item also has a corresponding page label in the page, the user drags to select that label as well. The program records the XPath and the text content of the label and combines them with the XPath of the selected data item into one extraction rule. If the data item has no corresponding label, nothing further needs to be selected.
B3. Guided by the domain model, the user selects an attribute label for the extraction rule formed in steps B1 and B2. The label belongs to the previously established domain model and matches the semantics of the data item that the rule extracts. It identifies the semantics of that data item; in essence, it maps the page data item to a column in the data table.
B4. Steps B1 to B3 are repeated until all data to be extracted has been marked, and the resulting set of extraction rules is saved as a page extraction template.
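Steps B1 to B3 can be sketched as follows, assuming the template page has already been converted into an `xml.etree.ElementTree` tree. The names `ExtractionRule`, `positional_xpath` and `make_rule` are hypothetical; the patent does not prescribe a data structure. The path format, with an index counted among same-tag siblings on every step except the root, follows the XPaths listed in the embodiment.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractionRule:
    attribute: str                          # attribute chosen from the domain model (B3)
    data_xpath: str                         # XPath of the selected data item (B1)
    page_label_text: Optional[str] = None   # text of the page label, if any (B2)
    page_label_xpath: Optional[str] = None  # XPath of the page label, if any (B2)


def positional_xpath(root: ET.Element, target: ET.Element) -> str:
    """Return a path such as /HTML/BODY[1]/DIV[2] leading from root to target."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        # 1-based position among siblings that share the same tag name
        index = next(i for i, c in enumerate(same_tag, start=1) if c is node)
        steps.append(f"{node.tag}[{index}]")
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


def make_rule(root, attribute, data_elem, label_elem=None) -> ExtractionRule:
    """Combine the selected data item and optional page label into one rule."""
    rule = ExtractionRule(attribute, positional_xpath(root, data_elem))
    if label_elem is not None:
        rule.page_label_text = (label_elem.text or "").strip()
        rule.page_label_xpath = positional_xpath(root, label_elem)
    return rule
```

In an interactive tool the two elements would come from the user's drag selection; here they are passed in directly. A template is then simply the list of rules collected in step B4.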
The page batch extraction in said step C specifically comprises the following steps:
C1. Convert the current page to be extracted into a standard XML file.
C2. Use the extraction rules recorded in the extraction template, which are essentially XPaths, to extract the required data items.
C3. According to the data label of each extraction rule, save the extracted data items into the corresponding columns of the database table.
Step C2 can be further divided into the following steps:
C2-1. Select an extraction rule that has not yet been used.
C2-2. If the rule records no page label information, read the text content directly at the XPath of the data item, mark the rule as used, and go to step C2-8. If the rule records page label information, go to step C2-3.
C2-3. Extract the text at the XPath recorded for the page label. If extraction succeeds, go to step C2-4. If it fails, the data item corresponding to this page label may be missing or displaced in the current page; go to step C2-7.
C2-4. Compare the extracted text with the page label text recorded in the rule. If they match, extract the data at the XPath recorded for the data item, mark the rule as used, and go to step C2-8. If they do not match, the data item corresponding to this page label may be missing or displaced in the current page; go to step C2-5.
C2-5. Check whether the text matches the page label of some unused extraction rule. If such a rule exists, the text may itself be a page label; go to step C2-6. Otherwise go to step C2-7.
C2-6. From the page label XPath and the data item XPath recorded in the extraction rule, compute the XPath that the data item would have if this text were its page label, and extract the data. If the extracted data is non-empty, mark the corresponding extraction rule as used. Go to step C2-7.
C2-7. Starting from the XPath of the original page label, perform an expanded search in the page for this page label. If it is not found, the data item corresponding to the label is probably missing from the current page. If it is found, compute the XPath of the corresponding data item from the page label XPath and the data item XPath recorded in the rule, and extract the data. Finally, mark the original rule as used and go to step C2-8.
C2-8. Repeat the above steps until all extraction rules have been used.
Step C2-3 guards against the case where the semi-structured data in a Web page has a non-unique attribute order or contains misspellings. The expanded search ensures that no data is lost.
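The direct path through this flow (C2-1 to C2-4 and C2-8) can be sketched as below. This is a simplified illustration, not the patent's implementation: rules whose page label is not found where expected are only collected in a list, standing in for the fallback branches C2-5 to C2-7. `Rule`, `text_at` and `apply_rules` are hypothetical names, and the helper `text_at` exists because `xml.etree.ElementTree` does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    data_label: str
    data_xpath: str
    page_label_text: Optional[str] = None
    page_label_xpath: Optional[str] = None


def text_at(root: ET.Element, xpath: str) -> Optional[str]:
    """Return the text at an absolute positional path, or None if it is absent."""
    first, _, rest = xpath.lstrip("/").partition("/")
    if first.split("[")[0] != root.tag:
        return None
    node = root.find(rest) if rest else root
    if node is None:
        return None
    return "".join(node.itertext()).strip()


def apply_rules(root: ET.Element, rules):
    """Apply each rule once; return extracted values and rules needing fallback."""
    values, needs_fallback = {}, []
    for rule in rules:                                # C2-1 and C2-8
        if rule.page_label_xpath is None:             # C2-2: no page label recorded
            values[rule.data_label] = text_at(root, rule.data_xpath)
            continue
        found = text_at(root, rule.page_label_xpath)  # C2-3
        if found == rule.page_label_text:             # C2-4: label is where expected
            values[rule.data_label] = text_at(root, rule.data_xpath)
        else:
            # The label is missing or displaced; C2-5 to C2-7 would run here.
            needs_fallback.append(rule.data_label)
    return values, needs_fallback
```

Checking the label text before trusting the data XPath is what lets the method notice that a page deviates from the template instead of silently storing a value under the wrong attribute.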
Beneficial effects of the invention:
1. For each data source, the invention uses a visual user customization method to design a parameterized, configurable wrapper with visual, user-friendly interaction, and automatically extracts the large numbers of collected Web pages according to that wrapper.
2. Because the content and structure of Web pages change frequently, previously generated extraction rules can become invalid. The invention addresses how to improve the adaptability of Web data extraction, so that it can adjust automatically to changes in the target pages and update the corresponding extraction rules.
3. The data extraction method of the invention is widely applicable and highly precise, adapts to changes in web pages, and can greatly improve extraction efficiency.
Description of drawings
Fig. 1 is a flowchart of the Web data extraction method based on visual customization of extraction templates;
Fig. 2 is a flowchart of template page preprocessing;
Fig. 3 is a flowchart of the visual customization of the page extraction template;
Fig. 4 is the overall flowchart of page extraction;
Fig. 5 is the detailed flowchart of the extraction process;
Fig. 6 is a schematic diagram of a detail page of a website used as the page template;
Fig. 7 is a schematic diagram of the extraction process applied to a web page of the website.
Embodiment
The invention is further described below with reference to the drawings and embodiments.
As shown in Fig. 1, a Web data extraction method based on visual customization of extraction templates comprises the following steps:
A. Template page preprocessing;
B. Visual customization of the extraction template;
C. Setting the page batch extraction frequency;
D. Page batch extraction.
Step A, template page preprocessing, is the conversion and display of the template page source code: the HTML source code of the template page held in memory is analyzed, its DOM tree structure is parsed, and the page is converted into XML format and shown in the user interface on the display.
Step B, visual customization of the extraction template, means that the user interface provides a drag-and-select function with which the user sets the correspondence between the attribute labels and data values on the template page and the attributes in the domain model, thereby building the extraction template.
Step C, setting the page batch extraction frequency, means that the crawled HTML pages are batch extracted once every 8 hours.
Step D, page batch extraction, means that the corresponding extraction template is used to batch extract a large number of crawled HTML pages, converting the semi-structured data in them into structured data that is saved to a local database.
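The fixed interval of step C can be pictured with a small sketch, assuming Python's standard `datetime` module. The function name `extraction_times` and its parameters are hypothetical; the patent only states the interval, not how runs are scheduled.

```python
from datetime import datetime, timedelta
from typing import Iterator


def extraction_times(start: datetime, interval_hours: float = 8) -> Iterator[datetime]:
    """Yield the start time of each batch extraction run, one per interval."""
    step = timedelta(hours=interval_hours)
    current = start
    while True:
        yield current
        current += step
```

A scheduler loop could take the next value from this generator, wait until that time, and then run step D over all pages crawled since the previous run.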
As shown in Fig. 2, the conversion and display of the template page source code in step A specifically comprises the following steps:
A1. Analyze the HTML source code of the supplied template page and convert it into a page file that conforms to the XML standard;
A2. Parse the complete DOM structure of the page and display it in the user interface;
A3. Add the necessary JS control code to the converted page, while preserving the original page structure, in order to support page annotation;
A4. Display the XML-format page produced by the above steps in the user interface for the user to perform visual customization of the template.
As shown in Fig. 3, the visual customization of the extraction template in step B specifically comprises the following steps:
B1. After opening the template page shown on the display, the user drags the mouse to select a data item to extract. The program analyzes and records the XPath of the selected data item;
B2. If the data item also has a corresponding page label in the page, the user drags to select that label as well. The program records the XPath and the text content of the label and combines them with the XPath of the selected data item into one extraction rule. If the data item has no corresponding label, nothing further needs to be selected;
B3. Guided by the domain model, the user selects an attribute label for the extraction rule formed in steps B1 and B2. The label belongs to the previously established domain model and matches the semantics of the data item that the rule extracts. It identifies the semantics of that data item; in essence, it maps the page data item to a column in the data table;
B4. Steps B1 to B3 are repeated until all data to be extracted has been marked, and the resulting set of extraction rules is saved as a page extraction template.
As shown in Fig. 4, the page batch extraction in said step C specifically comprises the following steps:
C1. Convert the current page to be extracted into a standard XML file;
C2. Use the extraction rules recorded in the extraction template, which are essentially XPaths, to extract the required data items;
C3. According to the data label of each extraction rule, save the extracted data items into the corresponding columns of the database table.
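Step C3 can be sketched with Python's built-in `sqlite3` module standing in for the local database. The function `save_record`, the table name and the column names are hypothetical; the source does not name a database product or a schema. Each data label becomes a column, and a data item that is missing from the page is stored as NULL.

```python
import sqlite3


def save_record(conn: sqlite3.Connection, table: str, record: dict) -> None:
    """Store one extracted page as a row; data labels are used as column names."""
    columns = list(record)
    # Identifiers cannot be bound as SQL parameters, so validate them first.
    for name in [table, *columns]:
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    column_defs = ", ".join(f"{c} TEXT" for c in columns)
    # The table is created from the first record; later records must use
    # a subset of the same columns in this simplified sketch.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({column_defs})")
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})",
        [record[c] for c in columns],
    )
    conn.commit()
```

For example, `save_record(conn, "job_posting", {"job_title": "Engineer", "language_requirement": None})` writes one row whose second column is NULL.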
As shown in Fig. 5, said step C2 specifically comprises the following steps:
C2-1. Select an extraction rule that has not yet been used;
C2-2. If the rule records no page label information, read the text content directly at the XPath of the data item, mark the rule as used, and go to step C2-8; if the rule records page label information, go to step C2-3;
C2-3. Extract the text at the XPath recorded for the page label; if extraction succeeds, go to step C2-4; if it fails, the data item corresponding to this page label is missing or displaced in the current page, so go to step C2-7;
C2-4. Compare the extracted text with the page label text recorded in the rule; if they match, extract the data at the XPath recorded for the data item, mark the rule as used, and go to step C2-8; if they do not match, the data item corresponding to this page label is missing or displaced in the current page, so go to step C2-5;
C2-5. Check whether the text matches the page label of some unused extraction rule; if such a rule exists, the text is treated as a page label and the flow goes to step C2-6; otherwise go to step C2-7;
C2-6. From the page label XPath and the data item XPath recorded in the extraction rule, compute the XPath that the data item would have if this text were its page label, and extract the data; if the extracted data is non-empty, mark the corresponding extraction rule as used; go to step C2-7;
C2-7. Starting from the XPath of the original page label, perform an expanded search in the page for this page label; if it is not found, the data item corresponding to the label is missing from the current page; if it is found, compute the XPath of the corresponding data item from the page label XPath and the data item XPath recorded in the rule, and extract the data; finally, mark the original rule as used and go to step C2-8;
C2-8. Repeat the above steps until all extraction rules have been used.
Said step C2-3 handles the case where the semi-structured data in a Web page has a non-unique attribute order or contains misspellings; the expanded search ensures that no data is lost.
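The text does not spell out how step C2-6 computes the new data item XPath. One plausible reading, shown here purely as an assumption, is that the label and its data item are sibling cells, so the index offset between them in the rule can be applied to the position where the label was actually found. The function names `split_steps` and `relocate_data_xpath` are hypothetical.

```python
import re

STEP = re.compile(r"^(?P<tag>[^\[\]/]+)(?:\[(?P<idx>\d+)\])?$")


def split_steps(xpath: str):
    """Split /HTML/BODY[1]/TD[2] into [("HTML", None), ("BODY", 1), ("TD", 2)]."""
    steps = []
    for part in xpath.strip("/").split("/"):
        match = STEP.match(part)
        if match is None:
            raise ValueError(f"unsupported step: {part!r}")
        idx = match["idx"]
        steps.append((match["tag"], int(idx) if idx else None))
    return steps


def relocate_data_xpath(label_xpath: str, data_xpath: str, found_label_xpath: str) -> str:
    """Apply the rule's label-to-data offset at the place where the label was found.

    Assumption: label and data are siblings with the same tag name, so only
    the index of the last path step differs between them.
    """
    label, data, found = (split_steps(p) for p in (label_xpath, data_xpath, found_label_xpath))
    if label[:-1] != data[:-1] or label[-1][0] != data[-1][0]:
        raise ValueError("sketch only handles sibling label/data elements")
    offset = (data[-1][1] or 1) - (label[-1][1] or 1)
    tag, idx = found[-1]
    moved = found[:-1] + [(tag, (idx or 1) + offset)]
    return "/" + "/".join(t if i is None else f"{t}[{i}]" for t, i in moved)
```

With a rule whose label sits in `TD[5]` and whose data sits in `TD[6]`, a label found in `TD[3]` of the same row gives a data item path ending in `TD[4]`. Layouts where label and data are not siblings would need a more general relative-path computation.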
In another embodiment of the invention, a website is selected as the data source. A detail page of the site is used as the page template for customizing the extraction template; a screenshot of the main data region of the page is shown in Fig. 6.
Suppose the data to be extracted, marked manually by the user, is the part enclosed by rectangles in the figure.
The following 9 extraction rules are then obtained:
1. Data label: position title
Page label: none
Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[3]/TD[2]
2. Data label: recruitment company
Page label: none
Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[1]/TBODY[1]/TR[2]/TD[1]/TABLE[1]/TBODY[1]/TR[1]/TD[1]/STRONG[1]
3. Data label: date issued
Page label: date issued
Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[1]
Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[2]
4. Data label: work place
Page label: work place
Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[3]
Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[4]
5. Data label: number of recruits
Page label: number of recruits
Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[5]
Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[6]
6. Data label: working experience
Page label: length of service
Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[1]
Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[2]
7. Data label: language requirement
Page label: language requirement
Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[3]
Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[4]
8. Data label: educational background
Page label: educational requirement
Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[5]
Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[6]
9. Data label: level of salary
Page label: salary scope
Page label XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[5]
Data item XPath: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[6]
The extraction template formed from these 9 extraction rules can be used to batch extract similar web pages from this website.
Suppose another web page of the same website is extracted (Fig. 7):
This page lacks 2 of the data items to be extracted: language requirement and level of salary. Analysis of the page code shows that extraction rules 1 to 6 remain valid and can be applied directly. When rule 7, "language requirement", is applied, the text found at the page label XPath in the current page is the educational background label, which does not match the "language requirement" label recorded in the rule. However, the educational background page label exists in extraction rule 8, so the data item that follows it, "junior college", is extracted. An expanded search for the page label "language requirement" is then carried out from the page root; because the page does not contain this label, the search finds nothing. In this way, although the structure of the extracted page differs from that of the page used to create the template, the data on the page is still correctly identified and extracted.
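The expanded search in this example can be sketched as follows. The source does not define how the search widens, so the strategy here is an assumption: start at the element where the label was expected and look through progressively larger enclosing subtrees until the page root has been searched. The function name `expanded_search` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def expanded_search(root: ET.Element, start: ET.Element, label_text: str) -> Optional[ET.Element]:
    """Look for an element whose text equals label_text, widening the scope
    from start to each of its ancestors in turn; return None if absent."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    scope = start
    while scope is not None:
        for node in scope.iter():
            if (node.text or "").strip() == label_text:
                return node
        scope = parent_of.get(scope)  # the root has no parent, which ends the loop
    return None
```

On the page of Fig. 7 such a search for "language requirement" would reach the page root without a match and return `None`, which corresponds to treating the data item as missing.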
Although specific embodiments of the invention have been described above with reference to the drawings, they do not limit the scope of protection of the invention. Those skilled in the art should understand that, on the basis of the technical solution of the invention, any modifications or variations that they can make without creative effort remain within the scope of protection of the invention.

Claims (10)

1. A Web data extraction method based on visual customization of extraction templates, characterized in that it comprises the following steps:
A. template page preprocessing;
B. visual customization of the extraction template;
C. setting the page batch extraction frequency;
D. page batch extraction.
2. The Web data extraction method based on visual customization of extraction templates according to claim 1, characterized in that said step A, template page preprocessing, is the conversion and display of the template page source code: the HTML source code of the template page held in memory is analyzed, its DOM tree structure is parsed, and the page is converted into XML format and shown in the user interface on the display.
3. The Web data extraction method based on visual customization of extraction templates according to claim 1, characterized in that said step B, visual customization of the extraction template, means that the user interface provides a drag-and-select function with which the user sets the correspondence between the attribute labels and data values on the template page and the attributes in the domain model, thereby building the extraction template.
4. The Web data extraction method based on visual customization of extraction templates according to claim 1, characterized in that said step C, setting the page batch extraction frequency, means that the crawled HTML pages are batch extracted once every 8 hours.
5. The Web data extraction method based on visual customization of extraction templates according to claim 1, characterized in that said step D, page batch extraction, means that the corresponding extraction template is used to batch extract a large number of crawled HTML pages, converting the semi-structured data in them into structured data that is saved to a local database.
6. The Web data extraction method based on visual customization of extraction templates according to claim 1 or 2, characterized in that the conversion and display of the template page source code in said step A specifically comprises the following steps:
A1. analyzing the HTML source code of the supplied template page and converting it into a page file that conforms to the XML standard;
A2. parsing the complete DOM structure of the page and displaying it in the user interface;
A3. adding the necessary JS control code to the converted page, while preserving the original page structure, in order to support page annotation;
A4. displaying the XML-format page produced by the above steps in the user interface for the user to perform visual customization of the template.
7. The Web data extraction method based on visual customization of extraction templates according to claim 1 or 3, characterized in that the visual customization of the extraction template in said step B specifically comprises the following steps:
B1. after opening the template page shown on the display, the user drags the mouse to select a data item to extract, and the program analyzes and records the XPath of the selected data item;
B2. if the data item also has a corresponding page label in the page, the user drags to select that label as well, and the program records the XPath and the text content of the label and combines them with the XPath of the selected data item into one extraction rule; if the data item has no corresponding label, nothing further needs to be selected;
B3. guided by the domain model, the user selects an attribute label for the extraction rule formed in steps B2 and B3; the label belongs to the previously established domain model and matches the semantics of the data item that the rule extracts; it identifies the semantics of that data item and, in essence, maps the page data item to a column in the data table;
B4. steps B2 to B4 are repeated until all data to be extracted has been marked, and the resulting set of extraction rules is saved as a page extraction template.
8. The Web data extraction method based on visual customization of extraction templates according to claim 1 or 4, characterized in that the visual customization of the extraction template in said step C specifically comprises the following steps:
C1. converting the current page to be extracted into a standard XML file;
C2. using the extraction rules recorded in the extraction template, which are essentially XPaths, to extract the required data items;
C3. according to the data label of each extraction rule, saving the extracted data items into the corresponding columns of the database table.
9. like claims 8 described Web data pick-up methods, it is characterized in that said step C2 specifically may further comprise the steps based on the visual customization of extraction template:
C2-1 selects an also original decimation rule;
C2-2 then directly reads out corresponding content of text according to the corresponding XPATH path of data item if this decimation rule does not write down the corresponding page label information, and this decimation rule is labeled as uses, and forwards step C2-8 to; If this decimation rule has record corresponding page label information, forward step C2-3 to;
C2-3 extracts corresponding text according to the corresponding XPATH path of this page-tag; If extract successfully, forward step C2-4 to; If extract failure, explain then that in current page the corresponding data item of this page-tag exists by situation default or displacement, then forwards step C2-7 to;
C2-4 compares the page-tag text that writes down in the text that extracts and this decimation rule; If coupling, the XPATH according to data recorded item in the decimation rule extracts corresponding data, and this decimation rule is labeled as uses, and forwards step C2-8 to; If do not match, explain then that in current page the corresponding data item of this page-tag exists by situation default or displacement, then forwards step C2-5 to;
Whether the C2-5 inspection text matees the page-tag in the original decimation rule of certain bar; If there is corresponding decimation rule, then this text will forward step C2-6 to as a page-tag, otherwise forward step C2-7 to;
C2-6 calculates when this text is page-tag the XPATH of corresponding data item according to taking out the page-tag that writes down in the rule and the XPATH of data item; And extraction corresponding data; If extract the data non-NULL, then the decimation rule of correspondence is labeled as and uses, forward step C2-7 to;
C2-7 carries out expanded search according to the XPATH path of original page-tag in the page, seek this page-tag; If finally do not find, explain that then the data item that exists in this label correspondence in the current page is by default situation; If find, then, calculate the XPATH of this page-tag corresponding data item, the extraction corresponding data according to taking out the page-tag and the XPATH of data item that writes down in the rule; At last former decimation rule is labeled as and uses, forward step C2-8 to;
C2-8 repeats above step, all is used up to all decimation rules.
10. like claims 9 described Web data pick-up methods based on the visual customization of extraction template; It is characterized in that; Said step C2-3 is for realizing that the situation of the not unique or misspelling of attribute order appears in semi-structured data in the Web page, guarantees can not occur the situation of loss of data through an expanded search.
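The rule-application loop of steps C2-1 to C2-8 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `Rule` class, the helper names, and the sample page are all hypothetical; the page is assumed to be well-formed XML so that Python's standard `xml.etree.ElementTree` (which supports only a subset of XPath) can be used; a label and its data item are assumed to share the same parent element; and the "expanded search" of C2-7 is simplified to a scan of the whole page.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:                      # hypothetical shape of one extraction rule
    data_label: str              # name of the target database column
    data_xpath: str              # XPath of the data item in the template page
    label_xpath: Optional[str] = None   # XPath of the page label, if recorded
    label_text: Optional[str] = None    # text of the page label, if recorded
    used: bool = False

def text_at(root, xpath):
    """Return the stripped text at xpath, or None if missing or empty."""
    node = root.find(xpath)
    if node is None or node.text is None:
        return None
    return node.text.strip() or None

def value_beside(label_node, rule, parents):
    """Assumption: data item sits under the same parent as its label, so the
    last step of the recorded data XPath is re-applied to the label's parent."""
    last_step = rule.data_xpath.rsplit("/", 1)[1]
    node = parents[label_node].find(last_step)
    return node.text.strip() if node is not None and node.text else None

def find_label(root, label_text):
    """Simplified expanded search: scan every element for the label text."""
    for node in root.iter():
        if node.text and node.text.strip() == label_text:
            return node
    return None

def extract(root, rules):
    parents = {child: parent for parent in root.iter() for child in parent}
    record = {}
    for rule in rules:                                   # C2-1 and C2-8
        if rule.used:
            continue
        if rule.label_xpath is None:                     # C2-2: no label info
            record[rule.data_label] = text_at(root, rule.data_xpath)
            rule.used = True
            continue
        found = text_at(root, rule.label_xpath)          # C2-3
        if found == rule.label_text:                     # C2-4: label matches
            record[rule.data_label] = text_at(root, rule.data_xpath)
            rule.used = True
            continue
        if found is not None:                            # C2-5: other rule's label?
            other = next((r for r in rules if not r.used and r is not rule
                          and r.label_text == found), None)
            if other is not None:                        # C2-6
                value = value_beside(root.find(rule.label_xpath), other, parents)
                if value:
                    record[other.data_label] = value
                    other.used = True
        label_node = find_label(root, rule.label_text)   # C2-7
        record[rule.data_label] = (value_beside(label_node, rule, parents)
                                   if label_node is not None else None)
        rule.used = True
    return record

# Hypothetical page in which the "Age" row of the template is absent,
# so "City" has moved up into the position the template recorded for "Age".
PAGE = """<table>
  <caption>Profile</caption>
  <tr><td>Name</td><td>Bob</td></tr>
  <tr><td>City</td><td>Paris</td></tr>
</table>"""
RULES = [
    Rule("title", "./caption"),
    Rule("name", "./tr[1]/td[2]", "./tr[1]/td[1]", "Name"),
    Rule("age", "./tr[2]/td[2]", "./tr[2]/td[1]", "Age"),
    Rule("city", "./tr[3]/td[2]", "./tr[3]/td[1]", "City"),
]
record = extract(ET.fromstring(PAGE), RULES)
print(record)
```

In this sketch the rule for "age" finds the label "City" where "Age" was expected, so the displaced "city" value is picked up through step C2-6, and "age" is recorded as absent after the search in step C2-7 fails.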
CN201110301775.9A | 2011-10-09 | 2011-10-09 | Web data extraction method based on visual customization of extraction template | Expired - Fee Related | CN102360368B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201110301775.9A, CN102360368B (en) | 2011-10-09 | 2011-10-09 | Web data extraction method based on visual customization of extraction template

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201110301775.9A, CN102360368B (en) | 2011-10-09 | 2011-10-09 | Web data extraction method based on visual customization of extraction template

Publications (2)

Publication Number | Publication Date
CN102360368A (en) | 2012-02-22
CN102360368B (en) | 2014-07-02

Family

ID=45585697

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201110301775.9A (Expired - Fee Related, CN102360368B (en)) | Web data extraction method based on visual customization of extraction template | 2011-10-09 | 2011-10-09

Country Status (1)

Country | Link
CN (1) | CN102360368B (en)

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN103020189A (en)* | 2012-12-03 | 2013-04-03 | 深圳中兴网信科技有限公司 | Data processing device and method
CN103116448A (en)* | 2013-01-30 | 2013-05-22 | 浪潮电子信息产业股份有限公司 | Extract method for visualizing information
CN104182412A (en)* | 2013-05-24 | 2014-12-03 | 中国移动通信集团安徽有限公司 | Webpage crawling method and webpage crawling system
CN104350493A (en)* | 2012-06-08 | 2015-02-11 | 微软公司 | Transforming data into consumable content
CN105447184A (en)* | 2015-12-15 | 2016-03-30 | 北京百分点信息科技有限公司 | Information capturing method and device
CN106021485A (en)* | 2016-05-19 | 2016-10-12 | 中国传媒大学 | Multi-element attribute movie data visualization system
CN106202348A (en)* | 2016-07-04 | 2016-12-07 | 中山大学 | A kind of web page form information extraction method
US9595298B2 (en) | 2012-07-18 | 2017-03-14 | Microsoft Technology Licensing, Llc | Transforming data to create layouts
CN106648677A (en)* | 2016-12-28 | 2017-05-10 | 中国科学院南京地理与湖泊研究所 | Visualized customization method for integrated template of water environment area model
CN106980921A (en)* | 2017-03-02 | 2017-07-25 | 上海歌略软件科技有限公司 | A kind of self-defined risk analysis method
CN107437158A (en)* | 2016-05-26 | 2017-12-05 | 北京京东尚科信息技术有限公司 | Data query method and apparatus based on browser plug-in
CN107609144A (en)* | 2017-09-21 | 2018-01-19 | 浪潮软件股份有限公司 | A kind of analysis result processing method, apparatus and system
CN107608949A (en)* | 2017-10-16 | 2018-01-19 | 北京神州泰岳软件股份有限公司 | A kind of Text Information Extraction method and device based on semantic model
CN108121743A (en)* | 2016-11-30 | 2018-06-05 | 中移(苏州)软件技术有限公司 | A kind of generation of generic web pages masterplate and application method, system
CN108334634A (en)* | 2018-02-27 | 2018-07-27 | 北京中关村科金技术有限公司 | A kind of method, apparatus, equipment and the storage medium of extraction data information
CN108984683A (en)* | 2018-06-29 | 2018-12-11 | 北京百度网讯科技有限公司 | Extracting method, system, equipment and the storage medium of structural data
CN109753596A (en)* | 2018-12-29 | 2019-05-14 | 中国科学院计算技术研究所 | Source management and configuration method and system for large-scale network data collection
US10380228B2 (en) | 2017-02-10 | 2019-08-13 | Microsoft Technology Licensing, Llc | Output generation based on semantic expressions
CN110309364A (en)* | 2018-03-02 | 2019-10-08 | 腾讯科技(深圳)有限公司 | A kind of information extraction method and device
TWI682287B (en)* | 2018-10-25 | 2020-01-11 | 財團法人資訊工業策進會 | Knowledge graph generating apparatus, method, and computer program product thereof
CN111782737A (en)* | 2020-08-12 | 2020-10-16 | 中国工商银行股份有限公司 | Information processing method, device, equipment and storage medium
CN112199960A (en)* | 2020-11-12 | 2021-01-08 | 北京三维天地科技股份有限公司 | Standard knowledge element granularity analysis system
CN116628303A (en)* | 2023-04-26 | 2023-08-22 | 中国科学院信息工程研究所 | A method and system for extracting attribute values of semi-structured web pages based on hint learning
CN116701741A (en)* | 2023-06-19 | 2023-09-05 | 鼎富智能科技有限公司 | Website data acquisition method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN1588371A (en)* | 2004-09-08 | 2005-03-02 | 孟小峰 | Forming method for package device
CN101582075A (en)* | 2009-06-24 | 2009-11-18 | 大连海事大学 | Web information extraction system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN1588371A (en)* | 2004-09-08 | 2005-03-02 | 孟小峰 | Forming method for package device
CN101582075A (en)* | 2009-06-24 | 2009-11-18 | 大连海事大学 | Web information extraction system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李 朝等: "基于DOM 树的可适应性Web 信息抽取", 《计算机科学》*
郝爱峰: "网页结构化信息抽取技术方法研究", 《山西电子技术》*

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN104350493A (en)* | 2012-06-08 | 2015-02-11 | 微软公司 | Transforming data into consumable content
CN104350493B (en)* | 2012-06-08 | 2017-08-25 | 微软技术许可有限责任公司 | Transform data into consumable content
US9595298B2 (en) | 2012-07-18 | 2017-03-14 | Microsoft Technology Licensing, Llc | Transforming data to create layouts
CN103020189A (en)* | 2012-12-03 | 2013-04-03 | 深圳中兴网信科技有限公司 | Data processing device and method
CN103020189B (en)* | 2012-12-03 | 2016-08-10 | 深圳中兴网信科技有限公司 | Data processing equipment and data processing method
CN103116448A (en)* | 2013-01-30 | 2013-05-22 | 浪潮电子信息产业股份有限公司 | Extract method for visualizing information
CN104182412A (en)* | 2013-05-24 | 2014-12-03 | 中国移动通信集团安徽有限公司 | Webpage crawling method and webpage crawling system
CN104182412B (en)* | 2013-05-24 | 2017-08-04 | 中国移动通信集团安徽有限公司 | A web crawling method and system
CN105447184B (en)* | 2015-12-15 | 2019-06-11 | 北京百分点信息科技有限公司 | Information capture method and device
CN105447184A (en)* | 2015-12-15 | 2016-03-30 | 北京百分点信息科技有限公司 | Information capturing method and device
CN106021485B (en)* | 2016-05-19 | 2019-05-14 | 中国传媒大学 | A kind of polynary attribute cinematic data visualization system
CN106021485A (en)* | 2016-05-19 | 2016-10-12 | 中国传媒大学 | Multi-element attribute movie data visualization system
CN107437158B (en)* | 2016-05-26 | 2021-08-10 | 北京京东尚科信息技术有限公司 | Data query method, device and computer readable storage medium
CN107437158A (en)* | 2016-05-26 | 2017-12-05 | 北京京东尚科信息技术有限公司 | Data query method and apparatus based on browser plug-in
CN106202348A (en)* | 2016-07-04 | 2016-12-07 | 中山大学 | A kind of web page form information extraction method
CN108121743A (en)* | 2016-11-30 | 2018-06-05 | 中移(苏州)软件技术有限公司 | A kind of generation of generic web pages masterplate and application method, system
CN106648677B (en)* | 2016-12-28 | 2019-08-02 | 中国科学院南京地理与湖泊研究所 | A kind of water environment domain model integrates the visible customization method of template
CN106648677A (en)* | 2016-12-28 | 2017-05-10 | 中国科学院南京地理与湖泊研究所 | Visualized customization method for integrated template of water environment area model
US10380228B2 (en) | 2017-02-10 | 2019-08-13 | Microsoft Technology Licensing, Llc | Output generation based on semantic expressions
CN106980921A (en)* | 2017-03-02 | 2017-07-25 | 上海歌略软件科技有限公司 | A kind of self-defined risk analysis method
CN107609144A (en)* | 2017-09-21 | 2018-01-19 | 浪潮软件股份有限公司 | A kind of analysis result processing method, apparatus and system
CN107608949B (en)* | 2017-10-16 | 2019-04-16 | 北京神州泰岳软件股份有限公司 | A kind of Text Information Extraction method and device based on semantic model
CN107608949A (en)* | 2017-10-16 | 2018-01-19 | 北京神州泰岳软件股份有限公司 | A kind of Text Information Extraction method and device based on semantic model
CN108334634A (en)* | 2018-02-27 | 2018-07-27 | 北京中关村科金技术有限公司 | A kind of method, apparatus, equipment and the storage medium of extraction data information
CN110309364B (en)* | 2018-03-02 | 2023-03-28 | 腾讯科技(深圳)有限公司 | Information extraction method and device
CN110309364A (en)* | 2018-03-02 | 2019-10-08 | 腾讯科技(深圳)有限公司 | A kind of information extraction method and device
CN108984683A (en)* | 2018-06-29 | 2018-12-11 | 北京百度网讯科技有限公司 | Extracting method, system, equipment and the storage medium of structural data
TWI682287B (en)* | 2018-10-25 | 2020-01-11 | 財團法人資訊工業策進會 | Knowledge graph generating apparatus, method, and computer program product thereof
US11250035B2 (en) | 2018-10-25 | 2022-02-15 | Institute For Information Industry | Knowledge graph generating apparatus, method, and non-transitory computer readable storage medium thereof
CN109753596B (en)* | 2018-12-29 | 2021-05-25 | 中国科学院计算技术研究所 | Information source management and configuration method and system for large-scale network data acquisition
CN109753596A (en)* | 2018-12-29 | 2019-05-14 | 中国科学院计算技术研究所 | Source management and configuration method and system for large-scale network data collection
CN111782737A (en)* | 2020-08-12 | 2020-10-16 | 中国工商银行股份有限公司 | Information processing method, device, equipment and storage medium
CN111782737B (en)* | 2020-08-12 | 2024-05-28 | 中国工商银行股份有限公司 | Information processing method, device, equipment and storage medium
CN112199960A (en)* | 2020-11-12 | 2021-01-08 | 北京三维天地科技股份有限公司 | Standard knowledge element granularity analysis system
CN112199960B (en)* | 2020-11-12 | 2021-05-25 | 北京三维天地科技股份有限公司 | Standard knowledge element granularity analysis system
CN116628303A (en)* | 2023-04-26 | 2023-08-22 | 中国科学院信息工程研究所 | A method and system for extracting attribute values of semi-structured web pages based on hint learning
CN116628303B (en)* | 2023-04-26 | 2025-03-14 | 中国科学院信息工程研究所 | A method and system for extracting attribute values from semi-structured web pages based on prompt learning
CN116701741A (en)* | 2023-06-19 | 2023-09-05 | 鼎富智能科技有限公司 | Website data acquisition method and device

Also Published As

Publication number | Publication date
CN102360368B (en) | 2014-07-02

Similar Documents

Publication | Publication Date | Title
CN102360368A (en) | | Web data extraction method based on visual customization of extraction template
Liu et al. | | Vide: A vision-based approach for deep web data extraction
JP7723803B2 (en) | | System and method for creating and editing textual content in a website building system
CN104881488B (en) | | Configurable information extraction method based on relation table
TWI290698B | | System and method for updating and displaying patent citation information
CN104021198A (en) | | Relational database information retrieval method and device based on ontology semantic index
CN102651055A (en) | | Method and system for generating file based on medical image
CN104142985A (en) | | A semi-automatic vertical crawler generation tool and method
TW200407736A | | System and method for classifying patents and displaying patent classification
US10776351B2 (en) | | Automatic core data service view generator
CN103514292A (en) | | Webpage data extraction method based on semi-supervised learning of small sample
CN103914488A (en) | | Document collection, identification, association, search and display system
WO2017017663A1 (en) | | System and method for the creation and use of visually-diverse high-quality dynamic visual data structures
Nadee et al. | | Towards data extraction of dynamic content from JavaScript Web applications
JP2001014166A (en) | | Ontology association information generation device
CN104408101B (en) | | A kind of full range Web information extracts integrated approach
Della Penna et al. | | Visual extraction of information from web pages
CN109948015B (en) | | Meta search list result extraction method and system
RU2613026C1 (en) | | Method of preparing documents in markup languages while implementing user interface for working with information system data
CN112711404A (en) | | Method for generating special topic webpage template once and automatically releasing special topic webpage
CN103116448A (en) | | Extract method for visualizing information
Haw et al. | | XMapDB-Sim: Performance evalaution on model-based XML to Relational Database mapping choices
Stange et al. | | When experts collaborate: Sharing search and domain expertise within an organization
CN113094382B (en) | | Semi-automatic data acquisition and updating method for multi-source data management
Bilenko | | The narrative explorer

Legal Events

Date | Code | Title | Description
 | C06 | Publication |
 | PB01 | Publication |
 | C10 | Entry into substantive examination |
 | SE01 | Entry into force of request for substantive examination |
 | C14 | Grant of patent or utility model |
 | GR01 | Patent grant |
 | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20140702
 | CF01 | Termination of patent right due to non-payment of annual fee |
