CN102360368B - Web data extraction method based on visual customization of extraction template - Google Patents

Web data extraction method based on visual customization of extraction template

Publication number
CN102360368B
Authority
CN
China
Prior art keywords
page
data
template
extraction
data item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201110301775.9A
Other languages
Chinese (zh)
Other versions
CN102360368A (en)
Inventor
李庆忠
闫中敏
彭朝晖
蔡益清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2011-10-09
Filing date
2011-10-09
Publication date
2014-07-02
Application filed by Shandong University
Priority to CN201110301775.9A
Publication of CN102360368A
Application granted
Publication of CN102360368B
Legal status: Expired - Fee Related
Anticipated expiration

Abstract

The invention discloses a Web data extraction method based on visual customization of an extraction template. The method comprises four steps: A. template page preprocessing; B. visual customization of the extraction template; C. setting the batch extraction frequency; D. batch page extraction. In step A, the source code of the template page is converted and displayed. In step B, the user interface provides drag-to-select so that the user can map the attribute labels and data values on the template page to attributes in the domain model, thereby building an extraction template. In step C, the batch extraction frequency is set so that crawled HTML pages are extracted in a batch once every 8 hours. In step D, the matching extraction template is applied to a large number of crawled HTML pages, and the semi-structured data in them is converted into structured data and saved to a local database.

Description

Web Data Extraction Method Based on Visual Customization of Extraction Template

Technical field

The invention relates to the extraction of Web pages, belongs to the field of computer applications, and in particular relates to a Web data extraction method based on visual customization of an extraction template.

Background

With the rapid development of Internet technology, the number of websites and web pages has grown explosively, making the Web a huge, widely distributed data source. Web information mainly takes the form of text, tables, and multimedia files such as pictures and videos. Web data extraction applies a set of rules to extract semantically consistent, structured numerical knowledge from Web data and to build a repository of numerical knowledge elements that serves users' data query and data analysis needs. Much work has been done in the data extraction field on automatically transforming input Web pages into structured data. Web data extraction mainly produces structured data that is convenient for subsequent analysis and mining, and it is therefore essential to many Web data analysis and mining applications.

A Web data extraction task can be formally defined by its input and output. The input can be unstructured data, such as free text, or the semi-structured documents that are ubiquitous on the Web.

Given these technical requirements, current Web page data extraction still has the following deficiencies:

1. Because data on the Web is heterogeneous and lacks structure, Web data applications for analysis and mining, such as market intelligence analysis, must spend considerable effort handling Web data sources in different formats.

2. The output of a Web data extraction task can be a relational table with multiple records or a data object with a complex structure. In some tasks an attribute may be missing, or one attribute in a record may have multiple values. In addition, when the semi-structured data in a Web page has attributes in varying order or contains spelling errors, the extraction task becomes more complicated and difficult.

Summary of the invention

The purpose of the invention is to solve the above problems by providing a Web data extraction method based on visual customization of an extraction template, which offers visual, user-friendly interaction.

To achieve this purpose, the invention adopts the following technical solution:

A Web data extraction method based on visual customization of an extraction template comprises the following steps:

A. Template page preprocessing;

B. Visual customization of the extraction template;

C. Setting the batch extraction frequency;

D. Batch page extraction.

Step A, template page preprocessing, is the conversion and display of the template page source code: the program analyzes the HTML source code of the template page held in memory, parses its DOM tree structure, converts it into XML format, and displays it in the user interface on the monitor. The conversion and display in step A comprise the following steps:

A1. Analyze the HTML source code of the supplied template page and convert it into a page file that conforms to the XML specification;

A2. Analyze the complete DOM structure of the page and display it in the user interface;

A3. Add the necessary JS control code to the converted page, while preserving the page's original structure, to enable page annotation;

A4. Display the resulting XML-format page in the user interface so that the user can customize the template visually.
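
The conversion in step A1 can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the implementation described by the patent, which does not name a parser. The class name `HtmlToXml`, the function `html_to_xml`, the wrapper element `document`, and the partial `VOID_TAGS` list are all assumptions made for the example. Tag names are upper-cased only so that they resemble the XPATH paths shown later in the description.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HtmlToXml(HTMLParser):
    """Hypothetical helper: builds a well-formed element tree from loose HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # assumed wrapper so the tree has one root
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        attributes = {name: value or "" for name, value in attrs}
        node = ET.SubElement(self.stack[-1], tag.upper(), attributes)
        if tag not in VOID_TAGS:
            self.stack.append(node)  # stays open until its end tag arrives

    def handle_startendtag(self, tag, attrs):
        attributes = {name: value or "" for name, value in attrs}
        ET.SubElement(self.stack[-1], tag.upper(), attributes)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; unclosed children
        # above it are closed implicitly. Stray end tags are ignored.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth].tag == tag.upper():
                del self.stack[depth:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text after a child element belongs to that child's tail
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html: str) -> ET.Element:
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return parser.root


# An unclosed <br> and an unclosed <td> still yield a tree that serializes as XML.
root = html_to_xml("<html><body><p>Hi<br>there<td>x</p></body></html>")
print(ET.tostring(root, encoding="unicode"))
```

A production converter would also need to handle attribute names that are not valid in XML, comments, and scripts; the sketch only shows how unbalanced tags can be repaired so that XPATH paths become usable.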

Step B, visual customization of the extraction template, means that the user interface provides drag-to-select, and the user maps the attribute labels and data values on the template page to attributes in the domain model, thereby building an extraction template. Step B comprises the following steps:

B1. After opening the template page shown on the monitor, the user drags the mouse to select a data item to extract; the program determines the XPATH path of the selected data item and records it;

B2. If the data item has a corresponding page label on the page, the user also drags to select that page label; the program records the XPATH path and the text content of the page label and combines them with the XPATH path of the selected data item into one extraction rule. If the data item has no corresponding page label, nothing further is selected;

B3. Based on the domain model, the user selects an attribute label for the extraction rule formed in steps B1 and B2. The attribute label belongs to a previously built domain model and matches the semantics of the data item covered by the rule. Labeling the rule in this way completes the mapping from the page data item to a column in the data table;

B4. Repeat steps B1 to B3 until all data to be extracted has been annotated, then save the resulting set of extraction rules as a page extraction template.
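
Steps B1 to B4 can be illustrated with a small data structure and a path calculation. The sketch below is written in Python under stated assumptions: the names `ExtractionRule` and `xpath_of`, the field names, and the JSON storage format are hypothetical, since the patent only says that the XPATH paths and the label text are recorded and saved as a template. The two table cells stand in for what the user would select by dragging.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ExtractionRule:
    data_label: str                         # attribute chosen from the domain model (B3)
    data_xpath: str                         # where the data item sits (B1)
    page_label: Optional[str] = None        # label text on the page, if any (B2)
    page_label_xpath: Optional[str] = None  # where that label sits (B2)


def xpath_of(root: ET.Element, target: ET.Element) -> str:
    """Return a positional path such as /HTML/BODY[1]/TABLE[1]/TR[1]/TD[2]."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        # Position counts only siblings with the same tag, as in XPath.
        same_tag = [c for c in parent_of[node] if c.tag == node.tag]
        position = next(i for i, c in enumerate(same_tag, start=1) if c is node)
        steps.append(f"{node.tag}[{position}]")
        node = parent_of[node]
    return "/" + "/".join([root.tag] + steps[::-1])


page = ET.fromstring(
    "<HTML><BODY><TABLE><TR><TD>Release date</TD><TD>2011-10-09</TD></TR>"
    "</TABLE></BODY></HTML>"
)
label_cell, data_cell = page.findall(".//TD")  # stands in for the two drag selections
rule = ExtractionRule(
    data_label="release_date",
    data_xpath=xpath_of(page, data_cell),
    page_label=label_cell.text,
    page_label_xpath=xpath_of(page, label_cell),
)
template = json.dumps([asdict(rule)], ensure_ascii=False)  # B4: the saved rule set
print(template)
```

Keeping the label text alongside both paths is what later allows the extractor to notice that a label has moved or disappeared.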

In step C, the batch extraction frequency is set so that crawled HTML pages are extracted in a batch once every 8 hours.
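
As an illustration of step C, the 8-hour setting could be expressed as a simple due-check. This is only a sketch in Python; the constant `EXTRACTION_INTERVAL` and the function `is_due` are hypothetical names, and the patent does not describe how the schedule is stored or triggered.

```python
from datetime import datetime, timedelta
from typing import Optional

EXTRACTION_INTERVAL = timedelta(hours=8)  # the frequency named in step C


def is_due(last_run: Optional[datetime], now: datetime,
           interval: timedelta = EXTRACTION_INTERVAL) -> bool:
    """True when no batch has run yet or a full interval has elapsed."""
    return last_run is None or now - last_run >= interval


last = datetime(2011, 10, 9, 0, 0)
print(is_due(last, datetime(2011, 10, 9, 7, 59)))  # one minute too early
print(is_due(last, datetime(2011, 10, 9, 8, 0)))   # exactly one interval later
```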

Step D, batch page extraction, applies the matching extraction template to a large number of crawled HTML pages, converts the semi-structured data in them into structured data, and saves it to a local database.

Batch page extraction in step D comprises the following steps:

D1. Convert the page currently being extracted into a well-formed XML file;

D2. Use the extraction rules recorded in the extraction template, that is, the XPATH paths, to extract the required data items;

D3. According to the data label of each extraction rule, save the extracted data items into the corresponding columns of the database table.
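
Step D3 can be sketched with Python's built-in `sqlite3` module. The patent only refers to a local database, so SQLite, the table name `job`, the column names, and the helpers `quote` and `save_record` are assumptions for illustration. Each key of the extracted record is the data label of a rule and doubles as the column name; a missing data item is stored as NULL.

```python
import sqlite3
from typing import Dict, Optional


def quote(identifier: str) -> str:
    """Quote a table or column name; identifiers cannot be bound as parameters."""
    return '"' + identifier.replace('"', '""') + '"'


def save_record(conn: sqlite3.Connection, table: str,
                record: Dict[str, Optional[str]]) -> None:
    columns = list(record)
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        quote(table),
        ", ".join(quote(c) for c in columns),
        ", ".join("?" for _ in columns),  # values are bound, never interpolated
    )
    conn.execute(sql, [record[c] for c in columns])
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (job_title TEXT, company TEXT, language TEXT)")
save_record(conn, "job", {"job_title": "Engineer", "company": "ACME", "language": None})
rows = conn.execute("SELECT job_title, company, language FROM job").fetchall()
print(rows)
```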

Step D2 comprises the following steps:

D2-1. Select an extraction rule that has not yet been used;

D2-2. If the rule records no page label information, read the text content directly at the XPATH path of the data item, mark the rule as used, and go to step D2-8. If the rule does record page label information, go to step D2-3;

D2-3. Extract the text at the XPATH path of the page label. If extraction succeeds, go to step D2-4. If it fails, the data item for this page label is missing or has moved on the current page; go to step D2-7;

D2-4. Compare the extracted text with the page label text recorded in the rule. If they match, extract the data at the data item XPATH recorded in the rule, mark the rule as used, and go to step D2-8. If they do not match, the data item for this page label is missing or has moved on the current page; go to step D2-5;

D2-5. Check whether the extracted text matches the page label of some unused extraction rule. If such a rule exists, treat the text as that rule's page label and go to step D2-6; otherwise go to step D2-7;

D2-6. From the page label XPATH and data item XPATH recorded in the matching rule, compute the XPATH of the data item that corresponds to this text as a page label, and extract the data there. If the extracted data is not empty, mark the matching rule as used. Go to step D2-7;

D2-7. Starting from the original XPATH path of the page label, perform an extended search of the page for the page label. If it is not found, the data item for this label is missing from the current page. If it is found, compute the XPATH of the label's data item from the page label XPATH and data item XPATH recorded in the rule, and extract the data. Finally, mark the original rule as used and go to step D2-8;

D2-8. Repeat the above steps until all extraction rules have been used.

Step D2-3 handles the case where the semi-structured data in a Web page has attributes in varying order or contains spelling errors: an extended search ensures that no data is lost.
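
The control flow of steps D2-1 to D2-8 can be sketched as follows. This is one possible reading in Python, not the patented implementation. It assumes that a label cell and its data cell are siblings with the same tag and that the data cell follows the label, so that the data position can be computed as a fixed offset from the label position. The patent does not specify how the data item XPATH is computed or how the extended search works; here the search simply scans the whole page for the label text. The names `Rule`, `node_at`, `text_of`, `offset_of`, `data_after`, `search_label`, and `extract` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Rule:
    data_label: str
    data_xpath: str
    page_label: Optional[str] = None
    label_xpath: Optional[str] = None


LAST_STEP = re.compile(r"^(.*)/(\w+)\[(\d+)\]$")  # splits ".../TD[3]" into prefix, TD, 3


def node_at(root, xpath):
    # ElementTree only accepts relative paths: "/HTML/BODY[1]" becomes "./BODY[1]".
    return root.find("." + xpath[len(root.tag) + 1:])


def text_of(node):
    return "".join(node.itertext()).strip() if node is not None else None


def offset_of(rule):
    """(tag, distance) from the label cell to the data cell, assumed same-tag siblings."""
    _, tag, label_pos = LAST_STEP.match(rule.label_xpath).groups()
    data_pos = LAST_STEP.match(rule.data_xpath).group(3)
    return tag, int(data_pos) - int(label_pos)


def data_after(root, rule, label_xpath):
    """D2-6: value for `rule` when its label actually sits at `label_xpath`."""
    prefix, tag, pos = LAST_STEP.match(label_xpath).groups()
    return text_of(node_at(root, f"{prefix}/{tag}[{int(pos) + offset_of(rule)[1]}]"))


def search_label(root, rule):
    """D2-7: look for the label text anywhere on the page (an assumed search strategy)."""
    tag, offset = offset_of(rule)
    for parent in root.iter():
        cells = parent.findall(tag)
        for i, cell in enumerate(cells):
            if text_of(cell) == rule.page_label and 0 <= i + offset < len(cells):
                return text_of(cells[i + offset])
    return None


def extract(root, rules: List[Rule]) -> Dict[str, Optional[str]]:
    record, used = {}, set()
    for i, rule in enumerate(rules):                      # D2-1 and D2-8
        if i in used:
            continue
        used.add(i)                                       # every branch ends with this rule used
        if rule.page_label is None:                       # D2-2
            record[rule.data_label] = text_of(node_at(root, rule.data_xpath))
            continue
        found = text_of(node_at(root, rule.label_xpath))  # D2-3
        if found == rule.page_label:                      # D2-4
            record[rule.data_label] = text_of(node_at(root, rule.data_xpath))
            continue
        other = next((j for j, r in enumerate(rules)      # D2-5
                      if found and j not in used and r.page_label == found), None)
        if other is not None:                             # D2-6
            value = data_after(root, rules[other], rule.label_xpath)
            if value:
                record[rules[other].data_label] = value
                used.add(other)
        record[rule.data_label] = search_label(root, rule)  # D2-7; None means missing
    return record


ROW = "/HTML/BODY[1]/TABLE[1]/TR[1]"
rules = [
    Rule("language", ROW + "/TD[2]", "Language", ROW + "/TD[1]"),
    Rule("education", ROW + "/TD[4]", "Education", ROW + "/TD[3]"),
]
# The language cells are absent, so the education cells have shifted left.
page = ET.fromstring(
    "<HTML><BODY><TABLE><TR><TD>Education</TD><TD>College</TD></TR></TABLE></BODY></HTML>"
)
print(extract(page, rules))
```

In the sample page the first rule finds the label of the second rule at its own label position, so the second rule's value is read from the shifted cell, and the first rule's value is recorded as missing after the page-wide search finds nothing.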

Beneficial effects of the invention:

1. For each data source, the invention uses visual user customization to design a parameterized, configurable wrapper with visual, user-friendly interaction, and applies the wrapper to automatically extract data from the collected large-scale set of Web pages.

2. Because the content and structure of Web pages change frequently, previously generated extraction rules can become invalid. The invention addresses how to improve the adaptability of Web data extraction so that it adjusts automatically to changes in the target pages and updates the corresponding extraction rules.

3. The data extraction method of the invention is widely applicable and precise, adapts to web page changes, and can greatly improve extraction efficiency.

Description of drawings

Fig. 1 is the flow of the Web data extraction method based on visual customization of the extraction template;

Fig. 2 is the template page preprocessing flow;

Fig. 3 is the visual customization flow of the page extraction template;

Fig. 4 is the overall page extraction flow;

Fig. 5 is the detailed flow of the extraction process;

Fig. 6 is a schematic diagram of a website's detail page used as the template page;

Fig. 7 is a schematic diagram of the extraction process applied to a page of the website.

Detailed description

The invention is further described below with reference to the drawings and embodiments.

As shown in Fig. 1, a Web data extraction method based on visual customization of an extraction template comprises the following steps:

A. Template page preprocessing;

B. Visual customization of the extraction template;

C. Setting the batch extraction frequency;

D. Batch page extraction.

Step A, template page preprocessing, is the conversion and display of the template page source code: the program analyzes the HTML source code of the template page held in memory, parses its DOM tree structure, converts it into XML format, and displays it in the user interface on the monitor.

Step B, visual customization of the extraction template, means that the user interface provides drag-to-select, and the user maps the attribute labels and data values on the template page to attributes in the domain model, thereby building an extraction template.

In step C, the batch extraction frequency is set so that crawled HTML pages are extracted in a batch once every 8 hours.

Step D, batch page extraction, applies the matching extraction template to a large number of crawled HTML pages, converts the semi-structured data in them into structured data, and saves it to a local database.

As shown in Fig. 2, the conversion and display of the template page source code in step A comprise the following steps:

A1. Analyze the HTML source code of the supplied template page and convert it into a page file that conforms to the XML specification;

A2. Analyze the complete DOM structure of the page and display it in the user interface;

A3. Add the necessary JS control code to the converted page, while preserving the page's original structure, to enable page annotation;

A4. Display the resulting XML-format page in the user interface so that the user can customize the template visually.

As shown in Fig. 3, visual customization of the extraction template in step B comprises the following steps:

B1. After opening the template page shown on the monitor, the user drags the mouse to select a data item to extract; the program determines the XPATH path of the selected data item and records it;

B2. If the data item has a corresponding page label on the page, the user also drags to select that page label; the program records the XPATH path and the text content of the page label and combines them with the XPATH path of the selected data item into one extraction rule. If the data item has no corresponding page label, nothing further is selected;

B3. Based on the domain model, the user selects an attribute label for the extraction rule formed in steps B1 and B2. The attribute label belongs to a previously built domain model and matches the semantics of the data item covered by the rule. Labeling the rule in this way completes the mapping from the page data item to a column in the data table;

B4. Repeat steps B1 to B3 until all data to be extracted has been annotated, then save the resulting set of extraction rules as a page extraction template.

As shown in Fig. 4, batch page extraction in step D comprises the following steps:

D1. Convert the page currently being extracted into a well-formed XML file;

D2. Use the extraction rules recorded in the extraction template, which are essentially XPATH paths, to extract the required data items;

D3. According to the data label of each extraction rule, save the extracted data items into the corresponding columns of the database table.

As shown in Fig. 5, step D2 comprises the following steps:

D2-1. Select an extraction rule that has not yet been used;

D2-2. If the rule records no page label information, read the text content directly at the XPATH path of the data item, mark the rule as used, and go to step D2-8. If the rule does record page label information, go to step D2-3;

D2-3. Extract the text at the XPATH path of the page label. If extraction succeeds, go to step D2-4. If it fails, the data item for this page label is missing or has moved on the current page; go to step D2-7;

D2-4. Compare the extracted text with the page label text recorded in the rule. If they match, extract the data at the data item XPATH recorded in the rule, mark the rule as used, and go to step D2-8. If they do not match, the data item for this page label is missing or has moved on the current page; go to step D2-5;

D2-5. Check whether the extracted text matches the page label of some unused extraction rule. If such a rule exists, treat the text as that rule's page label and go to step D2-6; otherwise go to step D2-7;

D2-6. From the page label XPATH and data item XPATH recorded in the matching rule, compute the XPATH of the data item that corresponds to this text as a page label, and extract the data there. If the extracted data is not empty, mark the matching rule as used. Go to step D2-7;

D2-7. Starting from the original XPATH path of the page label, perform an extended search of the page for the page label. If it is not found, the data item for this label is missing from the current page. If it is found, compute the XPATH of the label's data item from the page label XPATH and data item XPATH recorded in the rule, and extract the data. Finally, mark the original rule as used and go to step D2-8;

D2-8. Repeat the above steps until all extraction rules have been used.

Step D2-3 handles the case where the semi-structured data in a Web page has attributes in varying order or contains spelling errors: an extended search ensures that no data is lost.

In another embodiment of the invention, a website is used as the data source. A detail page serves as the template page for customizing the template; a screenshot of the page's main data area is shown in Fig. 6.

Suppose the data that the user manually marks for extraction is the part enclosed by rectangles in the figure.

This yields the following 9 extraction rules:

1. Data label: 职位名称 (job title)
Page label: none
Data item XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[3]/TD[2]

2. Data label: 招聘公司 (recruiting company)
Page label: none
Data item XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[1]/TBODY[1]/TR[2]/TD[1]/TABLE[1]/TBODY[1]/TR[1]/TD[1]/STRONG[1]

3. Data label: 发布日期 (release date)
Page label: 发布日期 (release date)
Page label XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[1]
Data item XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[2]

4. Data label: 工作地点 (work location)
Page label: 工作地点 (work location)
Page label XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[3]
Data item XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[4]

5. Data label: 招聘人数 (number of openings)
Page label: 招聘人数 (number of openings)
Page label XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[5]
Data item XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[6]

6. Data label: 工作经验 (work experience)
Page label: 工作年限 (years of work)
Page label XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[1]
Data item XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[2]

7. Data label: 语言要求 (language requirements)
Page label: 语言要求 (language requirements)
Page label XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[3]
Data item XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[4]

8. Data label: 学历 (education)
Page label: 学历要求 (education requirements)
Page label XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[5]
Data item XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[6]

9. Data label: 薪金水平 (salary level)
Page label: 薪水范围 (salary range)
Page label XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[5]
Data item XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[6]

Using the extraction template formed by these 9 extraction rules, similar pages from the website can be extracted in batches.

Suppose a page from the same website (Fig. 7) is extracted:

This page lacks 2 of the data items to be extracted: language requirements and salary level. Analysis of the page code shows that extraction rules 1 to 6 are still valid and can be applied directly. When rule 7, "language requirements", is applied, the text at the XPATH position of its page label on the current page is 学历 (education), which does not match the language-requirements label recorded in the rule. However, the education page label exists in extraction rule 8, so the data item that follows it, "大专" (junior college), is extracted, and an extended search of the page is then made for the "language requirements" page label. Because the page does not contain that label, the search finds nothing. Thus, although the structure of the extracted page differs from that of the page used to create the template, the data on the page is still correctly identified and extracted.

Although the specific embodiments of the invention have been described above with reference to the drawings, they do not limit the scope of protection of the invention. Those skilled in the art should understand that various modifications or variations that can be made on the basis of the technical solution of the invention without creative effort still fall within the scope of protection of the invention.

Claims (1)

CN201110301775.9A2011-10-092011-10-09Web data extraction method based on visual customization of extraction templateExpired - Fee RelatedCN102360368B (en)

Priority Applications (1)

Application NumberPriority DateFiling DateTitle
CN201110301775.9ACN102360368B (en)2011-10-092011-10-09Web data extraction method based on visual customization of extraction template

Applications Claiming Priority (1)

Application NumberPriority DateFiling DateTitle
CN201110301775.9ACN102360368B (en)2011-10-092011-10-09Web data extraction method based on visual customization of extraction template

Publications (2)

Publication NumberPublication Date
CN102360368A CN102360368A (en)2012-02-22
CN102360368Btrue CN102360368B (en)2014-07-02

Family

ID=45585697

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201110301775.9A (Expired - Fee Related, CN102360368B (en)) | Web data extraction method based on visual customization of extraction template | 2011-10-09 | 2011-10-09

Country Status (1)

Country | Link
CN (1) | CN102360368B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
TWI682287B (en) | 2018-10-25 | 2020-01-11 | 財團法人資訊工業策進會 | Knowledge graph generating apparatus, method, and computer program product thereof

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US8990140B2 (en)* | 2012-06-08 | 2015-03-24 | Microsoft Technology Licensing, Llc | Transforming data into consumable content
US9595298B2 (en) | 2012-07-18 | 2017-03-14 | Microsoft Technology Licensing, Llc | Transforming data to create layouts
CN103020189B (en)* | 2012-12-03 | 2016-08-10 | 深圳中兴网信科技有限公司 | Data processing equipment and data processing method
CN103116448A (en)* | 2013-01-30 | 2013-05-22 | 浪潮电子信息产业股份有限公司 | Extract method for visualizing information
CN104182412B (en)* | 2013-05-24 | 2017-08-04 | 中国移动通信集团安徽有限公司 | A web crawling method and system
CN105447184B (en)* | 2015-12-15 | 2019-06-11 | 北京百分点信息科技有限公司 | Information capture method and device
CN106021485B (en)* | 2016-05-19 | 2019-05-14 | 中国传媒大学 | A kind of polynary attribute cinematic data visualization system
CN107437158B (en)* | 2016-05-26 | 2021-08-10 | 北京京东尚科信息技术有限公司 | Data query method, device and computer readable storage medium
CN106202348A (en)* | 2016-07-04 | 2016-12-07 | 中山大学 | A kind of web page form information extraction method
CN108121743A (en)* | 2016-11-30 | 2018-06-05 | 中移(苏州)软件技术有限公司 | A kind of generation of generic web pages masterplate and application method, system
CN106648677B (en)* | 2016-12-28 | 2019-08-02 | 中国科学院南京地理与湖泊研究所 | A kind of water environment domain model integrates the visible customization method of template
US10380228B2 (en) | 2017-02-10 | 2019-08-13 | Microsoft Technology Licensing, Llc | Output generation based on semantic expressions
CN106980921B (en)* | 2017-03-02 | 2021-01-26 | 上海歌略软件科技有限公司 | User-defined risk analysis method
CN107609144A (en)* | 2017-09-21 | 2018-01-19 | 浪潮软件股份有限公司 | A kind of analysis result processing method, apparatus and system
CN107608949B (en)* | 2017-10-16 | 2019-04-16 | 北京神州泰岳软件股份有限公司 | A kind of Text Information Extraction method and device based on semantic model
CN108334634A (en)* | 2018-02-27 | 2018-07-27 | 北京中关村科金技术有限公司 | A kind of method, apparatus, equipment and the storage medium of extraction data information
CN110309364B (en)* | 2018-03-02 | 2023-03-28 | 腾讯科技(深圳)有限公司 | Information extraction method and device
CN108984683B (en)* | 2018-06-29 | 2021-06-25 | 北京百度网讯科技有限公司 | Structured data extraction method, system, device and storage medium
CN109753596B (en)* | 2018-12-29 | 2021-05-25 | 中国科学院计算技术研究所 | Information source management and configuration method and system for large-scale network data acquisition
CN111782737B (en)* | 2020-08-12 | 2024-05-28 | 中国工商银行股份有限公司 | Information processing method, device, equipment and storage medium
CN112199960B (en)* | 2020-11-12 | 2021-05-25 | 北京三维天地科技股份有限公司 | Standard knowledge element granularity analysis system
CN116628303B (en)* | 2023-04-26 | 2025-03-14 | 中国科学院信息工程研究所 | A method and system for extracting attribute values from semi-structured web pages based on prompt learning
CN116701741A (en)* | 2023-06-19 | 2023-09-05 | 鼎富智能科技有限公司 | Website data acquisition method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN1588371A (en)* | 2004-09-08 | 2005-03-02 | 孟小峰 | Forming method for package device
CN101582075A (en)* | 2009-06-24 | 2009-11-18 | 大连海事大学 | Web information extraction system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
基于DOM 树的可适应性Web 信息抽取;李 朝等;《计算机科学》;20090731;第36卷(第7期);203-203*
李 朝等.基于DOM 树的可适应性Web 信息抽取.《计算机科学》.2009,第36卷(第7期),203-203.
网页结构化信息抽取技术方法研究;郝爱峰;《山西电子技术》;20080430(第4期);第2部分*
郝爱峰.网页结构化信息抽取技术方法研究.《山西电子技术》.2008,(第4期),第2部分.

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
TWI682287B (en) | 2018-10-25 | 2020-01-11 | 財團法人資訊工業策進會 | Knowledge graph generating apparatus, method, and computer program product thereof
US11250035B2 (en) | 2018-10-25 | 2022-02-15 | Institute For Information Industry | Knowledge graph generating apparatus, method, and non-transitory computer readable storage medium thereof

Also Published As

Publication number | Publication date
CN102360368A (en) | 2012-02-22

Similar Documents

Publication | Title
CN102360368B (en) | Web data extraction method based on visual customization of extraction template
CN104881488B (en) | Configurable information extraction method based on relation table
CN103886046B (en) | Automatic semanteme extraction method for Web data exchange
US8407585B2 (en) | Context-aware content conversion and interpretation-specific views
US7886225B2 (en) | Method and apparatus for the creation, location and formatting of digital content
TWI290698B (en) | System and method for updating and displaying patent citation information
CN108121739B (en) | Data collection method and data collection system
CN101751382B (en) | Data acquisition method based on labels and system thereof
CN108446368A (en) | A kind of construction method and equipment of Packaging Industry big data knowledge mapping
CN104021198B (en) | Relational Database Information Retrieval Method and Device Based on Ontology Semantic Index
CN106202292A (en) | A kind of standard information based on structural data model analyzes method
JPWO2006085455A1 (en) | Document processing apparatus and document processing method
US20200133945A1 (en) | Blended retrieval of data in transformed, normalized data models
CN104142985A (en) | A semi-automatic vertical crawler generation tool and method
CN103136258B (en) | The extracting method of knowledge entry and device
Gao et al. | Are dense labels always necessary for 3d object detection from point cloud?
Harlow | Data Munging tools in preparation for RDF: Catmandu and LODRefine
Steiner et al. | A digital archive of cultural heritage objects: standardized metadata and annotation categories
Kerne et al. | Meta-metadata: a metadata semantics language for collection representation applications
CN116975367A (en) | Data relationship processing method and device, electronic equipment and storage medium
CN105574016A (en) | Method for half-structured Web information extraction technology
CN116610730B (en) | Spatiotemporal big data in-depth analysis method and system based on knowledge graph
Shakya et al. | StYLiD: Social information sharing with free creation of structured linked data.
Najafian et al. | From Semantic To Instance: A Semi-Self-Supervised Learning Approach
CN118797929A (en) | A structured analysis method and system for requirement model based on ReqIF format

Legal Events

Code | Title
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
C14 | Grant of patent or utility model
GR01 | Patent grant
CF01 | Termination of patent right due to non-payment of annual fee
CF01 | Termination of patent right due to non-payment of annual fee

Granted publication date: 20140702

