CN104572874A - Webpage information extraction method and device - Google Patents

Webpage information extraction method and device
Publication number: CN104572874A
Authority: CN (China)
Prior art keywords: url, webpage, information, extracting information, template
Legal status: Granted; current status Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201410804430.9A
Other languages: Chinese (zh)
Other versions: CN104572874B (en)
Inventor: 刘雄伟
Current Assignee: Beijing Ruian Technology Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Beijing Ruian Technology Co Ltd
Application filed by: Beijing Ruian Technology Co Ltd
Priority to: CN201410804430.9A
Publications: CN104572874A (application), CN104572874B (grant)
Abstract

An embodiment of the present invention discloses a method and device for extracting webpage information. The method includes: obtaining the Uniform Resource Locator (URL) of the webpage from which information is to be extracted; selecting a preset template according to that URL; and extracting the webpage information using the selected preset template. This improves the accuracy of webpage information extraction.

Description

Method and device for extracting webpage information

Technical field

The invention relates to the field of information technology, and in particular to a method and device for extracting webpage information.

Background

With the rapid development of the Internet, online media, as a new form of information dissemination, has become part of people's daily life. Text information extraction is an accurate and efficient way of acquiring information. It extracts the information users need, such as specified entities, relationships, and events, from one or more webpages and presents it to the user as structured data. This approach offers accurate content, low redundancy, and standardized organization.

In the prior art, several technical methods are available for extracting multi-record webpages. For example, a traditional method extracts records using hand-written rules, which can accurately and quickly extract record information from a specific data source. However, as the volume of online information grows and webpage content is continuously updated, extracting webpage information with only a single manually configured template inevitably reduces extraction accuracy in the face of massive, ever-changing data. Even when used only on sites within a single domain, existing methods still cannot effectively improve extraction accuracy, because the webpages are numerous and their layout styles are diverse and changeable.

Summary of the invention

In view of this, embodiments of the present invention propose a method and device for extracting webpage information, so as to improve the accuracy of webpage information extraction.

In a first aspect, an embodiment of the present invention provides a method for extracting webpage information, the method comprising:

obtaining the Uniform Resource Locator (URL) of the webpage from which information is to be extracted;

selecting a preset template according to the URL of the webpage;

extracting the webpage information using the selected preset template.

In a second aspect, an embodiment of the present invention provides a device for extracting webpage information, the device comprising:

a URL acquisition unit, configured to obtain the Uniform Resource Locator (URL) of the webpage from which information is to be extracted;

a template selection unit, configured to select a preset template according to the URL of the webpage;

a webpage information extraction unit, configured to extract the webpage information using the selected preset template.

The method and device provided by the embodiments of the present invention obtain the URL of the webpage from which information is to be extracted, select a preset template according to that URL, and extract the webpage information using the selected preset template, thereby improving the accuracy of webpage information extraction.

Description of drawings

Other features, objects, and advantages of the present invention will become more apparent from the detailed description of non-limiting embodiments given with reference to the following drawings:

Fig. 1 is a flowchart of the method for extracting webpage information provided by the first embodiment of the present invention;

Fig. 2 is a schematic diagram of the method for extracting webpage information provided by the first embodiment of the present invention;

Fig. 3 is a flowchart of the method for extracting webpage information provided by the second embodiment of the present invention;

Fig. 4 is a schematic diagram of the method for extracting webpage information provided by the second embodiment of the present invention;

Fig. 5 is a flowchart of the method for extracting webpage information provided by the third embodiment of the present invention;

Fig. 6 is a flowchart of the method for extracting webpage information provided by the fourth embodiment of the present invention;

Fig. 7 is a flowchart of the method for extracting webpage information provided by the fifth embodiment of the present invention;

Fig. 8 is a structural diagram of the device for extracting webpage information provided by the sixth embodiment of the present invention.

Detailed description

The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. Note also that, for ease of description, the drawings show only the parts relevant to the invention rather than the entire content.

Figs. 1 and 2 show the first embodiment of the present invention.

Fig. 1 is a flowchart of the method for extracting webpage information provided by the first embodiment; Fig. 2 is a schematic diagram of that method. The method includes:

Step S101: obtain the Uniform Resource Locator (URL) of the webpage from which information is to be extracted.

A Uniform Resource Locator (URL) is a concise representation of the location of, and access method for, a resource available on the Internet; it is the address of a standard resource on the Internet. Every file on the Internet has a unique URL, which indicates where the file is located and how a browser should handle it.

A URL can also serve as an address on the World Wide Web. Every webpage that can be accessed on the Internet has a URL. Therefore, for a webpage from which information is to be extracted, the URL of that webpage should be obtained first. For example, to extract information from the NetEase homepage, first obtain its URL (that is, http://www.163.com/).

Step S102: select a preset template according to the URL of the webpage from which information is to be extracted.

Different templates are preset for different websites, because the information that different websites display differs greatly. Take Sina and Taobao: Sina, as a comprehensive portal, mainly displays news, while Taobao mainly displays products. The extraction templates used for these two websites necessarily differ considerably. If the same template were used for both, accuracy would drop, because the regular expressions in a template only match strings with the corresponding structure. Selecting the corresponding preset template from the URL of the webpage therefore improves the accuracy of webpage information extraction.
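Step S102 can be illustrated with a minimal sketch. It assumes a lookup table keyed by host name; the table `TEMPLATES_BY_HOST`, the template names, and the function `select_template` are hypothetical and do not come from the patent, which does not say how templates are indexed.

```python
from urllib.parse import urlparse

# Hypothetical registry (an assumption): host suffix -> template name.
TEMPLATES_BY_HOST = {
    "sina.com.cn": "news_template",
    "taobao.com": "product_template",
}


def select_template(url, default=None):
    """Return the template registered for the URL's host, or `default`."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, template in TEMPLATES_BY_HOST.items():
        # Match the registered host itself or any of its subdomains.
        if host == suffix or host.endswith("." + suffix):
            return template
    return default
```

Matching on the host suffix lets subdomains such as news.sina.com.cn share one site-level template; a finer-grained scheme could key on path prefixes as well.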

Step S103: extract the webpage information using the selected preset template.

The webpage information is extracted according to the preset template selected in step S102. The template may be a set of regular expressions. A regular expression is a logical formula for operating on strings: predefined special characters and combinations of them form a "rule string", which expresses a filtering logic for strings.

Given a regular expression and a string, one can determine whether the string conforms to the filtering logic of the regular expression (called a "match"), and one can also use the regular expression to pull a specific, desired part out of the string.

Through the configured regular expressions, the relevant content can be identified in and extracted from the webpage, irrelevant content removed, and the extracted information stored in a specified database for convenient querying and viewing.
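As a rough illustration of a template made of regular expressions, the sketch below maps field names to patterns with one capture group each. The template `NEWS_TEMPLATE`, its field names, and the HTML structure it expects are assumptions for illustration only; storing the result in a database is omitted.

```python
import re

# Hypothetical template (an assumption): field name -> regex with one capture group.
NEWS_TEMPLATE = {
    "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
    "date": re.compile(r'<span class="date">(.*?)</span>', re.S),
}


def extract(html, template):
    """Apply every pattern in the template; unmatched fields are set to None."""
    record = {}
    for field, pattern in template.items():
        match = pattern.search(html)
        record[field] = match.group(1).strip() if match else None
    return record
```

Content that no pattern matches, such as advertisement markup, simply never reaches the record, which is how the irrelevant content is dropped.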

This embodiment obtains the Uniform Resource Locator (URL) of the webpage from which information is to be extracted, selects a preset template according to that URL, and extracts the webpage information using the selected template, thereby improving extraction accuracy.

Embodiment two

Figs. 3 and 4 show the second embodiment of the present invention.

Fig. 3 is a flowchart of the method for extracting webpage information provided by the second embodiment, and Fig. 4 is a schematic diagram of that method. The method builds on the first embodiment. Obtaining the URL of the webpage is refined to: obtaining the URL of the webpage and the URLs contained in the webpage. Selecting a preset template according to the URL of the webpage is refined to: selecting a preset template according to the URL of the webpage and the URLs contained in the webpage.

Referring to Figs. 3 and 4, the method includes:

Step S201: obtain the URL of the webpage from which information is to be extracted and the URLs contained in that webpage.

The webpage may contain multiple links. For example, the webpage may be the entry page of a portal website. The NetEase homepage, for instance, contains links to several sub-sections, such as forums, news, and finance. A web crawler can be used to obtain these links and the content of the webpages they point to. A web crawler is a program that fetches webpages automatically: starting from the URLs of one or more initial webpages, it obtains the URLs found on those pages and, while crawling, keeps extracting new URLs from the current page and adding them to a queue.
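The queueing behavior of the crawler can be sketched as follows. This is a minimal illustration that works on HTML already in memory; fetching pages over the network is left out, and the names `LinkCollector` and `enqueue_links` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


def enqueue_links(page_url, html, queue, seen):
    """Append every not-yet-seen link found in `html` to the crawl queue."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    for link in collector.links:
        if link not in seen:
            seen.add(link)
            queue.append(link)
```

A crawl loop would pop a URL from the `deque`, fetch it, call `enqueue_links` on the result, and repeat until the queue is empty or a page limit is reached.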

Step S202: select a preset template according to the URL of the webpage and the URLs contained in the webpage.

The webpage may contain multiple links. For example, the homepage of a portal website contains links to several sub-sections, such as forums, news, and finance. Because the content of the sub-sections differs greatly, the corresponding preset template must be selected according to the URL of each sub-section. A template may consist of a set of regular expressions.

Step S203: extract the webpage information using the selected preset template.

This embodiment refines obtaining the URL of the webpage to obtaining both the URL of the webpage and the URLs contained in it, and refines template selection to selecting a preset template according to both. A web crawler can be used to obtain the URLs contained in the webpage and the content of the pages they point to, and a suitable template is selected for each contained URL to extract the webpage information. In this way, information can be extracted from multiple webpages automatically and quickly while accuracy is maintained.

Embodiment three

Fig. 5 shows the third embodiment of the present invention.

Fig. 5 is a flowchart of the method for extracting webpage information provided by the third embodiment. The method builds on the first embodiment. After the URL of the webpage is obtained, a step is added: dividing the page into blocks. Selecting a preset template according to the URL of the webpage is refined to: selecting a preset template according to the URL of the webpage and the block information. Extracting the webpage information using the selected preset template specifically includes: extracting the webpage information using the preset template selected according to the URL of the webpage and the block information.

Referring to Fig. 5, the method includes:

Step S301: obtain the Uniform Resource Locator (URL) of the webpage from which information is to be extracted.

Step S302: divide the page into blocks.

Through its layout, a page formats its text, graphics, and tables so that the page contains multiple blocks, such as information blocks, image blocks, and advertisement blocks. A webpage can be divided into blocks according to the specific content of each block; for a webpage with simple content, it can also be divided by specifying region ranges.

Step S303: select a preset template according to the URL of the webpage and the block information.

For a page that has been divided into blocks, a suitable preset template can be selected from the template database according to the URL of the webpage and the position of the block within the page.

Step S304: extract the webpage information using the preset template selected according to the URL of the webpage and the block information.

The information within each block of the webpage is extracted according to the template selected in step S303.

This embodiment adds a page-blocking step after the URL is obtained, selects the preset template according to both the URL of the webpage and the block information, and extracts the webpage information with the template so selected. Dividing the webpage into blocks and choosing a suitable template from the block information and the webpage URL speeds up extraction and further improves the accuracy of the extracted information.

Embodiment four

Fig. 6 shows the fourth embodiment of the present invention.

Fig. 6 is a flowchart of the method for extracting webpage information provided by the fourth embodiment. The method builds on the third embodiment. Dividing the page into blocks is refined to: traversing all tags of the page and determining the block regions formed by consecutive tags.

Referring to Fig. 6, the method includes:

Step S401: obtain the Uniform Resource Locator (URL) of the webpage from which information is to be extracted.

Step S402: traverse all separator tags of the page.

Within the page, different content is marked up with corresponding tags, for example in the page's HyperText Markup Language (HTML) text. The text file uses tags to describe information blocks, such as <beginTag></beginTag>, <endTag></endTag>, and <divideTag></divideTag>. <beginTag></beginTag> and <endTag></endTag> indicate where an information block starts and ends; with them, the information block can be located in the HTML source of the page. <divideTag></divideTag> denotes the marker that acts as a divider inside an information block. All tags of the page can be traversed from its HTML text.

Step S403: determine the block regions formed by consecutive tags.

From the result of traversing all tags of the page in step S402, consecutive tags can be found, for example <beginTag></beginTag> and <endTag></endTag>; the content enclosed between them is the information of that block. An information block consists internally of multiple parts with the same kind of content and the same form. <divideTag></divideTag> denotes the marker that acts as a divider inside the information block, that is, it separates the individual sub-blocks within the larger block.
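One way to read the begin, end, and divide markers is sketched below: the begin and end markers locate an information block in the page source, and the divide marker splits it into sub-blocks. The function `split_block` and the plain substring search are assumptions for illustration; the patent does not specify how the markers are matched.

```python
def split_block(html, begin_tag, end_tag, divide_tag):
    """Return the sub-blocks found between begin_tag and end_tag.

    begin_tag, end_tag and divide_tag are literal marker strings that a
    template would supply (hypothetical; matched here by substring search).
    """
    start = html.find(begin_tag)
    if start == -1:
        return []
    start += len(begin_tag)
    end = html.find(end_tag, start)
    if end == -1:
        return []
    block = html[start:end]
    # Drop empty fragments left over after the final divider.
    return [part.strip() for part in block.split(divide_tag) if part.strip()]
```

Each returned sub-block corresponds to one record inside the information block and can then be passed to the field-level patterns of the template.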

Step S404: select a preset template according to the URL of the webpage and the block information.

Step S405: extract the webpage information using the preset template selected according to the URL of the webpage and the block information.

This embodiment refines page blocking to traversing all tags of the page and determining the block regions formed by consecutive tags. The page can thus be divided into blocks accurately according to its content, which further improves the accuracy of information extraction.

Embodiment five

Fig. 7 is a flowchart of the method for extracting webpage information provided by the fifth embodiment of the present invention. The method builds on the fourth embodiment. Determining the block regions formed by consecutive tags is refined to: calculating, from preset separator tag weights, the weight of the block region formed between separator tags; and determining the block regions formed between separator tags whose weight is greater than a preset value.

Referring to Fig. 7, the method includes:

Step S501: obtain the Uniform Resource Locator (URL) of the webpage from which information is to be extracted.

Step S502: traverse all tags of the page.

Step S503: calculate, from the preset separator tag weights, the weight of the blocks formed between separator tags.

The webpage blocks delimited by separator tags vary greatly: some blocks contain a lot of information, while others contain only a few words. Link blocks in particular clearly do not need to be extracted. Running these link blocks through a template as in the earlier method would waste considerable resources, so each block formed between separator tags must be evaluated to decide whether it needs to be extracted with a template.

In this embodiment, the blocks are judged against a preset threshold for the intervals formed between separator tags. This can be implemented with the following procedure:

n := 0; k := 0; TagSeg := Φ;
While Not end of Doc file
    k := k + 1
    : the k-th HTML tag extracted from Doc
    If Blank(,)    // consecutive HTML tags exist
        If ∈ S    // consecutive separator tags exist
            → TagSeg
        End If
    End
    If    // within a separator tag segment
        calculate the segmentation weight corresponding to the separator tag segment
    End
    Else
End While

Step S504: determine the block regions formed between separator tags whose weight is greater than the preset value.

Based on the result of step S503, the block regions that meet the set threshold can be placed in one set; the block regions in this set are those formed between separator tags whose weight is greater than the preset value. The implementation code is as follows:

If Sws ≥ S′    // the separator tag segment forms an interval
    <Bn, TagSegn> → Q
End If
End If
// clear the separator tag set
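Because several symbols are missing from the pseudocode above, the following Python sketch is one possible reading of steps S503 and S504, not the patent's exact procedure. It assumes that the weight of a run of consecutive tags is the sum of per-tag weights and that a run whose weight reaches the threshold closes the current block. The weights in `SEPARATOR_WEIGHTS`, the tokenizing pattern `TOKEN`, and the function name are all hypothetical.

```python
import re

# Hypothetical separator tags and weights (assumptions; the source lists no values).
SEPARATOR_WEIGHTS = {"div": 3, "table": 3, "hr": 2, "p": 1, "br": 1}

# Matches either an HTML tag (group 1 = tag name) or a run of text (group 2).
TOKEN = re.compile(r"<\s*/?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)")


def split_by_separator_weight(html, threshold):
    """Split page text into blocks wherever a run of consecutive tags
    has a summed separator weight of at least `threshold`."""
    blocks, current, seg_weight = [], [], 0
    for match in TOKEN.finditer(html):
        tag, text = match.group(1), match.group(2)
        if tag is not None:
            # Consecutive tags accumulate into one segment weight;
            # tags that are not separators contribute zero.
            seg_weight += SEPARATOR_WEIGHTS.get(tag.lower(), 0)
            continue
        if not text.strip():
            continue  # whitespace between tags does not end a tag run
        if seg_weight >= threshold and current:
            blocks.append(" ".join(current))  # the run is heavy enough to split
            current = []
        seg_weight = 0  # clear the segment once text is reached
        current.append(text.strip())
    if current:
        blocks.append(" ".join(current))
    return blocks
```

With a threshold of 3, a lone `<p>` boundary does not split a block, while a `</div><div>` boundary does; raising the threshold merges more of the page into each block. The sketch does not handle HTML comments or script content.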

Step S505: select a preset template according to the URL of the webpage and the block information.

Step S506: extract the webpage information using the preset template selected according to the URL of the webpage and the block information.

This embodiment refines determining the block regions formed by consecutive tags to: calculating, from preset separator tag weights, the weight of the block region formed between separator tags; and determining the block regions formed between separator tags whose weight is greater than the preset value. The block regions of a page can thus be evaluated and those that need no extraction discarded. This reduces the work of selecting templates and applying them to block regions, lowers the extraction workload, speeds up extraction, and also improves the accuracy of the extracted information.

Using the webpage information extraction method provided by this embodiment, the financial report data of listed companies was extracted from three major websites, Sina, Sohu, and Tencent, and the results are as follows:

Embodiment six

Fig. 8 shows the sixth embodiment of the present invention.

Fig. 8 is a structural diagram of the device for extracting webpage information provided by the sixth embodiment of the present invention.

As Fig. 8 shows, the device for extracting webpage information includes a URL acquisition unit 610, a template selection unit 620, and a webpage information extraction unit 630.

The URL acquisition unit is configured to obtain the Uniform Resource Locator (URL) of the webpage from which information is to be extracted;

the template selection unit is configured to select a preset template according to the URL of the webpage;

the webpage information extraction unit is configured to extract the webpage information using the selected preset template.
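The cooperation of the three units can be pictured with a small sketch in which each unit is a plain callable. The class `ExtractionDevice`, its field names, and the extra `fetch` callable that retrieves page content are assumptions for illustration; the patent describes the units only functionally.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ExtractionDevice:
    # Each field stands in for one unit of the device (hypothetical names).
    acquire_url: Callable[[], str]                        # URL acquisition unit
    select_template: Callable[[str], Optional[object]]    # template selection unit
    extract: Callable[[str, object], Dict[str, str]]      # information extraction unit
    fetch: Callable[[str], str]                           # assumed page downloader

    def run(self) -> Dict[str, str]:
        """Chain the units: obtain URL, select template, extract."""
        url = self.acquire_url()
        template = self.select_template(url)
        if template is None:
            return {}  # no preset template is registered for this URL
        return self.extract(self.fetch(url), template)
```

Passing the units in as callables keeps the sketch independent of any particular template format or download mechanism.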

进一步的,所述的URL获取单元具体用于:Further, the URL acquisition unit is specifically used for:

获取欲抽取信息网页的URL及欲抽取信息网页所包括的的URL;Obtain the URL of the webpage to extract information and the URL included in the webpage to extract information;

所述模板选择单元具体用于:The template selection unit is specifically used for:

根据欲抽取信息网页的URL及欲抽取信息网页所包括的URL选择预先设定的模板。A preset template is selected according to the URL of the information webpage to be extracted and the URL included in the information webpage to be extracted.

进一步的,所述的网页信息的抽取装置还包括分块单元640。Further, the apparatus for extracting webpage information further includes a block unit 640 .

所述分块单元,用于对页面进行分块;The block unit is used to block the page;

所述的模板选择单元具体用于:The template selection unit is specifically used for:

根据欲抽取信息的网页的URL及分块信息选择预先设定的模板;Select a pre-set template according to the URL and block information of the webpage to extract information;

所述的网页信息抽取单元具体用于:The web page information extraction unit is specifically used for:

使用根据欲抽取信息的网页的URL及分块信息所选择的预先设定模板对网页信息进行抽取。The webpage information is extracted by using a preset template selected according to the URL of the webpage to be extracted and the block information.

Further, the blocking unit includes a traversal unit 641 and a block area determination unit 642.

The traversal unit is configured to traverse all separator tags of the page;

the block area determination unit is configured to determine the block areas formed by consecutive separator tags.

Further, the block area determination unit includes a weight calculation unit 6421 and a second area determination unit 6422.

The weight calculation unit is configured to calculate, from the preset separator-tag weights, the weight of the area formed between separator tags;

the second area determination unit is configured to determine the block areas formed between separator tags whose weight is greater than a preset value.
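The blocking step can be sketched as a single pass over the page's tags. The interpretation below is an assumption: each separator tag carries a preset weight, the weights of a run of consecutive separator tags are summed, and a new block starts only where that sum exceeds a preset value. The tag set, the weights in `SEPARATOR_WEIGHTS`, and the default threshold are illustrative, not values from the patent.

```python
from html.parser import HTMLParser

# Assumed separator tags and weights; the text gives no concrete values.
SEPARATOR_WEIGHTS = {"hr": 3, "div": 2, "table": 2, "p": 1, "br": 1}


class Blocker(HTMLParser):
    """Splits page text into blocks at sufficiently heavy separator runs."""

    def __init__(self, threshold):
        super().__init__()
        self.threshold = threshold
        self.blocks = []
        self.current = []    # text pieces of the block being built
        self.run_weight = 0  # summed weight of separators since the last text

    def handle_starttag(self, tag, attrs):
        # Traversal: accumulate the weight of each separator tag.
        # Tags without a weight add 0 and do not interrupt the run.
        self.run_weight += SEPARATOR_WEIGHTS.get(tag, 0)

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Close the current block only if the run of separators before
        # this text is heavier than the preset threshold.
        if self.run_weight > self.threshold and self.current:
            self.blocks.append(" ".join(self.current))
            self.current = []
        self.run_weight = 0
        self.current.append(text)

    def close(self):
        super().close()
        if self.current:
            self.blocks.append(" ".join(self.current))
            self.current = []


def split_into_blocks(html, threshold=2):
    """Return the text blocks of a page for the given weight threshold."""
    blocker = Blocker(threshold)
    blocker.feed(html)
    blocker.close()
    return blocker.blocks
```

With the default threshold, a lone `<br>` or `<p>` keeps the surrounding text in the same block, while an `<hr>` or a `<div>` followed directly by a `<table>` starts a new one. Raising the threshold produces fewer, larger blocks.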

The above apparatus for extracting webpage information can execute the method for extracting webpage information provided by the embodiments of the present invention, and has the functional modules and beneficial effects corresponding to that method.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

Those of ordinary skill in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be fabricated as a separate integrated circuit module, or several of the modules or steps can be fabricated as a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.

The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for parts that are the same or similar, the embodiments may be referred to one another.

The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (10)

CN201410804430.9A | 2014-12-19 | 2014-12-19 | A kind of abstracting method and device of webpage information | Expired - Fee Related | CN104572874B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201410804430.9A (CN104572874B (en)) | 2014-12-19 | 2014-12-19 | A kind of abstracting method and device of webpage information

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201410804430.9A (CN104572874B (en)) | 2014-12-19 | 2014-12-19 | A kind of abstracting method and device of webpage information

Publications (2)

Publication Number | Publication Date
CN104572874A (en) | 2015-04-29
CN104572874B (en) | 2019-03-05

Family

ID=53088936

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201410804430.9A (Expired - Fee Related, CN104572874B (en)) | A kind of abstracting method and device of webpage information | 2014-12-19 | 2014-12-19

Country Status (1)

Country | Link
CN (1) | CN104572874B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101192234A (en) * | 2007-06-07 | 2008-06-04 | 腾讯科技(深圳)有限公司 | Searching system and method based on web page extraction
CN101916285A (en) * | 2010-08-20 | 2010-12-15 | 北京新岸线网络技术有限公司 | Internet webpage content analysis method and device
CN102651002A (en) * | 2011-02-28 | 2012-08-29 | 腾讯科技(深圳)有限公司 | Webpage information extracting method and system
CN102591971A (en) * | 2011-12-31 | 2012-07-18 | 北京百度网讯科技有限公司 | Method and device for extracting webpage information

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN105160209A (en) * | 2015-08-31 | 2015-12-16 | 佛山市恒南微科技有限公司 | System for investigating and managing regional enterprise software copyright announcement
CN106815273A (en) * | 2015-12-02 | 2017-06-09 | 北京国双科技有限公司 | Date storage method and device
CN106815273B (en) * | 2015-12-02 | 2020-07-31 | 北京国双科技有限公司 | Data storage method and device
CN110020236A (en) * | 2017-08-29 | 2019-07-16 | 北京国双科技有限公司 | Web analysis method, apparatus, storage medium, processor and equipment
CN109933717A (en) * | 2019-01-17 | 2019-06-25 | 华南理工大学 | A Recommendation System for Academic Conference Based on Hybrid Recommendation Algorithm
CN109933717B (en) * | 2019-01-17 | 2021-05-14 | 华南理工大学 | Academic conference recommendation system based on hybrid recommendation algorithm

Also Published As

Publication number | Publication date
CN104572874B (en) | 2019-03-05

Legal Events

Date | Code | Title | Description
 | C06 | Publication |
 | PB01 | Publication |
 | C10 | Entry into substantive examination |
 | SE01 | Entry into force of request for substantive examination |
 | GR01 | Patent grant |
 | GR01 | Patent grant |
 | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20190305
 | CF01 | Termination of patent right due to non-payment of annual fee |
