技术领域technical field
本发明涉及信息抽取技术领域,更具体地,涉及一种在多记录网页中所采用的信息抽取系统和抽取方法,能应用于传统多记录网页(如搜索引擎结果页面等)和新式多记录网页(如微博记录网页、论坛帖子网页、产品评论网页等),适用于多种不同媒介和不同领域。The present invention relates to the technical field of information extraction, and more specifically, relates to an information extraction system and extraction method used in multi-record webpages, which can be applied to traditional multi-record webpages (such as search engine result pages, etc.) and new multi-record webpages (Such as Weibo record pages, forum post pages, product review pages, etc.), applicable to many different media and fields.
背景技术Background technique
在现有技术中,有很多技术方法可用于多记录网页抽取。传统的信息抽取方法采用编写规则的方法,该方法能够准确快速地从特定的数据源中抽取出记录信息。但是当数据源规模增长成百上千个时,再依靠人工编写规则,会耗费大量的时间和精力,无法满足现在信息极速膨胀的处理需求。另一方面,各个数据源的网页模板不是一成不变的,一旦页面模板更新,就需要人工重新修改规则,造成巨大的维护成本。还有一些通过人工标注训练集来生成规则的方法,因为需要人为参与同样不适合抽取海量多变的多记录网页。In the prior art, there are many technical methods available for multi-record web page extraction. The traditional method of information extraction adopts the method of writing rules, which can accurately and quickly extract record information from specific data sources. However, when the scale of data sources grows to hundreds or thousands, relying on manual writing of rules will consume a lot of time and energy, which cannot meet the current processing needs of rapidly expanding information. On the other hand, the web page templates of each data source are not static. Once the page template is updated, the rules need to be manually re-modified, resulting in huge maintenance costs. There are also some methods of manually labeling training sets to generate rules, because human participation is also not suitable for extracting massive and changeable multi-record web pages.
在现有技术中,存在着一些针对传统多记录网页的自动抽取技术方法。传统多记录网页是由服务器的cgi程序从数据库检索出记录,然后以制定好的模板动态生成。由于有固定的模板,所以每条记录的结构相似度极高,十分规整。自动抽取方法能够根据一个或一类网页的特征自动抽取网页中相似的数据记录。在这些技术中,典型地使用记录结构相似度(Structure Similarity),并根据计算的相似度值确定记录区域。In the prior art, there are some technical methods for automatic extraction of traditional multi-record webpages. The traditional multi-record webpage is retrieved from the database by the server's cgi program, and then dynamically generated with a prepared template. Due to the fixed template, the structural similarity of each record is very high and very regular. The automatic extraction method can automatically extract similar data records in a web page according to the characteristics of one or a type of web page. In these techniques, typically record structure similarity (Structure Similarity) is used, and the record region is determined according to the calculated similarity value.
在现有技术中,还存在着一些针对新式多记录网页的自动抽取技术方法。新式多记录网页主体内容由网民自我创作,有很高的灵活性,记录外部结构相似,从网页看是一条一条记录,但记录内部结构差异性大,以微博记录为例,有些微博是原创微博,只有原创内容,而有些微博是转发微博,除原创内容外,还内嵌一条被转发的记录。自动抽取方法能够根据一个或一类网页的特征自动抽取网页中相似的数据记录。在这些技术中,典型地使用领域知识,利用每条记录中均出现且易于识别的元素来确定记录区域。In the prior art, there are also some automatic extraction techniques for new multi-record webpages. The main content of the new multi-record web page is created by netizens themselves, which has high flexibility. The external structure of the record is similar. From the web page, it is a record one by one, but the internal structure of the record is very different. Taking Weibo records as an example, some microblogs are Original microblogs only have original content, and some microblogs are reposted microblogs. In addition to original content, there is also a record of being reposted. The automatic extraction method can automatically extract similar data records in a web page according to the characteristics of one or a type of web page. In these techniques, domain knowledge is typically used to determine record regions using easily identifiable elements that occur in each record.
然而,新式多记录网页有其自身的特点,与传统多记录网页有所不同。针对传统多记录网页的抽取方法在计算新式网页结构相似度时得到的值普遍偏低,使得其不能正确识别记录区域;另外,现有针对新式多记录网页的抽取方法往往只关注某一媒介,拓展性不足。However, new multi-record pages have their own characteristics, which are different from traditional multi-record pages. The traditional multi-record web page extraction methods generally get low values when calculating the structural similarity of new web pages, which makes it impossible to correctly identify the record area; in addition, the existing multi-record web page extraction methods often only focus on a certain medium, Insufficient scalability.
现有的多记录网页抽取方法没有充分考虑新式多记录网页的结构特点,而且只能适用于某个媒介。随着近年来微博、论坛等社交媒介消息的不断产生,新式多记录网页已经拥有大量的数据资源,并需要通过数据挖掘技术来发现其中的热点话题、意见领袖等信息,这就对多记录信息抽取技术提出了一个挑战:如何构建一个统一有效的信息抽取系统来满足不同媒介的信息抽取需要。因此,迫切需要有一种高效准确的多记录抽取方法,该方法应能够自动定位网页中的记录区域,并将记录区域中的记录进行分割,同时能够在不同媒介、不同领域方便地使用。The existing multi-record webpage extraction methods do not fully consider the structural characteristics of the new multi-record webpage, and can only be applied to a certain medium. With the continuous generation of social media messages such as Weibo and forums in recent years, the new multi-record webpage already has a large amount of data resources, and it is necessary to use data mining technology to discover information such as hot topics and opinion leaders. Information extraction technology poses a challenge: how to build a unified and effective information extraction system to meet the information extraction needs of different media. Therefore, there is an urgent need for an efficient and accurate multi-record extraction method, which should be able to automatically locate the record area in the webpage, and divide the records in the record area, and can be used conveniently in different media and different fields.
发明内容Contents of the invention
本发明的目的在于克服现有技术的不足,提供一种多记录网页的信息抽取系统及方法,该系统及方法能够高效、准确地对传统和新式多记录网页进行信息抽取,抽取速度快、准确度高,通用性强,适用范围广。The purpose of the present invention is to overcome the deficiencies of the prior art and provide a system and method for extracting information from multi-record webpages. The system and method can efficiently and accurately extract information from traditional and new multi-record webpages, and the extraction speed is fast and accurate. High precision, strong versatility, wide application range.
为了实现上述目的,本发明的技术方案是:一种多记录网页的信息抽取系统,包括:一个网页预处理模块,用于将HTML网页转换为XHTML网页,并过滤网页中用来渲染显示效果的标签,然后根据标签的嵌套结构,构建文档次序树;一个记录区域定位模块,用于接收待抽取文档的文档次序树,并利用横向层次分析法在所述文档次序树中定位出记录区域的位置;一个记录分隔符识别模块,用于利用双向搜索方法从所述记录区域中找到记录之间的分隔符并进行存储;以及一个记录输出模块,用于将记录区域里所有文本节点按层次顺序遍历输出,在碰到分隔符时输出分隔线,得到最终的抽取结果。In order to achieve the above object, the technical solution of the present invention is: an information extraction system for multi-record webpages, including: a webpage preprocessing module for converting HTML webpages into XHTML webpages, and filtering webpages used to render display effects tags, and then construct a document sequence tree according to the nested structure of the tags; a record area positioning module is used to receive the document sequence tree of the document to be extracted, and use the horizontal analytic hierarchy process to locate the location of the record area in the document sequence tree position; a record delimiter recognition module, used to find and store the delimiter between records from the record area by using a bidirectional search method; and a record output module, used to arrange all text nodes in the record area in hierarchical order Traverse the output, output the separator line when encountering the separator, and get the final extraction result.
进一步地,所述网页预处理模块包括SAX解析器,用于对XHTML网页代码进行解析,以构建文档次序树。Further, the web page preprocessing module includes a SAX parser for parsing XHTML web page codes to build a document sequence tree.
进一步地,所述SAX解析器包括4个事件处理器,分别为startDocument事件处理器、endDocument事件处理器、startElement事件处理器、endElement事件处理器;所述4个事件处理器分别包含了预先定义好的一系列操作,所述4个事件处理器按照解析标签的顺序依次被触发、执行。Further, the SAX parser includes 4 event handlers, which are startDocument event handler, endDocument event handler, startElement event handler, endElement event handler; the 4 event handlers respectively include predefined A series of operations, the four event handlers are triggered and executed sequentially in the order of parsing tags.
进一步地,所述记录区域定位模块利用横向层次分析法在所述文档次序树中定位出记录区域的位置,包括以下步骤:步骤a1:先序遍历文档次序树,统计并记录每个节点相同子节点个数,找到相同子节点个数最多的节点,并将以该节点为根的子树确定为候选记录区域1且该节点要满足其子树内某个文本节点字符数大于文档次序树文本节点平均字符数;步骤a2:在对文档次序树进行遍历的同时,确定文档次序树中位于中间子树且字符数最多的文本节点位置;确定了文本节点位置即确定了从该节点到文档次序树根节点的唯一路径,选择该路径上相同子节点数最多的节点,将以该节点为根的子树确定为候选记录区域2;步骤a3:如果候选记录区域1的根节点和候选记录区域2的根节点相同,则候选记录区域1和候选记录区域2所代表的子树为最终记录区域,否则计算两棵子树根节点相同子节点个数的商值,引入阈值θ ,如果商值大等于阈值,说明记录条数少,需要依靠最多字符数文本节点来准确定位记录区域,选择候选记录区域2为最终记录区域;如果商值小于阈值,说明记录条数多,选择候选记录区域1作为最终记录区域。Further, the recording area locating module locates the position of the recording area in the document order tree by using the horizontal analytic hierarchy process, including the following steps: Step a1: Traversing the document order tree in order, counting and recording the same children of each node The number of nodes, find the node with the largest number of identical child nodes, and determine the subtree rooted at this node as the candidate record area 1, and the node must satisfy that the number of characters of a text node in its subtree is greater than that of the document sequence tree text The average number of characters of a node; step a2: While traversing the document order tree, determine the position of the text node in the middle subtree with the largest number of characters in the document order tree; determining the position of the text node means determining the order from the node to the document The only path to the root node of the tree, select the node with the largest number of identical child nodes on the path, and determine the subtree rooted at this node as the candidate record area 2; Step a3: If the root node of the candidate record area 1 and the candidate record area 2 have the same root node, then the subtree represented by the candidate record area 1 and the candidate record area 2 is the final record area, otherwise, calculate the quotient of the number of the same child nodes at the root nodes of the two subtrees, and introduce a thresholdθ , if the quotient is large It is equal to the threshold, indicating that the number of records is small, and it is necessary to rely on the text node with the largest number of characters to accurately locate the recording area, and select candidate recording area 2 as the final recording area; if the quotient value is less than the threshold, it indicates that there are many records, and select candidate recording area 1 as Final record area.
进一步地,所述记录分隔符识别模块从所述记录区域中查找记录之间的分隔符,包括以下步骤:步骤b1:将根节点的所有子节点定义为记录区域节点块,在记录区域节点块中利用双向搜索方法查找能覆盖最多节点的非重叠重复子块;利用双向搜索方法查找重复子块的方法为:首先确定记录区域内字符数最多的文本节点,然后以该文本节点为基准,双向拓展来寻找重复子块;步骤b2:在记录区域节点块中进行重复子块的匹配,在所有匹配到的第一个节点位置插入分隔符,以对记录区域内记录进行分割。Further, the record delimiter identification module searches for the delimiter between records in the record area, including the following steps: Step b1: define all child nodes of the root node as a record area node block, and in the record area node block Use the bidirectional search method to find the non-overlapping repeated sub-blocks that can cover the most nodes; the method to use the bidirectional search method to find the repeated sub-blocks is: first determine the text node with the largest number of characters in the recording area, and then use this text node as a benchmark, bidirectional Expand to find repeated sub-blocks; step b2: match repeated sub-blocks in the node block of the record area, and insert separators at all matched first node positions to divide the records in the record area.
本发明还提供一种多记录网页的信息抽取方法,包括以下步骤:步骤1:由网页预处理模块将HTML网页转换为XHTML网页,并过滤网页中用来渲染显示效果的标签,然后根据标签的嵌套结构,构建文档次序树;步骤2:由记录区域定位模块接收来自预处理模块的文档次序树,通过对所述文档次序树的横向层次分析在所述文档次序树中定位出记录区域块的位置;步骤3:由记录分隔符识别模块接收来自记录区域定位模块的记录区域块位置,利用双向搜索方法从记录区域块中找到记录之间的分隔符并进行存储;步骤4:由记录输出模块将记录区域里所有文本节点按层次顺序遍历输出,在碰到分隔符时输出分隔线,得到最终的抽取结果。The present invention also provides a method for extracting information from multi-record webpages, comprising the following steps: Step 1: converting HTML webpages into XHTML webpages by the webpage preprocessing module, and filtering the tags used to render display effects in the webpages, and then according to the tags Nested structure, building a document sequence tree; step 2: receiving the document sequence tree from the preprocessing module by the record area positioning module, and locating the record area block in the document sequence tree by horizontal hierarchical analysis of the document sequence tree The position of; Step 3: The recording area block position from the recording area positioning module is received by the recording separator recognition module, and the separator between the records is found from the recording area block using a bidirectional search method and stored; Step 4: Output by the record The module traverses and outputs all the text nodes in the recording area in hierarchical order, and outputs the separator line when encountering a separator to obtain the final extraction result.
进一步地,在步骤1中,所述采用SAX解析器对网页文档进行解析,以构建文档次序树。Further, in step 1, the webpage document is parsed by using a SAX parser to construct a document sequence tree.
进一步地,所述SAX解析器包括4个事件处理器,分别为startDocument事件处理器、endDocument事件处理器、startElement事件处理器、endElement事件处理器;所述4个事件处理器分别包含了预先定义好的一系列操作,所述4个处理器按照解析标签的顺序依次被触发、执行。Further, the SAX parser includes 4 event handlers, which are startDocument event handler, endDocument event handler, startElement event handler, endElement event handler; the 4 event handlers respectively include predefined A series of operations, the four processors are triggered and executed sequentially in the order of parsing tags.
进一步地,在步骤2中,通过横向层次分析在所述文档次序树中定位出记录区域的位置,包括以下步骤:步骤a1:先序遍历文档次序树,统计并记录每个节点相同子节点个数,找到相同子节点个数最多的节点,并将以该节点为根的子树确定为候选记录区域1且该节点要满足其子树内某个文本节点字符数大于文档次序树文本节点平均字符数;步骤a2:在对文档次序树进行遍历的同时,确定文档次序树中位于中间子树且字符数最多的文本节点位置;确定了文本节点位置即确定了从该节点到文档次序树根节点的唯一路径,选择该路径上相同子节点数最多的节点,将以该节点为根的子树确定为候选记录区域2;步骤a3:如果候选记录区域1的根节点和候选记录区域2的根节点相同,则候选记录区域1和候选记录区域2所代表的子树为最终记录区域,否则计算两棵子树根节点相同子节点个数的商值,引入阈值θ ,如果商值大等于阈值,说明记录条数少,需要依靠最多字符数文本节点来准确定位记录区域,选择候选记录区域2为最终记录区域;如果商值小于阈值,说明记录条数多,选择候选记录区域1作为最终记录区域。Further, in step 2, the position of the recording area is located in the document order tree through horizontal hierarchical analysis, including the following steps: Step a1: Traversing the document order tree in advance, counting and recording the number of identical child nodes of each node Find the node with the largest number of identical child nodes, and determine the subtree rooted at this node as the candidate record area 1, and the node must satisfy that the number of characters of a text node in its subtree is greater than the average number of text nodes in the document order tree number of characters; step a2: while traversing the document order tree, determine the position of the text node in the middle subtree with the largest number of characters in the document order tree; determining the position of the text node means determining from this node to the root of the document order tree The only path of a node, select the node with the largest number of identical child nodes on the path, and determine the subtree rooted at this node as the candidate record area 2; Step a3: If the root node of the candidate record area 1 and the candidate record area 2 If the root nodes are the same, the subtrees represented by the candidate record area 1 and the candidate record area 2 are the final record area, otherwise, calculate the quotient of the same number of child nodes at the root nodes of the two subtrees, and introduce a thresholdθ , if the quotient is greater than the threshold , indicating that the number of records is small, and it is necessary to rely on the text node with the largest number of characters to accurately locate the record area, and select the candidate record area 2 as the final record area; if the quotient value is less than the threshold, it means that the number of records is large, and select the candidate record area 1 as the final record area.
进一步地,在步骤3中,从记录区域中查找记录之间的分隔符,包括以下步骤:步骤b1:将根节点的所有子节点定义为记录区域节点块,在记录区域节点块中利用双向搜索方法查找能覆盖最多节点的非重叠重复子块;利用双向搜索方法查找重复子块的方法为:首先确定记录区域内字符数最多的文本节点,然后以该文本节点为基准,双向拓展来寻找重复子块;步骤b2:在记录区域节点块中进行重复子块的匹配,在所有匹配到的第一个节点位置插入分隔符,以对记录区域内记录进行分割。Further, in step 3, searching for the delimiter between the records from the record area includes the following steps: Step b1: define all child nodes of the root node as a record area node block, and use bidirectional search in the record area node block The method finds non-overlapping repeated sub-blocks that can cover the most nodes; the method of using the bidirectional search method to find repeated sub-blocks is as follows: first determine the text node with the largest number of characters in the recording area, and then use the text node as a benchmark to expand bidirectionally to find duplicates Sub-block; Step b2: Perform repeated sub-block matching in the node block of the record area, and insert a delimiter at all matched first node positions, so as to divide the records in the record area.
相较于现有技术,本发明的有益效果是可以高效、准确地对传统多记录网页(如搜索引擎结果页面等)和新式多记录网页(如微博记录网页、论坛帖子网页、产品评论网页等)进行信息抽取,克服了现有的多记录网页抽取方法对新式多记录网页适用性差的问题,不仅抽取速度快,准确度高,稳定性高,而且通用性强,适用范围广,能够在不同媒介、不同领域方便地应用,具有很强的实用性和广阔的应用前景。Compared with the prior art, the beneficial effect of the present invention is that the traditional multi-record web pages (such as search engine result pages, etc.) and new multi-record web pages (such as Weibo record web pages, forum post web pages, product review web etc.) for information extraction, which overcomes the problem of poor applicability of the existing multi-record webpage extraction method to new multi-record webpages. It not only has fast extraction speed, high accuracy and high stability, but also has strong versatility and wide application range. It is conveniently applied in different media and different fields, and has strong practicability and broad application prospects.
附图说明Description of drawings
下面结合附图及具体实施例对本发明作进一步说明。The present invention will be further described below in conjunction with the accompanying drawings and specific embodiments.
图1是本发明实施例的系统结构示意图。Fig. 1 is a schematic diagram of the system structure of the embodiment of the present invention.
图2是本发明实施例中记录分隔实例示意图。Fig. 2 is a schematic diagram of an example of record separation in the embodiment of the present invention.
具体实施方式detailed description
如图1所示,本发明多记录网页的信息抽取系统,包括:As shown in Figure 1, the information extraction system of multi-record web pages of the present invention includes:
(1)网页预处理模块,用于将HTML网页转换为XHTML网页,并过滤网页中用来渲染显示效果的标签,如<script>等,然后根据标签的嵌套结构,构建文档次序树,以记录标签父子节点关系;网页预处理模块采用SAX机制进行网页解析,网页预处理模块对网页的处理结果用文档次序树进行组织;(1) The webpage preprocessing module is used to convert HTML webpages into XHTML webpages, and filter the tags used to render display effects in the webpages, such as <script>, etc., and then build a document sequence tree according to the nested structure of the tags, to Record the label parent-child node relationship; the webpage preprocessing module uses the SAX mechanism to analyze the webpage, and the webpage preprocessing module organizes the processing results of the webpage with the document sequence tree;
(2)记录区域定位模块,用于接收待抽取文档的文档次序树,并利用横向层次分析法在文档次序树中定位出记录区域的位置;(2) The record area positioning module is used to receive the document sequence tree of the document to be extracted, and use the horizontal analytic hierarchy process to locate the position of the record area in the document sequence tree;
(3)记录分隔符识别模块,用于从记录区域中找到记录之间的分隔符并进行存储;(3) Record delimiter identification module, used to find and store the delimiter between records from the record area;
(4)记录输出模块,用于将记录区域里所有文本节点按层次顺序遍历输出,在碰到分隔符时输出分隔线,得到最终的抽取结果。(4) The record output module is used to traverse and output all the text nodes in the record area in hierarchical order, and output the separator when a separator is encountered to obtain the final extraction result.
下面分别详细描述各模块的实现方案。The implementation scheme of each module is described in detail below.
(1)网页预处理模块(1) Web page preprocessing module
首先,描述网页预处理模块中的SAX解析器如何进行网页解析,即,如何把网页HTML代码转换成文档次序树。First, describe how the SAX parser in the webpage preprocessing module performs webpage parsing, that is, how to convert the HTML code of the webpage into a document sequence tree.
目前,网页代码的处理机制主要有两种类型:通过构建DOM树来进行网页代码分析的DOM解析器,以及通过定义事件来进行网页代码分析的SAX解析器。为了能够处理大文档,本发明中,采用SAX解析器对网页进行解析。At present, there are mainly two types of processing mechanisms for webpage codes: a DOM parser that analyzes webpage codes by building a DOM tree, and a SAX parser that analyzes webpage codes by defining events. In order to be able to process large documents, in the present invention, a SAX parser is used to parse the webpage.
网页预处理模块包括SAX解析器,用于对XHTML网页代码进行解析,以构建文档次序树。SAX解析器包括4个事件处理器:startDocument事件处理器、endDocument事件处理器、startElement事件处理器、endElement事件处理器,4个事件处理器分别包含了预先定义好的一系列操作,4个事件处理器按照解析标签的顺序依次被触发、执行。当解析到文档第一个标签如<xml>时触发事件处理器startDocument(),然后会执行处理器里预先定义好的一系列操作,执行好后继续进行解析,当解析到开始标签如<head>时触发事件处理器startElement(),当解析到结束标签如</head>时触发事件处理器endElement()。这样依次解析下去直到解析到最后一个标签</xml>触发事件处理器endDocument()。这样在通读一遍网页后,对它的所有解析工作也由事件处理器中预先定义好的操作顺序完成。The web page preprocessing module includes a SAX parser for parsing XHTML web page codes to build a document sequence tree. The SAX parser includes 4 event handlers: startDocument event handler, endDocument event handler, startElement event handler, endElement event handler, 4 event handlers contain a series of predefined operations, 4 event handlers The triggers are triggered and executed sequentially in the order in which the tags are parsed. When the first tag of the document such as <xml> is parsed, the event handler startDocument() is triggered, and then a series of operations pre-defined in the handler will be executed. After execution, the parsing will continue. When the start tag such as <head is parsed > triggers the event handler startElement(), and triggers the event handler endElement() when parsing to an end tag such as </head>. In this way, the parsing continues until the last tag </xml> triggers the event handler endDocument(). In this way, after the web page is read through, all the parsing work on it is also completed by the pre-defined operation sequence in the event handler.
为了能构造类似于DOM树的结构,在SAX处理机制下,利用文档次序索引构建标签树。即按照遍历标签的顺序依次编号,并记录下编号所对应标签的父子关系,从而构建文档次序树。In order to construct a structure similar to the DOM tree, under the SAX processing mechanism, the document order index is used to construct the tag tree. That is, number the tags in sequence according to the order of traversing the tags, and record the parent-child relationship of the tags corresponding to the numbers, so as to build a document sequence tree.
(2)记录区域定位模块(2) Recording area positioning module
其次,描述记录区域定位模块是如何确定记录区域块的。主要思想是对文档次序树进行横向层次分析,将同一个父节点下同层节点看成一个大的节点块,将记录区域定位的问题转换为寻找节点块的相似子块。Next, describe how the recording area positioning module determines the recording area block. The main idea is to perform horizontal hierarchical analysis on the document sequence tree, regard the same-level nodes under the same parent node as a large node block, and convert the problem of recording area positioning into finding similar sub-blocks of the node block.
我们提出以下2个假设:We make the following 2 hypotheses:
假设1:包含相似节点数多的节点块更有可能是记录区域节点块。Hypothesis 1: A node block containing a large number of similar nodes is more likely to be a record area node block.
假设2:字符数多的文本节点更有可能是记录区域内的文本节点。Hypothesis 2: A text node with a large number of characters is more likely to be a text node in the record area.
根据假设对文档次序树进行先序遍历,统计并记录每个节点相同子节点个数,将相同子节点个数最多的节点作为候选记录区域1的根节点且该节点要满足其子树内某个文本节点字符数大于文档次序树文本节点平均字符数。增加该判断是为了排除网页菜单栏等其它非记录区域节点块的干扰。因为菜单栏节点块也会有大量相似节点,但它的文本节点字符数通常较少,所以能利用字符数多少进行区分。According to the assumptions, the document sequence tree is traversed in order, and the number of identical child nodes of each node is counted and recorded. The node with the largest number of identical child nodes is used as the root node of the candidate record area 1 and the node must satisfy a certain requirement in its subtree. The number of characters of a text node is greater than the average number of characters of a document order tree text node. This judgment is added to eliminate the interference of other non-recording area node blocks such as the menu bar of the webpage. Because the menu bar node block also has a large number of similar nodes, but its text node characters are usually less, so it can be distinguished by the number of characters.
在对树进行遍历的同时确定文档次序树中位于中间子树字符数最多的文本节点位置。确定了文本节点位置即确定了从该节点到文档次序树根节点的唯一路径,选择该路径上相同子节点数最多的节点,以该节点为根的子树为候选记录区域2。对最多字符数文本节点位置进行判断的目的是为了排除非记录区域内的长文本节点的干扰,如网页尾部的版权申明等。因为要抽取的记录文本通常位于整个HTML代码的中部,在DOM树中则是中间子树的位置。While traversing the tree, determine the position of the text node with the most characters in the middle subtree in the document sequence tree. Determine the position of the text node means determine the unique path from the node to the root node of the document sequence tree, select the node with the largest number of identical child nodes on the path, and use the node as the root of the subtree as the candidate record area 2. The purpose of judging the position of the text node with the maximum number of characters is to eliminate the interference of long text nodes in the non-record area, such as the copyright statement at the end of the web page. Because the record text to be extracted is usually located in the middle of the entire HTML code, in the DOM tree it is the position of the middle subtree.
如果候选记录区域1的根节点和候选记录区域2的根节点相同,则它们所代表的子树为最终记录区域。否则计算两棵子树根节点相同子节点个数的商值(子树1相同子节点个数比子树2相同子节点个数),引入阈值θ ,如果商值大等于阈值说明记录条数少,需要依靠最多字符数文本节点来更加准确定位记录区域,所以选择候选记录区域2为最终记录区域;当商值小于阈值时,说明记录条数多,这时选择候选记录区域1作为最终记录区域更可靠。通常阈值θ 取0.3时,对大多数记录网页抽取效果较好,可以根据记录条数实际情况调节阀值θ 来获得最佳抽取效果。If the root node of the candidate recording area 1 and the root node of the candidate recording area 2 are the same, the subtree they represent is the final recording area. Otherwise, calculate the quotient of the number of identical child nodes in the root nodes of the two subtrees (the number of identical child nodes in subtree 1 is greater than the number of identical child nodes in subtree 2), and introduce a thresholdθ . If the quotient is greater than the threshold, it means that the number of records is small , it is necessary to rely on the text node with the largest number of characters to locate the recording area more accurately, so the candidate recording area 2 is selected as the final recording area; when the quotient value is less than the threshold, it means that there are many records, and the candidate recording area 1 is selected as the final recording area more reliable. Usually, when the thresholdθ is set to 0.3, the extraction effect for most recorded web pages is better, and the thresholdθ can be adjusted according to the actual situation of the number of records to obtain the best extraction effect.
(3)记录分隔符识别模块(3) Record separator identification module
再次,描述记录分隔符识别模块是如何识别记录之间的分隔符的。将根节点的所有子节点定义为记录区域节点块。第一步,在记录区域节点块中利用后缀数组找到能覆盖最多节点的非重叠重复子块。为了简化查找重复子块的过程,可以首先确定记录区域内字符数最多的文本节点,以该文本节点为基准双向拓展来寻找重复子块。第二步,在记录区域节点块中进行重复子块的匹配,在所有匹配到的第一个节点位置插入分隔符,这样就能对记录区域内记录进行分割。Again, describe how the record delimiter recognition module recognizes delimiters between records. Defines all child nodes of the root node as record area node blocks. The first step is to use the suffix array to find the non-overlapping repeated sub-blocks that can cover the most nodes in the node block of the record area. In order to simplify the process of finding repeated sub-blocks, the text node with the largest number of characters in the recording area can be determined first, and the repeated sub-blocks can be found by bidirectional expansion based on the text node. In the second step, the repeated sub-blocks are matched in the node block of the record area, and separators are inserted in all matched first node positions, so that the records in the record area can be divided.
(4)记录输出模块(4) Record output module
最后,描述记录输出模块。记录输出模块将记录区域里所有文本节点按层次顺序遍历输出,在碰到分隔符时输出分隔线,得到最终的抽取结果。Finally, the record output module is described. The record output module traverses and outputs all the text nodes in the record area in hierarchical order, and outputs the separator when a separator is encountered to obtain the final extraction result.
本发明最大的创新点是对文档次序树进行横向层次分析,将同一个父节点下同层节点看成一个大的节点块,将寻找相似子树的问题转换为寻找节点块的相似子块。如图1所示的文档次序树中第一层有1个节点块为{<head>、<body>},第二层有1个节点块为{<p>、<ul>、<a>},第三层有1个节点块{<li>、<li>、<li>、<li>、<li>},第四层有5个节点块按照从左到右的顺序分别为{<a>、<T>、<a>}、{<T>、<a>、<a>}、{<T>、<T>}、{<T>}、{<T>},其中<T>表示的是文本节点。如果虚线框内的节点组成记录区域块,则我们将该区域块的根节点称为记录区域根节点(图中为<ul>),根节点所有子节点所构成的节点块称为记录区域节点块(图中为{<li>、<li>、<li>、<li>、<li>})。The biggest innovation of the present invention is to perform horizontal hierarchical analysis on the document sequence tree, regard the same layer nodes under the same parent node as a large node block, and convert the problem of finding similar subtrees into finding similar subblocks of node blocks. In the document sequence tree shown in Figure 1, there is a node block in the first layer as {<head>, <body>}, and a node block in the second layer as {<p>, <ul>, <a> }, the third layer has 1 node block {<li>, <li>, <li>, <li>, <li>}, and the fourth layer has 5 node blocks in order from left to right as { <a>, <T>, <a>}, {<T>, <a>, <a>}, {<T>, <T>}, {<T>}, {<T>}, where <T> represents a text node. If the nodes in the dotted box form a record area block, then we call the root node of the area block the root node of the record area (<ul> in the figure), and the node block composed of all child nodes of the root node is called the record area node block ({<li>, <li>, <li>, <li>, <li>} in the diagram).
据此,本发明提出了多记录网页的信息抽取方法,包括以下步骤:Accordingly, the present invention proposes an information extraction method for multi-record web pages, comprising the following steps:
步骤1:由网页预处理模块将HTML网页转换为XHTML网页,并过滤网页中用来渲染显示效果的标签,如<script>等,然后根据标签的嵌套结构,构建文档次序树;Step 1: Convert the HTML webpage into an XHTML webpage by the webpage preprocessing module, and filter the tags used to render the display effect in the webpage, such as <script>, etc., and then build a document sequence tree according to the nested structure of the tags;
步骤2:由记录区域定位模块接收来自预处理模块的文档次序树,通过对所述文档次序树的横向层次分析在所述文档次序树中定位出记录区域块的位置;Step 2: The document sequence tree from the preprocessing module is received by the record area positioning module, and the position of the record area block is located in the document sequence tree through horizontal hierarchical analysis of the document sequence tree;
步骤3:由记录分隔符识别模块接收来自记录区域定位模块的记录区域块位置,利用双向搜索方法从记录区域块中找到记录之间的分隔符并进行存储;Step 3: The recording area block position from the recording area positioning module is received by the recording separator recognition module, and the separator between the records is found from the recording area block by using a bidirectional search method and stored;
步骤4:由记录输出模块将记录区域里所有文本节点按层次顺序遍历输出,在碰到分隔符时输出分隔线,得到最终的抽取结果。Step 4: The record output module traverses and outputs all the text nodes in the record area in hierarchical order, and outputs the separator line when encountering a separator to obtain the final extraction result.
在步骤1中,所述采用SAX解析器对网页文档进行解析,以构建文档次序树。所述SAX解析器包括4个事件处理器,分别为startDocument事件处理器、endDocument事件处理器、startElement事件处理器、endElement事件处理器;所述4个事件处理器分别包含了预先定义好的一系列操作,所述4个事件处理器按照解析标签的顺序依次被触发、执行。In step 1, the webpage document is parsed by using a SAX parser to construct a document sequence tree. The SAX parser includes 4 event handlers, respectively startDocument event handler, endDocument event handler, startElement event handler, endElement event handler; the 4 event handlers respectively include a series of pre-defined Operation, the four event handlers are triggered and executed sequentially in the order of parsing tags.
在步骤2中,通过横向层次分析在所述文档次序树中定位出记录区域的位置,包括以下步骤:In step 2, the location of the record area is located in the document sequence tree through horizontal hierarchical analysis, including the following steps:
步骤a1:先序遍历文档次序树,统计并记录每个节点相同子节点个数,找到相同子节点个数最多的节点,并将以该节点为根的子树确定为候选记录区域1且该节点要满足其子树内某个文本节点字符数大于文档次序树文本节点平均字符数;Step a1: Traverse the document sequence tree in order, count and record the number of identical child nodes of each node, find the node with the largest number of identical child nodes, and determine the subtree rooted at this node as the candidate record area 1 and the The node must satisfy that the number of characters of a certain text node in its subtree is greater than the average number of characters of the text node of the document sequence tree;
步骤a2:在对文档次序树进行遍历的同时,确定文档次序树中位于中间子树且字符数最多的文本节点位置;确定了文本节点位置即确定了从该节点到文档次序树根节点的唯一路径,选择该路径上相同子节点数最多的节点,将以该节点为根的子树确定为候选记录区域2;Step a2: While traversing the document order tree, determine the position of the text node that is located in the middle subtree and has the largest number of characters in the document order tree; determining the position of the text node means determining the unique path from this node to the root node of the document order tree Path, select the node with the largest number of identical child nodes on the path, and determine the subtree rooted at this node as the candidate record area 2;
步骤a3:如果候选记录区域1的根节点和候选记录区域2的根节点相同,则候选记录区域1和候选记录区域2所代表的子树为最终记录区域,否则计算两棵子树根节点相同子节点个数的商值,引入阈值θ ,如果商值大等于阈值,说明记录条数少,需要依靠最多字符数文本节点来准确定位记录区域,选择候选记录区域2为最终记录区域;如果商值小于阈值,说明记录条数多,选择候选记录区域1作为最终记录区域。Step a3: If the root node of the candidate recording area 1 is the same as the root node of the candidate recording area 2, then the subtree represented by the candidate recording area 1 and the candidate recording area 2 is the final recording area; otherwise, calculate the root node of the two subtrees with the same The quotient value of the number of nodes is introduced into the thresholdθ . If the quotient value is greater than the threshold value, it means that the number of records is small, and it is necessary to rely on the text node with the largest number of characters to accurately locate the record area, and select the candidate record area 2 as the final record area; if the quotient value If it is less than the threshold, it means that the number of records is large, and candidate recording area 1 is selected as the final recording area.
在步骤3中,从记录区域中查找记录之间的分隔符,包括以下步骤:In step 3, find the delimiter between records from the record area, including the following steps:
步骤b1:将根节点的所有子节点定义为记录区域节点块,在记录区域节点块中利用双向搜索方法查找能覆盖最多节点的非重叠重复子块;具体方法为:Step b1: Define all sub-nodes of the root node as node blocks in the record area, and use a bidirectional search method in the node block in the record area to find non-overlapping and repeated sub-blocks that can cover the most nodes; the specific method is:
用X表示记录区域节点块序列,其中为节点序列,为彼此相同的子块,每个又可以表示为的节点序列,则问题定义为Use X to represent the sequence of record zone node blocks ,in is a sequence of nodes, for subblocks that are identical to each other, each can also be expressed as node sequence, then the problem is defined as
即需求得最优的子块使得所有和它相同的子块标签数之和加上的标签数最大。同时求得的不能包含重复的子块,即,表示节点序列和不能完全相同,否则,这里的m为偶数。That is, the optimally demanded sub-block such that all subblocks identical to it Number of tags sum plus of tags maximum. obtained at the same time cannot contain duplicate subblocks, i.e. , representing the sequence of nodes and cannot be identical, otherwise , where m is an even number.
我们利用上面的假设2来简化查找的过程,首先确定记录区域内字符数最多的文本节点,以该T节点为基准双向拓展来寻找重复子块(查找的包含该T节点)。 以图2为例,图中带阴影的P节点为区域内字符数最多文本节点,第一轮查找的标签数为1,得到为12;第二轮标签数为2,采用向左向右拓展子块分别得到重复子块为aP和PP,此时得到值为8;以此类推在标签数为4,子块为PPPa时,此时得到为16,覆盖了最多节点,则最终确定为PPPa子块。We take advantage of Assumption 2 above to simplify the lookup The process, first determine the text node with the largest number of characters in the recording area, and use the T node as the basis to expand bidirectionally to find repeated sub-blocks (searched contains the T node). Taking Figure 2 as an example, the shaded P node in the figure is the text node with the largest number of characters in the area, and the first round of search The number of labels is 1, get for 12; second round The number of labels is 2, and the sub-blocks are expanded from left to right to obtain repeated sub-blocks as aP and PP, and the value is 8 at this time; and so on in When the number of tags is 4, and the sub-block is PPPa, then get is 16, covering the most nodes, then finalize It is a PPPa sub-block.
步骤b2:在记录区域节点块中进行重复子块的匹配,在所有匹配到的第一个节点位置插入分隔符,以对记录区域内记录进行分割。图2中sep即表示插入的记录分隔符。Step b2: Perform matching of repeated sub-blocks in the node block of the record area, and insert separators at the positions of all matched first nodes, so as to divide the records in the record area. In Figure 2, sep represents the inserted record separator.
以上是本发明的较佳实施例,凡依本发明技术方案所作的改变,所产生的功能作用未超出本发明技术方案的范围时,均属于本发明的保护范围。The above are the preferred embodiments of the present invention, and all changes made according to the technical solution of the present invention, when the functional effect produced does not exceed the scope of the technical solution of the present invention, all belong to the protection scope of the present invention.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410034376.4ACN103761312B (en) | 2014-01-24 | 2014-01-24 | Information extraction system and method for multi-recording webpage |
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410034376.4ACN103761312B (en) | 2014-01-24 | 2014-01-24 | Information extraction system and method for multi-recording webpage |
| Publication Number | Publication Date |
|---|---|
| CN103761312A CN103761312A (en) | 2014-04-30 |
| CN103761312Btrue CN103761312B (en) | 2017-02-08 |
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201410034376.4AActiveCN103761312B (en) | 2014-01-24 | 2014-01-24 | Information extraction system and method for multi-recording webpage |
| Country | Link |
|---|---|
| CN (1) | CN103761312B (en) |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104217025B (en)* | 2014-09-28 | 2018-04-13 | 福州大学 | For the entry extraction system and method for more record webpages |
| CN106294722B (en)* | 2016-08-09 | 2019-11-22 | 上海资誉网络科技有限公司 | A kind of web page contents extraction method and device |
| CN107122403B (en)* | 2017-03-22 | 2020-08-07 | 安徽大学 | Webpage academic report information extraction method and system |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1786965A (en)* | 2005-12-21 | 2006-06-14 | 北大方正集团有限公司 | Method for acquiring news web page text information |
| CN101192234A (en)* | 2007-06-07 | 2008-06-04 | 腾讯科技(深圳)有限公司 | Searching system and method based on web page extraction |
| CN101515287A (en)* | 2009-03-24 | 2009-08-26 | 崔志明 | Automatic generating method of wrapper of complex page |
| CN101582075A (en)* | 2009-06-24 | 2009-11-18 | 大连海事大学 | Web information extraction system |
| CN101727486A (en)* | 2009-12-04 | 2010-06-09 | 中国人民解放军信息工程大学 | Web forum information extraction system |
| CN101872350A (en)* | 2009-04-24 | 2010-10-27 | 富士通株式会社 | Web page text extraction method and device |
| EP2482206A1 (en)* | 2011-01-27 | 2012-08-01 | Samsung Electronics Co., Ltd. | Method and apparatus for web browsing of handheld device |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1786965A (en)* | 2005-12-21 | 2006-06-14 | 北大方正集团有限公司 | Method for acquiring news web page text information |
| CN101192234A (en)* | 2007-06-07 | 2008-06-04 | 腾讯科技(深圳)有限公司 | Searching system and method based on web page extraction |
| CN101515287A (en)* | 2009-03-24 | 2009-08-26 | 崔志明 | Automatic generating method of wrapper of complex page |
| CN101872350A (en)* | 2009-04-24 | 2010-10-27 | 富士通株式会社 | Web page text extraction method and device |
| CN101582075A (en)* | 2009-06-24 | 2009-11-18 | 大连海事大学 | Web information extraction system |
| CN101727486A (en)* | 2009-12-04 | 2010-06-09 | 中国人民解放军信息工程大学 | Web forum information extraction system |
| EP2482206A1 (en)* | 2011-01-27 | 2012-08-01 | Samsung Electronics Co., Ltd. | Method and apparatus for web browsing of handheld device |
| Title |
|---|
| 针对Web论坛的一种结构化数据自动抽取方法;关冕等;《山东大学学报(理学版)》;20100516;第42-47页* |
| Publication number | Publication date |
|---|---|
| CN103761312A (en) | 2014-04-30 |
| Publication | Publication Date | Title |
|---|---|---|
| CN104268148B (en) | A kind of forum page Information Automatic Extraction method and system based on time string | |
| CN101464905B (en) | Web page information extraction system and method | |
| CN103942335B (en) | Construction method of uninterrupted crawler system oriented to web page structure change | |
| CN103544255A (en) | Text semantic relativity based network public opinion information analysis method | |
| CN107229668A (en) | A kind of text extracting method based on Keywords matching | |
| CN103870506B (en) | Webpage information extraction method and system | |
| CN102135976B (en) | Hypertext markup language page structured data extraction method and device | |
| CN102890702A (en) | Internet forum-oriented opinion leader mining method | |
| CN102662969A (en) | Internet information object positioning method based on webpage structure semantic meaning | |
| CN101727498A (en) | Automatic extraction method of web page information based on WEB structure | |
| CN102591612A (en) | General webpage text extraction method based on punctuation continuity and system thereof | |
| Ferrara et al. | Automatic wrapper adaptation by tree edit distance matching | |
| CN104268230B (en) | A kind of Chinese micro-blog viewpoint detection method based on heterogeneous figure random walk | |
| CN102117289B (en) | Method and device for extracting comment content from webpage | |
| CN101950312A (en) | Method for analyzing webpage content of internet | |
| CN105824791B (en) | A kind of bibliography format checking method | |
| CN103559234A (en) | System and method for automated semantic annotation of RESTful Web services | |
| CN108733813A (en) | Information extracting method, system towards BBS forum Web pages contents and medium | |
| CN103136358A (en) | Method for automatically extracting BBS (bulletin board system) data | |
| CN103092973B (en) | information extraction method and device | |
| CN105740355A (en) | Aggregated text density based webpage body text extraction method and apparatus | |
| CN118194995A (en) | Method and device for acquiring key information of scientific and technological literature in field of global environment | |
| CN103761312B (en) | Information extraction system and method for multi-recording webpage | |
| CN107943929A (en) | The automatic generating method of wrapper being abstracted based on dom tree | |
| CN107145591A (en) | Title-based webpage effective metadata content extraction method |
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| C14 | Grant of patent or utility model | ||
| GR01 | Patent grant |