CN1581172A

Info

Publication number: CN1581172A
Application number: CN 200410070553
Authority: CN
Inventors: 刘金松; 于浩; 西野文人
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2003-08-08
Filing date: 2004-08-06
Publication date: 2005-02-16
Anticipated expiration: 2024-08-06
Also published as: CN100336061C

Abstract

The invention provides a multimedia object retrieval device and method for retrieving multimedia objects from structured documents that contain both multimedia objects and related annotation text. The device and method analyze an input structured document and represent it as an analysis result, such as a DOM tree; identify the main blocks in that analysis result and output a main-block-labeled structured document model; extract pairs of multimedia objects and their annotations and output a structured object index, such as an object index in XML format; and search the structured object index to obtain a list of target objects. The device and method are applicable to a wide variety of structured documents, extract object annotations with higher precision, extract the annotations common to a group of content objects, and identify the relationship between objects and the document title.

Description

Translated from Chinese

Multimedia object retrieval device and method

Technical field

The present invention relates to multimedia object retrieval. Specifically, it relates to a device and method for retrieving the annotations of multimedia objects, such as images, animations, video, audio, and tables, in structured documents such as web pages, XML files, and newspapers.

Background art

The development of Internet technology has made it easy and profitable to distribute commercial objects such as images, music, and movies over the Internet. On the other hand, it has also made it easier to copy and redistribute multimedia objects illegally. Such illegal copies can now be found almost everywhere on the Internet, which greatly reduces the profits of legitimate business. There is therefore a strong need for an Internet policing system that can track down these illegal objects. An image retrieval system is a typical object retrieval system.

Image retrieval has been a very active research area since the 1970s. One direction is mainly text-based; see Anna Bjarnestam, "Text-Based Hierarchical Image Classification and Retrieval of Stock Photography", presented at "The Challenge of Image Retrieval Conference", Newcastle upon Tyne, UK, February 25-26, 1999. The other relies on visual features of the data, such as color, texture, and shape, and is called content-based image retrieval; see J. P. Eakins and M. E. Graham, "Content-Based Image Retrieval", "Report to JISC Technology Application Programme", January 1999.

Besides being laborious and time-consuming, both approaches fail to exploit the advantages of the web page format. Moreover, surveys of image retrieval users show that they are far more interested in identifying the image and the actions it depicts than in the color, shape, and other visual features offered by most content-based retrieval systems; see C. Jorgensen, "Attributes of Images in Describing Tasks", "Information Processing and Management", Vol. 34, No. 2/3, pp. 161-174, 1998.

Another survey, of random photographs on the web, found that 93% had at least one caption and only 7% had no visible caption; see Neil C. Rowe, "Precise and Efficient Retrieval of Captioned Images", "the MARIE Project", 1999.

Researchers have therefore recently become increasingly interested in web-based image retrieval. They retrieve images on the Internet using elements such as metadata, the HTML title, the image URL, the alias, and anchor text, combined with graphical features; see Rong Zhao and William I. Grosky, "Narrowing the Semantic Gap - Improved Text-Based Web Document Retrieval Using Visual Features", "IEEE Transactions on Multimedia", 4(2), pp. 189-200, 2002.

Good results have been achieved, and commercial image retrieval systems, such as Google, have been built.

Fig. 1 is a block diagram of a conventional object retrieval system. The input is a structured document 101, such as a web page. First, a simple analysis unit 102 analyzes the input structured document 101. An annotation extraction unit 104 then extracts the annotation of each multimedia object from the analysis result 103 output by the analysis unit 102, simply by computing the distance between the multimedia object and the text, and outputs a multimedia object index 105. Finally, a multimedia object retrieval unit 106 compares the multimedia object index 105 with the retrieval request 107 entered by the user and returns a target object list 108.

As can be seen, the conventional object retrieval system still has several shortcomings.

First, annotations are conventionally extracted by computing the distance between an object and a piece of text. If the distance is below a threshold, the text is taken to be an annotation of that object; otherwise it is not. This algorithm is so simple that much useful information is lost, which degrades the performance of the object retrieval system. A web page usually contains a main text block or a repeated object block (hereinafter called a main block). If the main blocks of the page can be identified before the annotations of the multimedia objects are extracted, the efficiency of object retrieval can be greatly improved.

Second, the HTML title is clearly related in some way to the objects in the page, but only to some of them, not all. Because conventional multimedia object retrieval systems do not analyze the structure of the web page in detail, they cannot distinguish related objects from unrelated ones, and either treat the title as an annotation of every object or of none. This is clearly inappropriate. If the main block can be identified, the title can be assigned as an annotation only to the objects inside the main block, which improves system performance.

Third, a page containing more than one content object usually has, in addition to the annotations of the individual objects, a common annotation that describes what all the objects have in common. Conventional systems cannot handle this. If the main text block and the repeated object blocks can be identified, annotations can be divided into individual annotations and common annotations and extracted separately, which can greatly improve system performance.

Summary of the invention

The object of the present invention is to solve the problems in existing multimedia object retrieval and to provide a new device and method for analyzing the annotations of multimedia objects, such as images, animations, video, audio, and tables, in structured documents such as web pages, XML files, and newspapers.

According to one aspect of the present invention, there is provided a multimedia object retrieval device for retrieving multimedia objects from a structured document that contains both multimedia objects and related annotation text. The device comprises: an analysis unit, which analyzes the input structured document and represents it as an analysis result in a predetermined form; a main block identification unit, which analyzes the main blocks in the input analysis result and outputs a main-block-labeled structured document model; an object annotation extraction unit, which extracts pairs of multimedia objects and their corresponding annotations from the main-block-labeled structured document model, analyzes the annotations of the multimedia objects, extracts the keywords that actually annotate the content of the multimedia objects, deletes invalid annotations, and outputs a structured object index in a predetermined form; and a multimedia object retrieval unit, which searches the structured object index and obtains a list of target objects.

Preferably, the multimedia object retrieval device of the present invention further comprises a common annotation extraction unit, which extracts the common annotation of the multimedia objects within each main block according to common annotation extraction rules.

According to another aspect of the present invention, there is provided a multimedia object retrieval method for retrieving multimedia objects from a structured document that contains both multimedia objects and related annotation text. The method comprises the following steps: analyzing the input structured document and representing it as an analysis result; identifying the main blocks in the input analysis result and outputting a main-block-labeled structured document model; extracting pairs of multimedia objects and their corresponding annotations and outputting a structured object index; and searching the structured object index to obtain a list of target objects.

Preferably, the multimedia object retrieval method of the present invention further comprises a common annotation extraction step, in which the common annotation of the multimedia objects within each main block is extracted according to common annotation extraction rules.

Preferably, a main block in the present invention is a main text block or a repeated object block.

The device and method of the present invention can be applied to almost all types of structured documents. Obtaining annotations by identifying main text blocks and repeated object blocks not only extracts object annotations with higher precision, but also identifies the common annotation of a group of objects and the relationship between multimedia objects and the title of the structured document. The device and method of the invention can greatly improve the performance of multimedia object retrieval.

Description of drawings

The multimedia object retrieval device and method of the present invention are described in detail below with reference to the accompanying drawings. The same reference numerals in the drawings denote the same components or steps. In the drawings:

Fig. 1 is a block diagram of a conventional object retrieval system;

Fig. 2 is a schematic block diagram of the object retrieval system of the present invention;

Fig. 3 is a block diagram of the main block identification unit;

Fig. 4 is a block diagram of the main text block identification unit;

Fig. 5 is a block diagram of the repeated object block identification unit;

Fig. 6 is a block diagram of the object annotation extraction unit;

Fig. 7 is a block diagram of the object retrieval unit;

Fig. 8 is an example of an input web page containing four kinds of image objects (an example of multimedia objects);

Fig. 9 is an example of an HTML DOM tree (an example of an analysis result);

Fig. 10 is an example of a web page containing a main text block;

Fig. 11 is an example of a web page containing a repeated image block (an example of a repeated object block);

Fig. 12 is an example of the HTML tag stream (an example of a structured document tag stream) of a repeated image block (an example of a repeated object block);

Fig. 13 is an example of an output XML-format object index (an example of a structured object index) extracted from a web page (an example of a structured document).

Detailed description

Fig. 2 is a schematic block diagram of the object retrieval device of the present invention. The input to the device is a structured document 201, such as a web page. First, the analysis unit 202 converts the input structured document into an analysis result 203, such as a DOM (Document Object Model) tree. The main block identification unit 204 then identifies the main blocks of the structured document 201 from the analysis result 203 and outputs a main-block-labeled analysis result 205. Next, the multimedia object annotation extraction unit 206 extracts pairs of multimedia objects and their corresponding annotations and outputs a structured object index 207, such as an object index in XML format. Finally, the object analysis unit 208 compares the input request 209 with the structured object index 207, decides whether each candidate object is a target object, and returns the retrieval result as a target object list 210.

Because a structured document 201 such as input HTML source code is cumbersome to process directly, an analysis unit 202 such as an HTML parser was developed to represent the structured document 201 as an analysis result 203, such as an HTML DOM tree, for easier subsequent processing. Fig. 9 shows an example of an HTML DOM tree (an example of the analysis result 203).
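
As a minimal sketch of what such an analysis unit might do, the following Python fragment builds a simple tree from HTML source using the standard-library html.parser module. The TreeBuilder class, the dictionary layout of the nodes, and the VOID_TAGS set are illustrative assumptions and are not taken from the described embodiment.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset, for illustration).
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class TreeBuilder(HTMLParser):
    """Builds a nested-dictionary tree that stands in for a DOM tree."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "attrs": {}, "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1]["children"].append(
                {"tag": "#text", "text": text, "children": []}
            )


def parse(html):
    """Return the root of the tree built from an HTML string."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Text nodes are stored with the tag "#text" so that later steps can tell them apart from element nodes; a production parser would also need to handle malformed markup, which this sketch does not attempt.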

Fig. 3 shows the key steps of main block identification for an input structured document 201. The main block identification unit 204 may include a main text block identification unit 302 and a repeated object block identification unit 303. First, the main text block identification unit 302 and the repeated object block identification unit 303 each add labels to the input analysis result 203. The output of the main text block identification unit 302 is a main-text-block-labeled analysis result 304. The output of the repeated object block identification unit 303 is a repeated-object-block-labeled analysis result 305. Next, the labeling result combination unit 306 combines the two results into the main-block-labeled analysis result 205, in which both the main text blocks and the repeated object blocks are labeled.

Fig. 4 shows the key steps of main text block identification. The input is the analysis result 203 output by the analysis unit 202. First, the text length statistics unit 402 computes the text length of each node in the analysis result 203. Next, the central text node search unit 403 finds the central text nodes. The main text block calculation unit 404 then identifies the main text block. Once the main text block has been identified, the in-block object labeling unit 405 labels the multimedia objects inside it. This yields the main-text-block-labeled analysis result 304.

The text length statistics unit 402 computes the text length of each node in the analysis result 203. For a text node, the text length is the length of its content, except for invalid text nodes such as copyright notices, whose length is taken to be zero. Punctuation is removed from the content of a text node before its length is measured. If a node has child nodes, its text length is the total text length of its children.
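
The text length rule above can be sketched as a short recursive function. The Node class, the "#text" tag convention, the invalid flag, and the choice to ignore whitespace are assumptions made for this illustration; the description only states that punctuation is removed and that invalid text nodes count as zero.

```python
import string
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str                      # element name, or "#text" for a text node
    text: str = ""                # content, used only by text nodes
    invalid: bool = False         # e.g. a copyright notice
    children: List["Node"] = field(default_factory=list)


# ASCII punctuation plus a few common full-width marks (assumed set).
_PUNCTUATION = set(string.punctuation) | set("，。、；：？！“”‘’（）《》")


def text_length(node: Node) -> int:
    """Text length of a node: own content for text nodes, sum over children otherwise."""
    if node.tag == "#text":
        if node.invalid:
            return 0
        # Assumption: whitespace is skipped along with punctuation.
        return sum(
            1 for ch in node.text if ch not in _PUNCTUATION and not ch.isspace()
        )
    return sum(text_length(child) for child in node.children)
```

For example, a cell holding the text "Hello, world!" and an invalid copyright notice has a text length of 10 under these assumptions.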

The central text node search unit 403 finds the central text node of a node of the analysis result. The following rules decide whether a node has a central text node. First, if the node's text length is less than a predetermined value LEAST_MAIN_BLOCK_LENGTH (for example, 50), or it has no child nodes at all, it has no central text node. Next, all child nodes are traversed. If a child node is a table, and its text length is greater than a predetermined ratio MAX_CENTER_NODE_RATE (for example, 90%) of the node's text length, or its text length is greater than a predetermined value MAIN_BLOCK_LENGTH (for example, 200) and the ratio of the child's text length to the node's is greater than a predetermined value LEAST_CENTER_NODE_RATE (for example, 60%), then the node has a central text node, and that child node is the node's central text node.
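
A possible reading of the central text node rule is sketched below. The wording of the rule leaves the grouping of its conditions open; this sketch assumes the child must be a table and must then satisfy either the 90% ratio test or the combined length and 60% ratio test. The Node class with a precomputed text_len field is also an assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Example values quoted in the description.
LEAST_MAIN_BLOCK_LENGTH = 50
MAIN_BLOCK_LENGTH = 200
MAX_CENTER_NODE_RATE = 0.90
LEAST_CENTER_NODE_RATE = 0.60


@dataclass
class Node:
    tag: str
    text_len: int = 0             # assumed to be computed beforehand
    children: List["Node"] = field(default_factory=list)


def find_center_text_node(node: Node) -> Optional[Node]:
    """Return the child that is the node's central text node, or None."""
    if node.text_len < LEAST_MAIN_BLOCK_LENGTH or not node.children:
        return None
    for child in node.children:
        if child.tag != "table":
            continue
        ratio = child.text_len / node.text_len
        # Assumed grouping: table AND (ratio test OR (length test AND ratio test)).
        if ratio > MAX_CENTER_NODE_RATE or (
            child.text_len > MAIN_BLOCK_LENGTH and ratio > LEAST_CENTER_NODE_RATE
        ):
            return child
    return None
```

With these example thresholds, a table holding 300 of a body's 400 characters qualifies, whereas a table holding 70 of 100 characters does not.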

A main text block is a text passage in a structured document 201, such as a web page, that describes the main content of the input structured document 201. The main text block is usually related to the title of the structured document 201. Such passages often contain many multimedia objects, placed there to express ideas more clearly or to interest the reader. These objects are also related to the title of the structured document 201. Fig. 10 is an example of a main text block in a web page (a kind of structured document 201).

The main text block calculation unit 404 is described below.

Text length: main text blocks are identified primarily by text length. A node whose text is too short (text length less than a predetermined value LEAST_MAIN_TEXT_BLOCK_LENGTH), or which is a link text block, cannot be a main text block. A link text block is a node of the HTML DOM tree (an example of an analysis result) in which the link text length is greater than a predetermined value LEAST_LINK_BLOCK_LENGTH (for example, 30), the text length is less than a predetermined value MAIN_BLOCK_LENGTH (for example, 200), and the ratio of link text length to total text length is greater than a predetermined ratio LINK_BLOCK_RATE (for example, 80%). A node whose text length is greater than a predetermined value MAIN_TEXT_BLOCK_LENGTH (for example, 200), or whose text length as a fraction of the root node's text length is greater than a predetermined ratio MAIN_TEXT_BLOCK_RATE, can be identified as a main text block.

Keyword: a text passage that is long enough and contains the title of the structured document 201 (for example, the HTML title) is marked as a main text block.

HTML <body>: if no main text block is identified among the child nodes, a <body> whose text length is greater than MAIN_TEXT_BLOCK_LENGTH is set as the main text block.

Direction: if these rules were applied from top to bottom, the tags near the top of the tree would satisfy them all too easily, which would be meaningless. The rules are therefore applied from bottom to top. When two or more child nodes are identified as main text blocks, the node itself is also a main text block. If a node has a central text node, the node is a main text block exactly when its central text node is a main text block.
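
The length-based part of these rules can be sketched as follows. The description gives no example values for LEAST_MAIN_TEXT_BLOCK_LENGTH or MAIN_TEXT_BLOCK_RATE, so the values used here are placeholders chosen only for illustration; the function names and the flat integer arguments are likewise assumptions. The keyword, <body>, and direction rules are not covered.

```python
# Example values quoted in the description.
LEAST_LINK_BLOCK_LENGTH = 30
MAIN_BLOCK_LENGTH = 200
LINK_BLOCK_RATE = 0.80
MAIN_TEXT_BLOCK_LENGTH = 200

# No example values are given for these two; the numbers are placeholders.
LEAST_MAIN_TEXT_BLOCK_LENGTH = 50
MAIN_TEXT_BLOCK_RATE = 0.50


def is_link_text_block(text_len: int, link_text_len: int) -> bool:
    """True if the node is mostly link text, as in a navigation list."""
    if text_len == 0:
        return False
    return (
        link_text_len > LEAST_LINK_BLOCK_LENGTH
        and text_len < MAIN_BLOCK_LENGTH
        and link_text_len / text_len > LINK_BLOCK_RATE
    )


def passes_length_rule(text_len: int, link_text_len: int, root_text_len: int) -> bool:
    """Apply only the text length rule for main text block identification."""
    if text_len < LEAST_MAIN_TEXT_BLOCK_LENGTH:
        return False
    if is_link_text_block(text_len, link_text_len):
        return False
    if text_len > MAIN_TEXT_BLOCK_LENGTH:
        return True
    return root_text_len > 0 and text_len / root_text_len > MAIN_TEXT_BLOCK_RATE
```

A node with 100 characters of text, 90 of them inside links, is rejected as a link text block, while a 300-character node with little link text passes on length alone.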

Fig. 5 shows the key steps of repeated object block identification. The input is an analysis result 203, such as an HTML DOM tree. First, an object filtering unit, such as the invalid multimedia object labeling unit 502 shown in Fig. 5, labels invalid objects. The object count statistics unit 503 then computes the object count of each node in the analysis result 203. Next, the central object node search unit 504 finds the central object node of each node (for example, each HTML DOM tree node) in the analysis result 203. The repeated object block identification unit 505 then identifies the repeated object blocks. Finally, the in-pattern object labeling unit 506 labels each object in the repeated object blocks. This yields the repeated-object-block-labeled analysis result 305.

The invalid multimedia object labeling unit 502 automatically labels invalid objects such as decorative images. The objects in a web page can be divided into four classes: content objects, decorative objects, menu objects, and advertisement objects. Fig. 8 shows an example of all four.

Content objects: these objects have annotations or are located inside a main text block or a repeated object block.

Decorative objects: these objects are unrelated to the content of the web page; they exist only to make the page more attractive and more interesting to the user. Many decorative objects appear repeatedly.

Menu objects: many web pages have an image menu (an example of menu objects) made up of a list of objects. These objects link to other structured documents 201 (for example, web pages, subdirectory structured documents 201, and subdirectory pages of the website). They are usually located at the far left or the top of the input structured document 201.

Advertisement objects: there are often objects whose content is unrelated to the main idea of the current web page and which instead point to other commercial websites. These are called advertisement objects.

Of these four classes, only content objects are what an object search engine wants to offer the user. The other three classes are therefore treated as invalid objects. Content objects and invalid objects cannot be clearly separated until the annotation fields have been extracted and the main blocks identified. Initially, only some decorative objects can be found, through features such as object size and recurrence. In the invalid object labeling unit 502, invalid objects can be identified according to the following rule.

Decorative objects: an object is a decorative object if it is extremely long, that is, its height/width is less than a predetermined value RATE_OBJECT_TOO_LONG (for example, 1/4); or slender, that is, its height/width is greater than a predetermined value RATE_OBJECT_TOO_SLIM (for example, 4); or too small, that is, its height*width is less than a predetermined value SIZE_TOO_SMALL (for example, 900); or recurring, that is, it appears more than once. All other objects are provisionally set as candidate objects. An object whose size is unknown, that is, whose width and height are unknown, is also set as a candidate object.
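
The decorative object rule lends itself to a small predicate, sketched here with the example thresholds from the description. The function name, the use of None for an unknown dimension, and the decision to test recurrence before the unknown-size case are assumptions; the description does not say which takes precedence.

```python
from typing import Optional

# Example values quoted in the description.
RATE_OBJECT_TOO_LONG = 1 / 4   # height/width below this: extremely long
RATE_OBJECT_TOO_SLIM = 4       # height/width above this: slender
SIZE_TOO_SMALL = 900           # height*width below this: too small


def is_decorative(width: Optional[int], height: Optional[int], occurrences: int) -> bool:
    """Return True for a decorative object, False for a candidate object."""
    # Assumption: recurrence is checked first, even when the size is unknown.
    if occurrences > 1:
        return True
    # Unknown (or zero) dimensions: the object stays a candidate.
    if not width or not height:
        return False
    aspect = height / width
    return (
        aspect < RATE_OBJECT_TOO_LONG
        or aspect > RATE_OBJECT_TOO_SLIM
        or width * height < SIZE_TOO_SMALL
    )
```

Under these thresholds a 400x50 banner and a 20x20 icon are both labeled decorative, while a 200x150 photograph that appears once remains a candidate.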

The object count statistics unit 503 computes the object count of each node (for example, each HTML DOM tree node) in the analysis result 203. If a node is an object node and the object is a candidate object, its object count is 1; otherwise it is 0. If a node has child nodes, its object count is the sum of its children's object counts.

The central object node search unit 504 finds the central object node of the current node. Central object nodes are identified according to the following rules: if a node contains no objects, it has no central object node; if the object count of a child node is greater than MAX_CENTER_NODE_RATE (for example, 90%) of the node's object count, that child is the node's central object node.
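
The object count rule and the central object node rule can be sketched together. The Node class and its is_candidate_object flag are assumptions for illustration; the flag is taken to have been set by an earlier filtering step.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_CENTER_NODE_RATE = 0.90    # example value quoted in the description


@dataclass
class Node:
    tag: str
    is_candidate_object: bool = False   # object node that was not filtered out
    children: List["Node"] = field(default_factory=list)


def object_count(node: Node) -> int:
    """Number of candidate objects in the subtree rooted at node."""
    if node.children:
        return sum(object_count(child) for child in node.children)
    return 1 if node.is_candidate_object else 0


def find_center_object_node(node: Node) -> Optional[Node]:
    """Return the child holding more than 90% of the node's objects, or None."""
    total = object_count(node)
    if total == 0:
        return None
    for child in node.children:
        if object_count(child) > MAX_CENTER_NODE_RATE * total:
            return child
    return None
```

The sketch recomputes counts on each call for brevity; an implementation working on large pages would more likely cache the count on each node.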

The repeated object pattern calculation unit 505 uses the following rules to identify repeated object patterns.

Object count: if a node contains fewer than 2 objects, it cannot be a repeated object block.

Tags of the structured document: taking an HTML file as an example, a node that is not <body>, <table>, or <tr> cannot be a repeated object block.

HTML tag stream of the child nodes: here, the tag stream of a DOM tree node is the list of HTML tags found by a depth-first traversal. Fig. 12 is an example. The HTML tag stream of the <table> node is "<table>, <tr>, <td>, <img>, <td>, <img>, <td>, <img>, <tr>, <td>, <txt>, <td>, <txt>, <td>, <txt>, <tr>, <td>, <img>, <td>, <img>, <td>, <img>, <tr>, <td>, <txt>, <td>, <txt>, <td>, <txt>". <img> denotes an image node of the DOM tree (an example of an object node), and <txt> denotes a text node of the DOM tree. For this purpose, the tag <img> is treated as identical to the tag <txt>. If two or more child nodes have the same tag stream, the node can be considered a repeated object block. If the node is a <table> node, the repeated pattern should be in its <tr> child nodes and should contain one or more objects or pieces of text. If the node is a <tr> node, the repeated pattern should be in its <td> child nodes. The <table> node above is a repeated object block because it is a <table> node and contains six objects in two rows, and its child nodes have the same tag stream.

Direction: unlike main text block identification, repeated object blocks are identified from top to bottom.
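
The tag stream comparison can be sketched as below. The Node class, the "obj" placeholder used to treat <img> and <txt> as the same tag, and the reading of the threshold as "at least two children with the same stream" are assumptions made for this illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple

BLOCK_TAGS = ("body", "table", "tr")          # tags allowed for a repeated object block
PATTERN_CHILD = {"table": "tr", "tr": "td"}   # where the repeated pattern must sit


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


def tag_stream(node: Node) -> Tuple[str, ...]:
    """Depth-first list of tags, with img and txt mapped to one placeholder."""
    tag = "obj" if node.tag in ("img", "txt") else node.tag
    stream = [tag]
    for child in node.children:
        stream.extend(tag_stream(child))
    return tuple(stream)


def count_objects(node: Node) -> int:
    """Number of image nodes in the subtree (all assumed to be candidates)."""
    own = 1 if node.tag == "img" else 0
    return own + sum(count_objects(child) for child in node.children)


def is_repeated_object_block(node: Node) -> bool:
    if node.tag not in BLOCK_TAGS or count_objects(node) < 2:
        return False
    wanted = PATTERN_CHILD.get(node.tag)
    streams = Counter(
        tag_stream(child)
        for child in node.children
        if wanted is None or child.tag == wanted
    )
    # Assumed threshold: at least two children share a stream containing an object or text.
    return any(n >= 2 and "obj" in stream for stream, n in streams.items())
```

Applied to a table shaped like the one in Fig. 12 (two rows of three images, each followed by a row of three texts), all four rows produce the same stream and the table is recognized as a repeated object block.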

Fig. 6 shows the key steps of object annotation extraction. The input is the main-block-labeled analysis result 307, such as an HTML DOM tree. The single object annotation extraction unit 602 extracts the annotations of each candidate object. The common annotation extraction unit 603 then extracts the common annotations of the candidate objects. The object index construction unit 604 generates the structured object index 207, such as an XML-format index 605 of all content objects.

The single object annotation extracting unit 602 extracts nine kinds of annotations for each candidate object: the absolute address of the structured document, for example the URL of a web page; the title of the structured document, for example the title of a web page; the file name of the object; the alias; the individual annotation; the common annotation; the surrounding text; a flag indicating whether the object is located in the main text block; and a flag indicating whether the object is located in a repeating object block. The annotations are extracted according to the following rules.

File name and alias: the file name and the alias are natural annotations of the object. They are two attributes of the object and are determined by the analysis unit.

Single HTML tag: if an object and text are located within a single structured document tag (for example a single HTML tag such as <A>, <td>, or <center>), the text is regarded as the annotation of the object.

Object and text in one row: if an object and text are located in one row, for example in separate <td> cells within one <tr>, the text is regarded as the annotation of the corresponding object.

Object and text in a repeating object block: if an object and text are located in a repeating object block, the object annotation is extracted according to the repeating pattern. Taking Figure 12 as an example, the <table> node is a repeating object block, and the repeating pattern is "<tr><td><img><td><img><td><img>" (note: <txt> is treated the same as <img>). Therefore text11, text12, and text13 in row 2 are the annotations of image object 11, image object 12, and image object 13, respectively, and text21, text22, and text23 in row 4 are the annotations of image object 21, image object 22, and image object 23, respectively.

All text extracted as an annotation is marked as used and is not extracted again in subsequent processing.

Distance: if none of the preceding rules yields an annotation for an object, the annotation can be extracted by distance. The distance is calculated from the tag types of the structured document, for example the types of the HTML tags; different tags have different distance values. This is a commonly used method of retrieving object annotations. If a single HTML tag or a single row contains more than one candidate object and text, the annotation can also be extracted by distance. Annotations extracted by distance are marked as surrounding text.
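The repeating-pattern rule illustrated by Figure 12 can be sketched as follows. This is a minimal illustration, not the implementation of unit 602: it assumes the repeating object block has already been reduced to rows of cells, and the names `Cell`, `Row`, and `pair_by_repeating_pattern` are hypothetical. Each image is paired with the text in the same column of the following row, and every text consumed this way is marked as used.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical cell model: (kind, value), where kind is "img" or "txt".
Cell = Tuple[str, str]
Row = List[Cell]


def pair_by_repeating_pattern(rows: List[Row], used: Set[str]) -> Dict[str, str]:
    """Pair each image with the text in the same column of the next row."""
    annotations: Dict[str, str] = {}
    i = 0
    while i + 1 < len(rows):
        img_row, txt_row = rows[i], rows[i + 1]
        is_pair = (
            len(img_row) == len(txt_row)
            and all(kind == "img" for kind, _ in img_row)
            and all(kind == "txt" for kind, _ in txt_row)
        )
        if not is_pair:
            i += 1  # not an image row followed by a text row; try the next row
            continue
        for (_, image), (_, text) in zip(img_row, txt_row):
            # Simplification: used text is tracked by its string value.
            if text not in used:
                annotations[image] = text
                used.add(text)  # mark as used so later rules skip it
        i += 2
    return annotations


# Layout modeled on the Figure 12 example: image rows alternate with text rows.
rows = [
    [("img", "image11"), ("img", "image12"), ("img", "image13")],
    [("txt", "text11"), ("txt", "text12"), ("txt", "text13")],
    [("img", "image21"), ("img", "image22"), ("img", "image23")],
    [("txt", "text21"), ("txt", "text22"), ("txt", "text23")],
]
used_text: Set[str] = set()
pairs = pair_by_repeating_pattern(rows, used_text)
```

Passing the `used` set in and out mirrors the statement that text extracted as an annotation is not extracted again by later rules such as the distance rule.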

Optionally, the single object annotation extracting unit 602 may include a keyword extraction unit, which applies predetermined keyword analysis rules to the annotations of a multimedia object, extracts the keywords that actually describe the content of the multimedia object, and deletes invalid annotations.

The common annotation extracting unit 603 extracts the common annotations of the candidate objects. A common annotation is another kind of object annotation: it describes the content of a group of objects rather than of a single object. For example, the text inside the black ellipse in Figure 11 is a common annotation; it describes the content of all seven objects in that web page. Common annotations are extracted according to the following rules.

First, the analysis result, for example the HTML DOM tree of the main text block, is traversed. If the main text block contains candidate objects, the text that has not yet been used and marked as an object annotation is extracted. When the tag stream of a node is a repeating object pattern, all text in that node is disregarded. The extracted text is set as the common annotation of all candidate objects in the main text block.

Second, the HTML DOM tree of each repeating object block is traversed. If the block contains text, all unused text and all text outside the repeating pattern is extracted and set as the common annotation of the candidate objects in the repeating pattern of that block. If the block contains no text, the text preceding the block is regarded as the common annotation, unless the preceding node is another repeating object block, a repeating object pattern, a multi-node, or a candidate object. A multi-node is an HTML DOM tree node that contains both candidate objects and text.
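The second rule, for repeating object blocks, can be sketched as below. This is a simplified illustration under stated assumptions rather than the method of unit 603: the DOM is flattened to a list of sibling nodes, and the `Node` class, its `kind` labels, and the function `common_annotation` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    # Hypothetical kinds: "text", "repeat_block", "repeat_pattern", "multi", "object"
    kind: str
    text: str = ""                                   # content of a "text" node
    texts: List[str] = field(default_factory=list)   # texts inside a repeating block


# Preceding nodes of these kinds prevent the fallback to preceding text.
BLOCKING_KINDS = {"repeat_block", "repeat_pattern", "multi", "object"}


def common_annotation(siblings: List[Node], index: int, used: Set[str]) -> Optional[str]:
    """Return the common annotation for the repeating block at siblings[index]."""
    block = siblings[index]
    if block.texts:
        # The block has text: take whatever was not already used as an
        # individual annotation.
        unused = [t for t in block.texts if t not in used]
        used.update(unused)
        return " ".join(unused) or None
    if index == 0:
        return None
    previous = siblings[index - 1]
    if previous.kind in BLOCKING_KINDS:
        return None
    if not previous.text or previous.text in used:
        return None
    used.add(previous.text)
    return previous.text


# A heading directly before a text-free repeating block becomes its common annotation.
case_a = common_annotation(
    [Node("text", "Photos from the exhibition"), Node("repeat_block")], 1, set()
)
# A candidate object directly before the block blocks the fallback.
case_b = common_annotation([Node("object"), Node("repeat_block")], 1, set())
# Text inside the block that was not used as an individual annotation is kept.
case_c = common_annotation(
    [Node("repeat_block", texts=["Gallery", "caption1"])], 0, {"caption1"}
)
```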

At this point, all annotations of the candidate objects have been extracted. The object index construction unit 604 now generates the structured object index 207, for example an XML format index of all multimedia objects in the input structured document 201. Figure 13 shows an XML format object index (one example of the structured object index 207). The annotations of all objects are recorded between the tags <WebPage> and </WebPage>. <Head> records information about the whole page, including the URL of the web page, the local path of the page, the HTML title of the page, and the total number of content objects. <body> contains a list of object tags, each recording the information of one object: the file name of the object, the absolute URL of the object, the object size, the alias, the individual annotation, the common annotation, the surrounding text, and flags indicating whether the object is in a main block. When the object is in the main text block, the item <IsInMainTextBlock> is set to true; when the object is in a repeating object block, the item <IsInRepeatingObjectBlock> is set to true.
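An index of this shape could be assembled with Python's standard `xml.etree.ElementTree` module, as in the sketch below. Only <WebPage>, <Head>, <body>, <IsInMainTextBlock>, and <IsInRepeatingObjectBlock> are named in the description; every other element name (for example `Object`, `ObjectCount`, `FileName`), the function `build_object_index`, and the sample values are assumptions made for illustration, since Figure 13 is not reproduced here.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_object_index(page: Dict[str, str], objects: List[Dict[str, str]]) -> ET.Element:
    """Build a <WebPage> element holding page-level and per-object records."""
    root = ET.Element("WebPage")

    # Page-level information; child element names are hypothetical.
    head = ET.SubElement(root, "Head")
    for name in ("URL", "LocalPath", "Title"):
        ET.SubElement(head, name).text = page[name]
    ET.SubElement(head, "ObjectCount").text = str(len(objects))

    # One record per content object.
    body = ET.SubElement(root, "body")
    for obj in objects:
        entry = ET.SubElement(body, "Object")
        for name, value in obj.items():
            ET.SubElement(entry, name).text = value
    return root


page = {
    "URL": "http://example.com/gallery.html",
    "LocalPath": "pages/gallery.html",
    "Title": "Gallery",
}
objects = [
    {
        "FileName": "image11.jpg",
        "URL": "http://example.com/img/image11.jpg",
        "Size": "20480",
        "Alias": "sunset",
        "Annotation": "text11",
        "CommonAnnotation": "Photos from the exhibition",
        "SurroundingText": "",
        "IsInMainTextBlock": "true",
        "IsInRepeatingObjectBlock": "true",
    }
]
root = build_object_index(page, objects)
xml_text = ET.tostring(root, encoding="unicode")
```

Storing the two block flags as separate elements keeps the index searchable by location as well as by annotation text.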

Figure 7 shows the key steps of retrieving target objects using the object index. The inputs are the structured object index 207, for example an XML format object index, and the retrieval request 209, for example keywords. The request conversion unit 703 converts the input retrieval request into another format, for example by looking up in a dictionary the words associated with the input keywords. The target object recognition unit 704 determines whether each object is a target object. The results are recorded in the target object list 705 and returned to the user.
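The two retrieval steps can be sketched as follows. The description does not specify how unit 704 decides that an object is a target, so the substring match used here is an assumption, as are the names `expand_request`, `find_target_objects`, `SEARCHED_FIELDS`, and the field names of the index records.

```python
from typing import Dict, List, Set

# Hypothetical annotation fields searched for each indexed object.
SEARCHED_FIELDS = ("FileName", "Alias", "Annotation", "CommonAnnotation", "SurroundingText")


def expand_request(keywords: List[str], dictionary: Dict[str, List[str]]) -> Set[str]:
    """Request conversion: add dictionary words associated with each keyword."""
    terms: Set[str] = set()
    for keyword in keywords:
        key = keyword.lower()
        terms.add(key)
        terms.update(word.lower() for word in dictionary.get(key, []))
    return terms


def find_target_objects(index: List[Dict[str, str]], terms: Set[str]) -> List[str]:
    """Target recognition: keep objects whose annotations contain any term."""
    targets: List[str] = []
    for obj in index:
        haystack = " ".join(obj.get(name, "") for name in SEARCHED_FIELDS).lower()
        # Simplification: plain substring matching, with no ranking or word boundaries.
        if any(term in haystack for term in terms):
            targets.append(obj["FileName"])
    return targets


dictionary = {"car": ["automobile"]}
index = [
    {"FileName": "a.jpg", "Annotation": "A red automobile"},
    {"FileName": "b.jpg", "Annotation": "Mountain lake"},
]
terms = expand_request(["Car"], dictionary)
target_list = find_target_objects(index, terms)
```

A fuller implementation would likely weight the annotation kinds differently, for example trusting an individual annotation more than surrounding text, but the description leaves that scoring open.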

The present invention has been described above with reference to preferred embodiments. It should be understood, however, that the scope of the present invention is not limited to those embodiments. The device and method of the present invention can be applied to any structured document, including but not limited to web pages and XML files, and can be used to retrieve various multimedia objects, including but not limited to images, animations, audio, video, and tables. Furthermore, the present invention is not limited to the details described above; various changes and modifications may be made within the scope defined by the appended claims.