JP4307287B2


Info

Publication number: JP4307287B2
Application number: JP2004046611A
Authority: JP
Inventors: 泰三亀代; 敬平野
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2004-02-23
Filing date: 2004-02-23
Publication date: 2009-08-05
Anticipated expiration: 2024-02-23
Also published as: JP2005235099A

Description

The present invention relates to a metadata extraction apparatus that extracts useful information from electronic documents.

In recent years, against a background of higher-performance computers, larger disk capacities, and improved network infrastructure, document management methods have become increasingly common in which documents, drawings, and the like created with word processing software, CAD (Computer Aided Design) software, and so on are stored on a file server and shared so that they can be searched or browsed. Furthermore, as documents are increasingly digitized and shared and their number grows, there is rising demand to extract and make use of useful information from the large volume of documents accumulated on servers.

Methods of extracting useful information from a document include, for example, a method that structures an unstructured document at registration time and extracts the document title and chapter and section titles, and a method that extracts information written in a specific format from a document.

For example, Patent Document 1 discloses a method of structuring an unstructured document and storing it when it is registered on a server. Specifically, text areas, table areas, and image areas are detected in a document image created by scanning a paper document. From a text area, for example, the text is extracted, and the document title and chapter and section titles are extracted using the layout information of the character strings. From a table area, numerical data is extracted by recognizing the characters in each cell.

Patent Document 2 discloses a method of extracting information in a specific format from a document as useful information. Specifically, the items to be extracted and the search range are registered in a definition file in advance, characters are extracted from the input document according to this definition file, and the result is output in the document description language SGML (Standard Generalized Markup Language). For example, to extract a title, the definition file is set up so that the character string following "タイトル：" ("Title:") up to the next line break is extracted as the "title", and character extraction is performed on that basis.

Patent Document 1: JP 08-212293 A
Patent Document 2: JP 2001-290801 A

Because the invention described in Patent Document 1 is configured as described above, the information that can be extracted from an input document, for example the document title and chapter and section titles from a text area, is limited to information that can be extracted based solely on the position or size of character strings in the document image. Consequently, information about the content of the document, such as relationships between sentences or between words, cannot be extracted, so the useful knowledge that can be extracted from a document is limited.

In the invention described in Patent Document 2, information is extracted according to a definition file prepared in advance for a fixed-format document. For non-fixed-format documents, a definition file with a matching extraction format therefore has to be created for each document, which is a problem.

The present invention has been made to solve the above problems. Its object is to provide a metadata extraction apparatus that can extract useful knowledge from a document, such as relationships between sentences and between words, and that can extract such knowledge from non-fixed-format documents without a definition file having to be re-created.

The metadata extraction apparatus comprises: document acquisition means for acquiring an electronic document to be processed; element extraction means for extracting, from the electronic document acquired by the document acquisition means, the elements that make up the sentences written in the electronic document, together with position data indicating the relative position of each element within the electronic document; attribute definition storage means for storing attribute definitions, each of which defines an attribute of an element sequence consisting of one or more elements of a sentence in association with the parts of speech of the elements in that sequence; attribute assignment means for assigning attributes to element sequences by referring to the attribute definitions; relevance calculation means for calculating, among the element sequences to which the attribute assignment means has assigned attributes, a relevance indicating how closely the element sequences of interest are related; and relevance extraction information generation means for generating first relevance extraction information by attaching to each element sequence, as an identifier, the attribute assigned by the attribute assignment means, grouping element sequences according to the relevance calculated by the relevance calculation means, and attaching an identifier corresponding to each group.

The apparatus is configured to comprise: document acquisition means for acquiring an electronic document to be processed; element extraction means for extracting, from the acquired electronic document, the elements that make up the sentences written in it, together with position data indicating the relative position of each element within the electronic document; attribute definition storage means for storing attribute definitions, each of which defines an attribute of an element sequence consisting of one or more elements of a sentence in association with the parts of speech of the elements in that sequence; attribute assignment means for assigning attributes to element sequences by referring to the attribute definitions; relevance calculation means for calculating, among the element sequences to which attributes have been assigned, a relevance indicating how closely the element sequences of interest are related; and relevance extraction information generation means for generating first relevance extraction information by attaching the assigned attribute to each element sequence as an identifier, grouping element sequences according to the calculated relevance, and attaching an identifier corresponding to each group. As a result, useful knowledge such as relationships between sentences and between words can be extracted from a document, and useful knowledge can be extracted from non-fixed-format documents without a definition file having to be re-created.

Embodiment 1
FIG. 1 is a block diagram of a metadata extraction apparatus according to Embodiment 1 of the present invention. As shown in the figure, the metadata extraction apparatus 10 comprises document acquisition means 11, character extraction means 12, attribute assignment means (element extraction means, attribute assignment means) 13, inter-attribute relevance calculation means (relevance calculation means) 14, attribute structuring means (first relevance extraction information generation means) 15, output means 16, an attribute definition DB (attribute definition storage means) 17, and an inter-attribute relevance DB 18.

The document acquisition means 11 acquires a document input by the user. The character extraction means 12 extracts the characters that make up the sentences written in the acquired document, together with the position coordinates of each character. The attribute assignment means 13 performs morphological analysis on the characters extracted by the character extraction means 12 and assigns attributes to character strings by referring to the attribute definition DB 17. The inter-attribute relevance calculation means 14 calculates a relevance indicating how closely related in meaning the character strings to which the attribute assignment means 13 has assigned attributes are. The attribute structuring means 15 creates a structured document by grouping highly related character strings on the basis of the relevance calculated by the inter-attribute relevance calculation means 14. The output means 16 outputs the created structured document in file form. The attribute definition DB 17 holds attribute definitions in which attributes are defined in association with the parts of speech of the characters making up a character string, so that attributes can be assigned to the character strings of a sentence. The inter-attribute relevance DB 18 holds a table in which the relevance between attributes is defined in advance as qualitative values.

The document acquisition means 11, character extraction means 12, attribute assignment means 13, inter-attribute relevance calculation means 14, attribute structuring means 15, and output means 16 in FIG. 1 represent a division, made for convenience, of the central processing unit of the metadata extraction apparatus 10 according to the modules of the program that controls the operation of that central processing unit.

FIG. 2 is a flowchart showing the operation of the metadata extraction apparatus 10 of FIG. 1. The operation of the metadata extraction apparatus 10 is described with reference to this figure.
First, the document acquisition means 11 acquires a computer-readable document specified by the user (step ST100). Documents that can be acquired include, for example, a document file created with computer-processable word processing software or the like, a drawing file created with CAD software, and a file in PDF (Portable Document Format). FIG. 3 shows an example of a document acquired by the document acquisition means 11. Here, a document file created with word processing software is assumed.

Next, the character extraction means 12 performs character extraction (step ST200). From the document file acquired by the document acquisition means 11, the character extraction means 12 extracts the character code of each element written in it, such as letters and digits, together with its position coordinates in the document. For formats whose specification is public, such as PDF, the character codes and position coordinates can be extracted by analyzing the document file according to that specification. For documents created with word processing software or the like whose specification is not public, pseudo-printing is performed from the application that created the document; that is, a PDF file is created in which the document, images, and so on are described in a page description language that a printer can interpret, and this file is analyzed to extract the character codes and position coordinates. The character extraction means 12 supplies the extracted character codes and position coordinates to the attribute assignment means 13.

The attribute assignment means 13 assigns attributes to the character strings written in the document on the basis of the acquired character codes and the attribute definitions held in the attribute definition DB 17 (step ST300). Here, an attribute is information indicating the significance of a character string written in the document: for example, a named entity such as a company name, person name, or place name, or information that can be itemized, such as a date, length, weight, or state. The attribute assignment means 13 refers to the attribute definition DB 17 and assigns the attribute corresponding to the character codes extracted by the character extraction means 12.

FIG. 4 shows an example of the attribute definitions held in the attribute definition DB 17. The figure describes the attributes "date", "location", "organization name", and "president". For example, the attribute "date" is defined as a character string formed by the combination of a 2- to 4-digit number, a symbol, a 1- to 2-digit number, a symbol, and a 1- to 2-digit number, or by the combination of a 2- to 4-digit number, "年" (year), a 1- to 2-digit number, "月" (month), a 1- to 2-digit number, and "日" (day). The attribute "president" is defined as a character string whose part of speech and additional information are [noun - proper noun - person name] and that lies near the character string "社長" (president); details are given later.

FIG. 5 is a flowchart showing the operation of the attribute assignment means 13. The specific operation of the attribute assignment means 13 is described with reference to this figure.
The attribute assignment means 13 performs morphological analysis on the character codes acquired from the character extraction means 12 (step ST310). Morphological analysis is a well-known technique in Japanese text analysis, so a detailed description is omitted; it is a process that breaks natural-language text down into its smallest meaningful units. Through morphological analysis, the attribute assignment means 13 divides the text into character strings of the smallest meaningful unit (hereinafter called minimum character strings) and assigns a part of speech to each of them.

FIG. 6 shows the result of the morphological analysis performed by the attribute assignment means 13 on the document shown in FIG. 3. As shown in the figure, the analysis attaches to each minimum character string a part of speech and, where needed, additional information 1 to 3. Additional information 1 indicates the type of noun when the part of speech is a noun, for example common noun or proper noun. Additional information 2 indicates the type of proper noun when the part of speech is a noun and additional information 1 is proper noun, for example organization, region, or person name. Additional information 3 is attached where further detail is needed; for example, when additional information 2 is person name, "family name" or "given name" is attached.

The attribute assignment means 13 also calculates the upper-left and lower-right coordinates of each minimum character string from the position coordinates of the elements extracted by the character extraction means 12 in step ST200 of FIG. 2. FIG. 6 shows these coordinate values. The coordinate values indicate distances from the origin, taking, for example, the horizontal direction of the document as the x axis and the vertical direction as the y axis. Each minimum character string fits inside the rectangle whose diagonal runs from its upper-left coordinate to its lower-right coordinate.
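The rectangle calculation described above can be sketched as follows. This is only an illustrative sketch: the `Box` type, the `enclosing_box` function, and the assumption that each character is supplied as its own rectangle with y increasing downward are not taken from the source, which states only that per-character position coordinates are available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle; (left, top) is upper-left, (right, bottom) is lower-right.

    Assumes y grows downward, so top <= bottom (an assumption, not stated in the source).
    """
    left: float
    top: float
    right: float
    bottom: float


def enclosing_box(char_boxes: list[Box]) -> Box:
    """Return the smallest rectangle containing every character of one minimum string."""
    if not char_boxes:
        raise ValueError("a minimum character string has at least one character")
    return Box(
        left=min(b.left for b in char_boxes),
        top=min(b.top for b in char_boxes),
        right=max(b.right for b in char_boxes),
        bottom=max(b.bottom for b in char_boxes),
    )


# Hypothetical coordinates for a three-character minimum string.
chars = [Box(10, 20, 20, 30), Box(20, 20, 30, 30), Box(30, 21, 40, 31)]
string_box = enclosing_box(chars)
```

Here `string_box` holds the upper-left and lower-right corners that would be recorded for the minimum character string.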

Next, the attribute assignment means 13 performs attribute pattern matching (step ST320). Specifically, the morphological analysis result shown in FIG. 6 is matched against the attribute definitions of FIG. 4, and each attribute is assigned to the character strings whose part of speech and additional information 1 to 3 match its definition. For example, to assign the attribute "date", character strings that match one of the combinations defining "date" in FIG. 4 are extracted from FIG. 6. First, a character string that is a 2- to 4-digit number is searched for. The first four characters of the text in FIG. 6 fit this pattern, so "2003" is extracted. Next, it is determined whether the character following "2003" is a symbol or "年" (year). In FIG. 6, "年" follows "2003", so the "date" match has succeeded up to this point and "年" is extracted. The attribute assignment means 13 continues in the same way, and once it has extracted up to "日" (day), it assigns the attribute "date" to the character string "2003年9月16日" (September 16, 2003).
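The step-by-step date matching described above could be sketched as follows. The `Morpheme` type, the part-of-speech labels "number" and "symbol", and the assumption that the analyzer emits each run of digits as a single token are illustrative choices, not details given in the source.

```python
from typing import NamedTuple, Optional


class Morpheme(NamedTuple):
    surface: str  # text of the minimum character string
    pos: str      # part of speech; the label names here are assumed


def _is_number(m: Morpheme, lo: int, hi: int) -> bool:
    """True if the token is all digits and has between lo and hi digits."""
    return m.surface.isdigit() and lo <= len(m.surface) <= hi


def match_date(tokens: list[Morpheme], start: int) -> Optional[int]:
    """Return the exclusive end index of a "date" starting at `start`, or None."""
    window = tokens[start:start + 6]
    if len(window) < 5:
        return None
    year, sep1, month, sep2, day = window[:5]
    if not (_is_number(year, 2, 4) and _is_number(month, 1, 2) and _is_number(day, 1, 2)):
        return None
    # Combination 2: number, 年, number, 月, number, 日
    if len(window) == 6 and (sep1.surface, sep2.surface, window[5].surface) == ("年", "月", "日"):
        return start + 6
    # Combination 1: number, symbol, number, symbol, number
    if sep1.pos == "symbol" and sep2.pos == "symbol":
        return start + 5
    return None


def assign_dates(tokens: list[Morpheme]) -> list[str]:
    """Scan left to right and collect every character string matching "date"."""
    found, i = [], 0
    while i < len(tokens):
        end = match_date(tokens, i)
        if end is None:
            i += 1
        else:
            found.append("".join(t.surface for t in tokens[i:end]))
            i = end
    return found


M = Morpheme
tokens = [M("2003", "number"), M("年", "noun"), M("9", "number"), M("月", "noun"),
          M("16", "number"), M("日", "noun"), M("○×電機", "noun")]
dates = assign_dates(tokens)
```

With the tokens shown, `dates` contains the single string "2003年9月16日", mirroring the example in the text.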

The attribute assignment means 13 likewise assigns the other attributes to character strings according to the attribute definitions of FIG. 4. For example, to assign the attribute "organization name", it extracts the character strings whose part of speech, additional information 1, and additional information 2 are [noun - proper noun - organization]. FIGS. 7 and 8 illustrate the character strings in the document of FIG. 3 to which the attribute assignment means 13 has assigned attributes. As shown in FIG. 8, the attribute "date" is assigned to the character string "2003年9月16日", the attribute "organization name" to "○×電機" (○× Electric), the attribute "location" to "東京都千代田区丸の内1-1-1" (1-1-1 Marunouchi, Chiyoda-ku, Tokyo), and the attribute "president" to "○田×男". Hereinafter, a character string to which an attribute has been assigned is called an attribute value. FIG. 8 also shows the upper-left and lower-right coordinates of each extracted attribute value. The attribute assignment means 13 supplies the assignment result to the inter-attribute relevance calculation means 14 (step ST330).
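Assignment by part-of-speech tags, as in the "organization name" example, might look like the sketch below. The `TAG_RULES` dictionary is a hypothetical in-memory form of one FIG. 4 rule, and the English tag labels are assumed; the source does not specify how the attribute definition DB 17 stores its definitions.

```python
from typing import NamedTuple


class Morpheme(NamedTuple):
    surface: str
    pos: str          # part of speech
    info1: str = ""   # additional information 1 (type of noun)
    info2: str = ""   # additional information 2 (type of proper noun)


# Hypothetical encoding: attribute name -> required (pos, info1, info2) triple.
TAG_RULES = {
    "organization name": ("noun", "proper noun", "organization"),
}


def assign_by_tags(tokens: list[Morpheme]) -> list[tuple[str, str]]:
    """Return (attribute, character string) for every token whose tags match a rule."""
    assigned = []
    for tok in tokens:
        for attribute, triple in TAG_RULES.items():
            if (tok.pos, tok.info1, tok.info2) == triple:
                assigned.append((attribute, tok.surface))
    return assigned


tokens = [
    Morpheme("○×電機", "noun", "proper noun", "organization"),
    Morpheme("は", "particle"),
    Morpheme("本社", "noun", "common noun"),
]
assigned = assign_by_tags(tokens)
```

Rules that also depend on nearby strings, such as the "president" definition, would need more than a tag comparison and are not covered by this sketch.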

Next, the inter-attribute relevance calculation means 14 calculates the relevance between attribute values. The relevance is a value indicating how closely two attribute values of interest are related in terms of meaning and position, and it is calculated on the basis of the relevance between their assigned attributes. FIG. 9 is an inter-attribute relevance table describing the relevance between the attributes defined in FIG. 4. Each relevance takes a value from 0 to 1, and a larger value means a higher relevance. For example, as shown in FIG. 9, the relevance between "organization name" and each of "location" and "president" is 0.8, whereas the relevance between "date" and each of "organization name", "location", and "president", and the relevance between "location" and "president", are set low at 0.3. The inter-attribute relevance table is created in advance by the user and stored in the inter-attribute relevance DB 18; here it is created heuristically with reference to Japanese grammar and the content of the attributes.
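A symmetric lookup for the values quoted from FIG. 9 could be sketched as follows. The dictionary layout and the function name `table_relevance` are assumptions; only the six values listed in the text are included, and the relevance of an attribute with itself is not stated in the text, so it is deliberately left undefined.

```python
# Values quoted in the text for FIG. 9; unordered pairs are stored as frozensets
# so that the lookup does not depend on argument order.
RELEVANCE_TABLE = {
    frozenset({"organization name", "location"}): 0.8,
    frozenset({"organization name", "president"}): 0.8,
    frozenset({"date", "organization name"}): 0.3,
    frozenset({"date", "location"}): 0.3,
    frozenset({"date", "president"}): 0.3,
    frozenset({"location", "president"}): 0.3,
}


def table_relevance(attr_a: str, attr_b: str) -> float:
    """Look up the inter-attribute relevance; raises KeyError for unlisted pairs."""
    return RELEVANCE_TABLE[frozenset({attr_a, attr_b})]


org_loc = table_relevance("location", "organization name")
date_pres = table_relevance("president", "date")
```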

FIG. 10 is a flowchart showing the operation of the inter-attribute relevance calculation means 14. The procedure for calculating the relevance between document attributes is described with reference to this figure.
First, the inter-attribute relevance calculation means 14 obtains the number of attribute values, that is, the number of character strings to which the attribute assignment means 13 has assigned attributes, and assigns it to the variable N (step ST410). Here, attributes have been assigned to five character strings as shown in FIG. 8, so N = 5. Next, 1 is assigned to the variable i (step ST420), and the relevance with each attribute value is calculated using the following expressions while the variable j is varied from j = i + 1 to j = N (step ST430).

$$F(P_i, P_j) = a_1 f_1(P_i, P_j) + a_2 f_2(P_i, P_j) + a_3 f_3(P_i, P_j) \tag{1}$$

$$a_1 + a_2 + a_3 = 1 \tag{2}$$

$$f_2(P_i, P_j) = \begin{cases} \dfrac{1}{\log(Z_i Z_j + 1)} & (Z_i Z_j > 0) \\ 1 & (Z_i Z_j = 0) \end{cases} \tag{3}$$

$$f_3(P_i, P_j) = \begin{cases} 1 & \text{(within the same sentence)} \\ 0 & \text{(otherwise)} \end{cases} \tag{4}$$

Here, $F(P_i, P_j)$ is the relevance between attribute value $P_i$ and attribute value $P_j$, and $f_1(P_i, P_j)$ is the value from the inter-attribute relevance table shown in FIG. 9. $f_2(P_i, P_j)$ is an evaluation value based on the distance between the position coordinates of the two attribute values of interest and is given by Expression (3). In Expression (3), $Z_i Z_j$ is the distance between attribute value $P_i$ and attribute value $P_j$, namely the shortest distance between the rectangles bounded by the upper-left and lower-right coordinates shown in FIG. 8. $a_1$, $a_2$, and $a_3$ are arbitrary values set in advance so as to satisfy Expression (2).

$f_3(P_i, P_j)$ is a context-dependent evaluation value and is given by Expression (4). That is, the inter-attribute relevance calculation means 14 determines whether the two attribute values of interest are in the same sentence and assigns 1 if they are and 0 if they are not. Whether they are in the same sentence is determined from the presence or absence of particles and sentence-ending marks in the character string lying between the two attribute values. If that character string contains a particle and contains no symbol marking the end of a sentence, such as a Japanese full stop or a period, the two attribute values are judged to be in the same sentence. For example, in FIG. 7 there is no particle between the line containing attribute value 1, "2003年9月16日", and the line containing attribute value 2, "○×電機" (○× Electric), so they are judged not to be in the same sentence.
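The same-sentence test could be sketched as below. The `Morpheme` type, the part-of-speech label "particle", and the particular set of sentence-ending characters are assumptions made for illustration; the source names only the Japanese full stop and the period as examples.

```python
from typing import NamedTuple


class Morpheme(NamedTuple):
    surface: str
    pos: str  # part of speech; "particle" is an assumed label


# Assumed set of sentence-ending marks (Japanese full stop and periods).
SENTENCE_END = {"。", "．", "."}


def same_sentence_score(between: list[Morpheme]) -> int:
    """Return 1 if the tokens between two attribute values contain a particle
    and no sentence-ending mark, otherwise 0."""
    has_particle = any(m.pos == "particle" for m in between)
    has_end_mark = any(m.surface in SENTENCE_END for m in between)
    return 1 if has_particle and not has_end_mark else 0


M = Morpheme
linked = same_sentence_score([M("の", "particle"), M("本社", "noun")])
separate_lines = same_sentence_score([])
across_stop = same_sentence_score([M("は", "particle"), M("。", "symbol")])
```

An empty list of intervening tokens yields 0, which matches the FIG. 7 example in which no particle lies between the two lines.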

From Expression (1), the relevance $F(P_i, P_j)$ becomes higher the larger the inter-attribute relevance between $P_i$ and $P_j$ given in the inter-attribute relevance table, the smaller the distance between their position coordinates, and additionally when they are in the same sentence.
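Expressions (1) to (4) can be put together as in the following sketch. The function names, the tuple layout `(left, top, right, bottom)` for rectangles, and the use of the natural logarithm are assumptions; the source does not state the base of the logarithm or the unit of distance. Note that with Expression (3) as written, very small positive distances give $f_2$ values above 1.

```python
import math

Rect = tuple[float, float, float, float]  # (left, top, right, bottom), assumed layout


def rect_distance(a: Rect, b: Rect) -> float:
    """Shortest distance between two axis-aligned rectangles (0 if they touch or overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)  # horizontal gap
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)  # vertical gap
    return math.hypot(dx, dy)


def distance_score(z: float) -> float:
    """Expression (3); the natural logarithm is an assumption."""
    return 1.0 if z == 0 else 1.0 / math.log(z + 1.0)


def relevance(table_value: float, z: float, same_sentence: bool,
              weights: tuple[float, float, float] = (0.4, 0.4, 0.2)) -> float:
    """Expression (1) with weights satisfying Expression (2)."""
    a1, a2, a3 = weights
    if abs(a1 + a2 + a3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1 (Expression (2))")
    f3 = 1.0 if same_sentence else 0.0  # Expression (4)
    return a1 * table_value + a2 * distance_score(z) + a3 * f3


# Hypothetical rectangles, 3 apart horizontally and 4 apart vertically.
gap = rect_distance((0, 0, 10, 10), (13, 14, 20, 20))
touching = relevance(0.8, 0.0, True)
```

The default weights are the ones used in the worked example of this embodiment ($a_1 = a_2 = 0.4$, $a_3 = 0.2$).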

After calculating the relevance F, the inter-attribute relevance calculation means 14 determines whether the calculated relevance F is higher than a threshold (step ST440), and if so, performs grouping (step ST450). For example, let the threshold be 0.6. When i = 1, the relevance F is calculated for each of j = i + 1 to N, and any pair $P_i$, $P_j$ whose relevance F is 0.6 or more is deemed to belong to the same attribute group. It is then determined whether i = N (step ST460). If i = N, processing ends; if i ≠ N, i is incremented (step ST470) and processing returns to step ST430.
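The loop over steps ST420 to ST470 could be sketched as follows. The function name `find_related_pairs` and the `score` callback are hypothetical. The relevance values in `STATED` are the ones quoted in the text for FIG. 14; pairs whose values are not quoted there are given a placeholder of 0.0, which is purely an assumption so that the example runs. The comparison uses "0.6 or more", following the worked example.

```python
from typing import Callable


def find_related_pairs(n: int, score: Callable[[int, int], float],
                       threshold: float = 0.6) -> list[tuple[int, int]]:
    """Return every pair (i, j), i < j, whose relevance reaches the threshold."""
    pairs = []
    for i in range(1, n + 1):          # ST420 / ST470: i = 1 .. N
        for j in range(i + 1, n + 1):  # ST430: j = i + 1 .. N
            if score(i, j) >= threshold:   # ST440
                pairs.append((i, j))       # ST450
    return pairs


# Relevance values quoted in the text for FIG. 14.
STATED = {(1, 2): 0.42, (1, 3): 0.33, (1, 4): 0.26, (1, 5): 0.29,
          (2, 3): 0.62, (4, 5): 0.67}


def score(i: int, j: int) -> float:
    # Placeholder 0.0 for pairs not quoted in the text (an assumption).
    return STATED.get((i, j), 0.0)


pairs = find_related_pairs(5, score)
```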

For example, the values of $f_1$ to $f_3$ and of the relevance F between attribute values are calculated with $a_1 = a_2 = 0.4$ and $a_3 = 0.2$. FIGS. 11 to 14 show the calculation results: FIG. 11 shows the values of $f_1$, FIG. 12 the values of $f_2$, FIG. 13 the values of $f_3$, and FIG. 14 the values of the relevance F. Attribute values 1 to 5 in FIGS. 11 to 14 correspond to attribute values 1 to 5 in FIG. 7.

In FIG. 14, the relevance values between i = 1 (attribute value 1) and j = 2 to 5 (attribute values 2 to 5) are 0.42, 0.33, 0.26, and 0.29 respectively. All are below 0.6, so no grouping is performed. For i = 2 (attribute value 2) and j = 3 (attribute value 3), FIG. 14 gives a relevance of F = 0.62, which is at or above the threshold of 0.6, so grouping is performed: the inter-attribute relevance calculation means 14 retains information indicating that attribute value 2 and attribute value 3 belong to the same attribute group. Likewise, FIG. 14 gives a relevance of F = 0.67 between attribute value 4 and attribute value 5, which is at or above the threshold of 0.6, so information indicating that attribute value 4 and attribute value 5 belong to the same group is also retained.

Next, the attribute structuring means 15 performs attribute structuring (step ST500). Processing here is based on the output of the attribute assignment means 13, that is, the results shown in FIGS. 7 and 8, and on the calculation result of the inter-attribute relevance calculation means 14, that is, the result shown in FIG. 14. Attribute values with high relevance to one another are enclosed in a common tag (identifier), for example a "group" tag, and output. As described above, the inter-attribute relevance calculation means 14 has determined that attribute values 2 and 3 form one group and attribute values 4 and 5 form another. Because attribute value 2 and attribute value 4 are the same value ("○×電機", ○× Electric), the attribute structuring means 15 determines that attribute value 2, attribute value 3, and attribute value 5 belong to the same group and performs structuring on these three attribute values. Here, each attribute value is enclosed in the tag corresponding to its attribute, and a document in file form is created.
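Merging pairwise groups that share an identical attribute value could be sketched with a small union-find structure, as below. The source does not say how the attribute structuring means 15 performs the merge, so the union-find approach, the function name `merge_groups`, and the rule that two attribute values are identical when both attribute and character string match are all assumptions.

```python
AttributeValue = tuple[str, str]  # (attribute, character string)


def merge_groups(values: dict[int, AttributeValue],
                 pairs: list[tuple[int, int]]) -> list[list[AttributeValue]]:
    """Merge related pairs, joining groups that contain an identical attribute value."""
    parent = {i: i for i in values}

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: int, b: int) -> None:
        parent[find(a)] = find(b)

    for a, b in pairs:                 # pairs judged related by relevance
        union(a, b)
    first_seen: dict[AttributeValue, int] = {}
    for i, value in values.items():    # identical values join their groups
        if value in first_seen:
            union(i, first_seen[value])
        else:
            first_seen[value] = i

    groups: dict[int, list[AttributeValue]] = {}
    for i in sorted(values):
        members = groups.setdefault(find(i), [])
        if values[i] not in members:   # list a duplicated value only once
            members.append(values[i])
    return [m for m in groups.values() if len(m) > 1]


values = {
    1: ("date", "2003年9月16日"),
    2: ("organization name", "○×電機"),
    3: ("location", "東京都千代田区丸の内1-1-1"),
    4: ("organization name", "○×電機"),
    5: ("president", "○田×男"),
}
groups = merge_groups(values, [(2, 3), (4, 5)])
```

With the five attribute values of the example, the result is a single group holding the organization name, the location, and the president, while the date stays ungrouped.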

FIG. 15 shows the result of the document attribute structuring process performed by the attribute structuring means 15. As shown in the figure, each attribute value is enclosed in a tag indicating its attribute, and the values that the inter-attribute relevance calculation means 14 determined to be in the same group, "○×電機" (○× Electric), "東京都千代田区丸の内１−１−１" (1-1-1 Marunouchi, Chiyoda-ku, Tokyo), and "○田×男" (a personal name), are further enclosed in a "group" tag. "２００３年９月１６日" (September 16, 2003), which is not grouped, is not enclosed in a "group" tag. The attribute structuring means 15 supplies the grouped result to the output means 16 in file format. The output means 16 registers the received result in a metadata storage area in file format, or outputs it to a document storage server (not shown) (step ST600).
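
The tagged output can be sketched with Python's standard `xml.etree.ElementTree` module. FIG. 15 is not reproduced here, so the layout below is an assumption: the root tag `metadata` and the English tag names `organization`, `location`, `president`, and `date` are hypothetical stand-ins for the attribute tags, and only the `group` tag is named in the text.

```python
import xml.etree.ElementTree as ET


def build_metadata(grouped, ungrouped):
    """Wrap each value in its attribute tag; grouped values also go in <group>."""
    root = ET.Element("metadata")           # hypothetical root tag
    group = ET.SubElement(root, "group")    # the "group" tag from the text
    for name, value in grouped:
        ET.SubElement(group, name).text = value
    for name, value in ungrouped:
        ET.SubElement(root, name).text = value  # stays outside <group>
    return root


root = build_metadata(
    grouped=[
        ("organization", "○×電機"),
        ("location", "東京都千代田区丸の内１−１−１"),
        ("president", "○田×男"),
    ],
    ungrouped=[("date", "２００３年９月１６日")],
)
print(ET.tostring(root, encoding="unicode"))
```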

As described above, according to Embodiment 1, the attribute assignment means 13 assigns attributes to character strings in the document in accordance with attribute definitions that associate attributes with parts of speech, the inter-attribute relevance calculation means 14 determines the relevance between attribute values and groups them, and the attribute structuring means 15 structures the result and outputs it as metadata. Metadata can therefore be extracted not only from fixed-format documents but also from free-format documents. In addition, grouping highly related attribute values expresses the relationships among them, so the extracted metadata is convenient for users in searching, browsing, and similar tasks.

Embodiment 2.
Embodiment 1 assigns attributes within a single document and determines the relevance between its attribute values. Embodiment 2 additionally collates those attributes against the attributes of character strings in other documents or in a database, and places highly related attribute values in the same group.

FIG. 16 is a block diagram of a metadata extraction device according to Embodiment 2 of the present invention. As shown in the figure, this metadata extraction device 10' includes an attribute addition means 19 and a document DB 20 in addition to the configuration of Embodiment 1 shown in FIG. 1. The attribute addition means 19 compares the attribute groups contained in the document acquired by the document acquisition means 11 with the attribute groups contained in other documents and, when another document contains the same attribute group, imports attribute values from it. The document DB 20 holds documents and their metadata.

FIG. 17 is a flowchart showing the operation of the metadata extraction device according to Embodiment 2 of the present invention. Compared with the flowchart of Embodiment 1 shown in FIG. 2, an attribute addition process (step ST600) is inserted between steps ST500 and ST600, and the metadata output process becomes step ST700. The operation of the metadata extraction device is described below with reference to this figure.

The document acquisition means 11 acquires a document (step ST100). As in Embodiment 1, the document shown in FIG. 3 is assumed to be acquired. Steps ST200 through ST600 are then carried out in the same way as in Embodiment 1, yielding the result shown in FIG. 15.

Next, the attribute addition means 19 performs the attribute addition process (step ST600). FIG. 18 is a flowchart showing the operation of the attribute addition means 19, which is described below with reference to this figure.
On receiving the structuring result shown in FIG. 15 from the attribute structuring means 15, the attribute addition means 19 counts the number N of documents stored in the document DB 20. FIG. 19 shows an example of metadata stored in the document DB 20. Here the document DB 20 is assumed to contain only the metadata shown in FIG. 19 and the one document corresponding to it, so N = 1 (step ST710).

Next, the attribute addition means 19 assigns 1 to the variable k (step ST720) and obtains the number of metadata items m of the stored document D1 (step ST730). The number of metadata items is the number of "group" tags; for the metadata in FIG. 19, m = 1. The attribute addition means 19 then determines whether each metadata group of the stored document D1 is identical to a group in the acquired document.

Identity is determined by comparing a group in the document with a metadata group of the stored document: the two groups are judged identical when the number of attribute values in the group for which both the attribute name and the attribute value match is at least a fixed number α. Comparing the attribute groups in FIG. 19 and FIG. 15, three attribute names, "organization name", "location", and "president", and their attribute values match (step ST740). Here, M is the number of metadata items in the acquired document. The attribute addition means 19 determines identity from this count of three matching attribute values (step ST750). For example, with threshold α = 2, the groups are judged to be the same attribute group.
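
The identity test can be sketched as a count of matching name/value pairs compared against α. The sketch assumes each group can be represented as a dictionary with one value per attribute name; the function names and the English keys are hypothetical, and the values for "capital" and "fy2002_sales" are placeholders because the text does not state them.

```python
def count_matches(doc_group, stored_group):
    """Count attributes whose name and value both match in the two groups."""
    return sum(1 for name, value in doc_group.items()
               if stored_group.get(name) == value)


def is_same_group(doc_group, stored_group, alpha):
    """Judge the groups identical when at least alpha attributes match."""
    return count_matches(doc_group, stored_group) >= alpha


# Group from the acquired document (the three grouped values in the text).
doc_group = {
    "organization": "○×電機",
    "location": "東京都千代田区丸の内１−１−１",
    "president": "○田×男",
}
# Group from the stored metadata; the last two values are placeholders.
stored_group = dict(doc_group,
                    capital="(value in FIG. 19)",
                    fy2002_sales="(value in FIG. 19)")

print(count_matches(doc_group, stored_group))           # 3
print(is_same_group(doc_group, stored_group, alpha=2))  # True
```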

When the groups are the same attribute group, the attribute addition means 19 performs a tag import process (step ST760). Specifically, it detects attributes that are present in the "group" tag of the metadata in the document DB 20 but absent from the "group" tag of the document acquired by the document acquisition means 11. Comparing FIG. 15 with FIG. 19, two attributes, "capital" and "fiscal 2002 sales", meet this condition, so the attribute addition means 19 adds them to the "group" tag shown in FIG. 15. FIG. 20 shows the result of the tag import process: "capital" and "fiscal 2002 sales" have been added inside the "group" tag of FIG. 15.

Conversely, when an attribute value is present in the "group" tag of the document but absent from the "group" tag of the stored document's metadata, that attribute value is imported into the "group" tag of the metadata. Here no attribute value is present in the "group" tag of FIG. 15 but absent from that of FIG. 19, so nothing is imported. Next, the variable k is compared with N (step ST770). If k = N, the process ends; if k ≠ N, k is incremented (step ST780) and the process returns to step ST730. Here k = N = 1, so the process ends.
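
The loop over stored documents with import in both directions can be sketched as below. This is an illustration under assumptions rather than the flowchart of FIG. 18 itself: groups are modeled as dictionaries, `add_attributes` is a hypothetical name, the English keys stand in for the attribute names, and the placeholder strings stand in for values the text does not give.

```python
def add_attributes(doc_groups, stored_docs, alpha):
    """Exchange attributes between matching groups, in place.

    doc_groups  : list of dicts, the groups of the acquired document
    stored_docs : list of stored documents, each a list of group dicts
    alpha       : minimum number of matching name/value pairs
    """
    for stored_groups in stored_docs:      # k = 1 .. N
        for stored in stored_groups:       # groups of stored document k
            for doc in doc_groups:         # groups of the acquired document
                matches = sum(1 for n, v in doc.items()
                              if stored.get(n) == v)
                if matches < alpha:
                    continue               # not the same attribute group
                for name, value in stored.items():
                    doc.setdefault(name, value)     # DB -> document
                for name, value in doc.items():
                    stored.setdefault(name, value)  # document -> DB


doc_groups = [{
    "organization": "○×電機",
    "location": "東京都千代田区丸の内１−１−１",
    "president": "○田×男",
}]
stored_docs = [[dict(doc_groups[0],
                     capital="(value in FIG. 19)",
                     fy2002_sales="(value in FIG. 19)")]]

add_attributes(doc_groups, stored_docs, alpha=2)
print(sorted(doc_groups[0]))
```

Using `setdefault` means an attribute already present on either side keeps its existing value; how conflicting values should be handled is not addressed in the text, so this choice is only one possibility.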

As described above, according to Embodiment 2, the attribute addition means 19 compares the metadata groups of the document acquired by the document acquisition means 11 with the metadata groups of the documents held in the document DB 20 and, when groups are judged identical, imports attribute values between them. Attribute values that are not written in the document can therefore be imported from other documents or from a database, so content that the document itself does not describe can also be held and used as metadata.

FIG. 1 is a block diagram of the metadata extraction device according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing the operation of the metadata extraction device according to Embodiment 1.
FIG. 3 is an example of a document acquired by the document acquisition means according to Embodiment 1.
FIG. 4 shows an example of the attribute definitions held by the attribute definition DB according to Embodiment 1.
FIG. 5 is a flowchart showing the operation of the attribute assignment means according to Embodiment 1.
FIG. 6 shows the result of the morphological analysis performed by the attribute assignment means according to Embodiment 1 on the document shown in FIG. 3.
FIG. 7 shows the character strings whose attributes were extracted by the attribute assignment means according to Embodiment 1.
FIG. 8 illustrates the attribute values extracted by the attribute assignment means according to Embodiment 1.
FIG. 9 is an inter-attribute relevance table describing the relevance between the attributes shown in FIG. 4.
FIG. 10 is a flowchart showing the operation of the inter-attribute relevance determination means according to Embodiment 1 of the present invention.
FIG. 11 shows the result of calculating f₁ by the inter-attribute relevance determination means according to Embodiment 1.
FIG. 12 shows the result of calculating f₂ by the inter-attribute relevance determination means according to Embodiment 1.
FIG. 13 shows the result of calculating f₃ by the inter-attribute relevance determination means according to Embodiment 1.
FIG. 14 shows the result of calculating the relevance F by the inter-attribute relevance determination means according to Embodiment 1.
FIG. 15 shows the result of the attribute structuring process performed by the attribute structuring means according to Embodiment 1.
FIG. 16 is a block diagram of the metadata extraction device according to Embodiment 2 of the present invention.
FIG. 17 is a flowchart showing the operation of the metadata extraction device according to Embodiment 2.
FIG. 18 is a flowchart showing the operation of the attribute addition means according to Embodiment 2.
FIG. 19 shows an example of the metadata stored in the document DB according to Embodiment 2.
FIG. 20 shows the result of the tag import process performed by the attribute addition means according to Embodiment 2.

Explanation of symbols

10, 10': metadata extraction device; 11: document acquisition means; 12: character extraction means; 13: attribute assignment means (element extraction means, attribute assignment means); 14: inter-attribute relevance calculation means (relevance determination means); 15: attribute structuring means (first relevance extraction information generation means); 16: output means; 17: attribute definition DB (attribute definition storage means); 18: inter-attribute relevance DB; 19: attribute addition means; 20: document DB.

Claims

A metadata extraction device comprising:
document acquisition means for acquiring an electronic document to be processed;
element extraction means for extracting, from the electronic document acquired by the document acquisition means, the elements constituting the text described in the electronic document and position data indicating the relative position of each element within the electronic document;
attribute definition storage means for storing attribute definitions in which an attribute of an element sequence consisting of one or more elements constituting the text is defined in association with the parts of speech of the elements constituting that element sequence;
attribute assignment means for assigning attributes to the element sequences with reference to the attribute definitions;
relevance calculation means for calculating a degree of relevance indicating the relatedness between element sequences of interest among the element sequences to which the attribute assignment means has assigned attributes; and
relevance extraction information generation means for generating first relevance extraction information by attaching to each element sequence, as an identifier, the attribute assigned by the attribute assignment means, grouping element sequences according to the degree of relevance calculated by the relevance calculation means, and attaching an identifier corresponding to each group.