JP2011054198A


Info

Publication number: JP2011054198A
Application number: JP2010250068A
Authority: JP
Inventors: Kyoko Makino; 恭子牧野; Kayoko Isoo; 佳代子磯尾; Seiji Iwata; 誠司岩田
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2001-03-19
Filing date: 2010-11-08
Publication date: 2011-03-17
Anticipated expiration: 2021-09-03
Also published as: JP5117560B2

Abstract

[Problem] To correctly extract the content contained in document data.
[Solution] A document data analysis program 17 causes a computer to realize a mask concept extraction function 14 and a concept extraction function 3. The mask concept extraction function 14 refers to a first definition dictionary 15, stored in a database, which associates document elements whose usage period is shorter than a predetermined reference with mask data indicating their nature, and converts each document element of the acquired analysis-target document data that appears in the first definition dictionary 15 into its associated mask data. The concept extraction function 3 refers to a second definition dictionary 16, stored in a database, which associates document elements whose usage period is longer than the predetermined reference with their attribute data, and extracts the document elements, together with their attribute data, that are contained in both the converted document data and the second definition dictionary 16.
[Selected drawing] FIG. 10

Description

Translated from Japanese

The present invention relates to a document data analysis program that analyzes one or more input document data, a computer-based document data analysis method, and a document data analysis system.

Japanese Patent Application No. 11-332114 describes an AND search performed after individual concepts have been extracted.

Example 1 of Japanese Patent Application No. 11-332114 proposes a method that extracts concepts belonging to a plurality of classification axes and then determines composite concepts by an AND search over concepts belonging to different axes.

Example 2 of the same application proposes a method that determines the connection relations between concepts using conjunction information, concept information, and the positions of the concepts in the document.

Japanese Patent Application No. 11-332114

The invention of Japanese Patent Application No. 11-332114 mainly determines composite concepts made up of concepts in a causal relationship, such as an action and a result.

When a specific product must be correctly paired with the concepts relating to it in documents whose wording is colloquial and hard to parse, Example 1 above achieves high recall (the proportion of the information that should be extracted that is actually extracted) but low precision (the proportion of the extracted information whose content is correct), whereas Example 2 above achieves high precision but low recall.

It has also been difficult to distinguish expressions that quickly become obsolete, such as product names, from expressions that remain usable for a long time, such as concepts, and to treat them together as composite concepts.

The present invention has been made in view of these circumstances, and its object is to provide a document data analysis program, a computer-based document data analysis method, and a document data analysis system that correctly link product names with actions or results in document data that is colloquial and difficult to parse.

The specific means adopted to realize the present invention are described below.

The first invention is a document data analysis program that causes a computer to realize a mask function and an extraction function. The mask function refers to first dictionary information, which associates document elements whose usage period is shorter than a predetermined reference with mask data indicating their nature, and converts each document element of the analysis-target document data that appears in the first dictionary information into its associated mask data. The extraction function refers to second dictionary information, which associates document elements whose usage period is longer than the predetermined reference with their attribute data, and extracts the document elements, together with their attribute data, that are contained in both the converted document data and the second dictionary information.

Here, the more frequently a document element is replaced and becomes obsolete, the shorter its usage period is considered to be.

The first invention thus distinguishes document elements that are frequently replaced from document elements that remain usable for a long time.

This makes document elements easier to organize, allows the database to be built compactly, and simplifies database maintenance.

The first invention also makes it possible to extract documents effectively by using the company's own product names, other companies' product names, and the distinction between the company's own products and other companies' products.
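The two-stage flow of the first invention can be sketched in Python as below. This is only an illustration under stated assumptions: the dictionary contents, the mask strings such as `<own-product>`, and the function names `mask` and `extract` are hypothetical and do not appear in the patent. Registering the mask strings in the second dictionary is one possible reading of how masked elements are later extracted.

```python
# Hypothetical first dictionary: short-lived expressions (product names) -> mask data.
FIRST_DICTIONARY = {
    "Ａスナック": "<own-product>",    # "A snack"
    "Ｃスナック": "<other-product>",  # "C snack"
}

# Hypothetical second dictionary: long-lived expressions -> attribute data.
# The mask strings are registered here as well (an assumption of this sketch).
SECOND_DICTIONARY = {
    "<own-product>": "product: own product",
    "<other-product>": "product: other company's product",
    "売れています": "result: selling well",
}


def mask(text, first_dictionary):
    """Replace every short-lived expression in text with its mask data."""
    # Longest expressions first, so a short name never splits a longer one.
    for expression in sorted(first_dictionary, key=len, reverse=True):
        text = text.replace(expression, first_dictionary[expression])
    return text


def extract(text, second_dictionary):
    """Return (expression, attribute) pairs for dictionary entries found in text."""
    return [(e, a) for e, a in second_dictionary.items() if e in text]


masked = mask("Ａスナックは売れています。", FIRST_DICTIONARY)  # "A snack is selling well."
print(masked)
print(extract(masked, SECOND_DICTIONARY))
```

With this split, renaming or retiring a product only touches the first dictionary; the second dictionary keeps working against the stable mask strings.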

The second invention is the document data analysis program of the first invention that further causes the computer to realize an analysis function. The analysis function combines the document elements or attribute data extracted by the extraction function, based on a combination rule that defines how a plurality of attribute data are related and on the attribute data extracted by the extraction function.

This analysis function can be modified in the same way as in the sixth to thirteenth inventions described below.

In the third invention, the mask data of the first or second invention hierarchically specifies the nature of document elements whose usage period is shorter than the predetermined reference.

For example, if the mask data is "product: own product: product name", the document element can be replaced with whichever level of the hierarchy "product", "own product", "product name" suits the user's needs, which improves classification accuracy.
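As a small illustration of hierarchical mask data, the following Python sketch picks one level out of a colon-separated mask string. The string format and the helper name `mask_at_level` are assumptions made for this example; the patent only states that the mask data specifies the hierarchy.

```python
# Hypothetical hierarchical mask data, broadest level first.
MASK_DATA = "product: own product: product name"


def mask_at_level(mask_data: str, level: int) -> str:
    """Return the label at the requested depth (0 is the broadest level)."""
    levels = [part.strip() for part in mask_data.split(":")]
    # Clamp so that a request deeper than the hierarchy returns the last level.
    return levels[min(level, len(levels) - 1)]


print(mask_at_level(MASK_DATA, 0))  # product
print(mask_at_level(MASK_DATA, 1))  # own product
```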

The fourth invention is the document data analysis program of any of the first to third inventions that causes the computer to realize a function of displaying the portions of the document data that do not contain predetermined document elements.

For example, when the official name of a product is registered in the dictionary information but its abbreviation is not, presenting the portions that do not contain the official name lets the user notice the abbreviation. The function can therefore be used to update the document elements and attribute data.

The fifth invention is the document data analysis program of any of the first to fourth inventions that causes the computer to realize a registration function that updates the dictionary information according to the user's specification.

This enables analysis suited to the user's needs.

The sixth invention is a document data analysis program that causes a computer to realize an extraction function and an analysis function. The extraction function refers to dictionary information that associates the document elements constituting a document with their attribute data, and extracts the document elements, together with their attribute data, that are contained in both the analysis-target document data and the dictionary information. The analysis function combines the extracted document elements or attribute data, based on a combination rule that defines how a plurality of attribute data are related and on the attribute data extracted by the extraction function.

One combination method, for example, is an AND search that combines all the extracted document elements or attribute data.

Alternatively, the relationships among the attribute data to be combined may be defined in advance as classification axes, and the extracted document elements or attribute data may be combined according to these axes.

The analysis function may combine extracted document elements with each other, extracted attribute data with each other, or extracted document elements with extracted attribute data.

Document elements include words, phrases, clauses, and other expressions contained in a document.

As attribute data, concepts of document elements can be used, such as a product name, the distinction between the company's own products and other companies' products, an action, or a result.

With the sixth invention, it is easy to tell whether a combined result concerns the company's own product, another company's product, or some other concept. For example, for a document containing the result concept "selling well", it can be determined whether it is the company's own product that is selling, another company's product that is selling, or a particular store where the product is selling.

The seventh invention is a document data analysis program similar to that of the sixth invention. In the seventh invention, the analysis function divides the analysis-target document data according to a predetermined document division rule and combines the document elements or attribute data extracted by the extraction function within a combination range determined from the resulting divisions. This improves the precision of the combination results.

In the eighth invention, the document division rule of the seventh invention is a rule that divides the document data at full stops.

With the eighth invention, the document data is divided at each "。" (the Japanese full stop), and the extracted document elements or attribute data are combined within a combination range determined from the divided ranges.

The document data may also be divided at other punctuation marks, such as "、", "．", or "，".

A combination range may also consist of a divided range together with the range immediately preceding it, and the document elements or attribute data within this combination range may be combined.

In the ninth invention, the document division rule of the seventh invention is a rule that divides the document data immediately before a document element associated with attribute data indicating a proper noun.

In the tenth invention, the document division rule of the ninth invention includes a rule that does not treat such a document element as a division point when a document element expressing comparison is attached to it.

In the eleventh invention, the document division rule of the seventh invention is a rule that divides the document data immediately after a document element associated with attribute data indicating a result.

The document division rule of the eleventh invention may instead divide the document data immediately after a document element associated with attribute data indicating an action.

Using the document division rules of the ninth to eleventh inventions improves both precision and recall.
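The division rules of the ninth and tenth inventions can be sketched as follows. The sketch is hedged: it works on the original Japanese sample sentence of the first embodiment (because the comparison markers follow the noun in Japanese), the proper-noun list and the suffix list are hypothetical dictionary contents, and only the trailing "比" of "対〜比" is checked.

```python
# Sample sentence: "B snack sold out at the tasting event. A snack information.
# Selling at 120% relative to C snack."
TEXT = "Ｂスナックは試食会で完売。Ａスナック情報。対Ｃスナック比１２０％で売れています。"

# Hypothetical dictionary entries whose attribute data marks them as proper nouns.
PROPER_NOUNS = ["Ａスナック", "Ｂスナック", "Ｃスナック"]
# Hypothetical comparison markers that may directly follow a noun.
COMPARISON_SUFFIXES = ("より", "と比べて", "比")


def split_before_proper_nouns(text):
    """Divide text before each proper noun, skipping nouns marked as comparisons."""
    cut_points = set()
    for noun in PROPER_NOUNS:
        start = text.find(noun)
        while start != -1:
            following = text[start + len(noun):]
            # Tenth invention: a noun followed by a comparison marker is no cut point.
            if start > 0 and not following.startswith(COMPARISON_SUFFIXES):
                cut_points.add(start)
            start = text.find(noun, start + 1)
    segments, previous = [], 0
    for cut in sorted(cut_points):
        segments.append(text[previous:cut])
        previous = cut
    segments.append(text[previous:])
    return segments


for segment in split_before_proper_nouns(TEXT):
    print(segment)
```

In this sketch the sentence falls into two segments, and "Ｃスナック" stays inside the segment that begins with "Ａスナック" because it is followed by "比".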

The twelfth invention is the document data analysis program of any of the sixth to eleventh inventions in which, when a document element expressing comparison is attached to a document element extracted by the extraction function, the analysis function excludes that extracted document element from the combination candidates.

Document elements expressing comparison include, for example, "〜より" ("than ~"), "〜と比べて" ("compared with ~"), and "対〜比" ("relative to ~").

As a result, when the document data "Our product is selling better than the other company's product" is analyzed, for example, the combination result is "own product - selling well" and never "other company's product - selling well". This kind of analysis is referred to as dependency analysis.

Dependency analysis improves precision.

The thirteenth invention is the document data analysis program of any of the sixth to twelfth inventions in which the analysis function refers to correspondence information that associates document elements in a comparison relationship with one another. When a document element expressing comparison is attached to an extracted document element but the document element it is compared against has not been extracted, the analysis function determines that compared document element from the correspondence information and includes it when combining document elements or attribute data.

For example, suppose the correspondence information associates the company's own product with another company's product, and the document data "It is selling better than the other company's product" is analyzed. "Other company's product", to which the comparison marker "より" ("than") is attached, is extracted, but the item it is compared against does not appear. "Own product" is therefore determined from the correspondence information, and the combination result "own product - selling well" is obtained. This improves both precision and recall.

By using such a program, or a recording medium on which it is recorded, the functions described above can easily be added to computer systems, and to computers such as servers and clients, that do not already have them.

The document data analysis method carried out by the programs of the first to thirteenth inventions may also be claimed as an invention.

Likewise, a document data analysis system comprising means for realizing the functions realized by the programs of the first to thirteenth inventions may be claimed as an invention.

In the present invention, the content contained in analysis-target document data can be extracted correctly by referring to dictionary information that associates the document elements constituting a document with their attribute data.

In addition, by registering dictionary information separately according to usage period, document elements can be organized easily, the dictionary information can be built compactly, and the dictionary information can be maintained easily.

With the present invention, even when a large volume of document data such as daily reports, monthly reports, and sales reports is collected, information summarized in a predetermined format can be extracted from it. The user can thus grasp what the document data describes without reading all of it.

Block diagram showing the functions realized on a computer by the document data analysis program according to the first embodiment of the present invention.
Diagram illustrating an extraction result produced by the concept extraction function in the same embodiment.
Block diagram showing the functions realized on a computer by the document data analysis program according to the second embodiment of the present invention.
Diagram illustrating an extraction result produced by the concept extraction function in the same embodiment.
Diagram illustrating the result of dividing document data according to a document division rule that divides immediately before proper nouns.
Block diagram showing the functions realized on a computer by the document data analysis program according to the third embodiment of the present invention.
Block diagram illustrating the content extracted by the mask concept extraction function.
Block diagram illustrating document data converted by the mask concept extraction function.
Diagram illustrating the extraction result of the concept extraction function for document data converted into mask terms.
Block diagram showing the functions realized on a computer by the document data analysis program according to the fifth embodiment of the present invention.
Block diagram illustrating a form in which an ASP provides the service implemented by the document data analysis program.

Embodiments of the present invention are described below with reference to the drawings. In the figures, identical parts are given the same reference numerals and their description is omitted; only the parts that differ are described in detail.

(First embodiment)
This embodiment describes a document data analysis program that, for document data such as "B snack sold out at the tasting event. A snack information. Selling at 120% relative to C snack." (original: 「Ｂスナックは試食会で完売。Ａスナック情報。対Ｃスナック比１２０％で売れています。」), correctly links product-name expressions (document elements) such as "A snack", "B snack", and "C snack" with expressions for actions or results such as "tasting event", "sold out", and "selling well".

The document data analysis program of this embodiment also allows the determination rule used to link expressions to be switched.

FIG. 1 is a block diagram showing the functions realized on a computer by the document data analysis program of this embodiment.

The document data analysis program 1 of this embodiment realizes an acquisition function 2, a concept extraction function 3, an analysis function 4, a selection function 5, and a display function 6 on a computer or computer system.

The document data analysis program 1 also refers to a concept definition dictionary (database) 7 and an association table 8.

The acquisition function 2 acquires the data to be analyzed. Acquisition methods include acquiring document data that the user designates from a set of document data registered in advance, and acquiring document data entered by the user.

Document data includes report data such as a manufacturer's daily sales reports. Specifically, text such as "B snack sold out at the tasting event. A snack information. Selling at 120% relative to C snack." is acquired by the acquisition function 2 as document data.

The concept definition dictionary 7 is created according to the user's needs, based on the characteristics of the document data to be analyzed.

The concept definition dictionary 7 includes an own-company (company MA) product dictionary 9, an other-company (companies MB to MD) product dictionary 10, and an action/result dictionary 11.

The own-company product dictionary 9 has a structure that associates each expression for an own-company product with a concept (attribute data) and a concept ID. When the same product has different expressions, as when the official name "A snack" is abbreviated to "A", the same concept and concept ID are assigned to every expression of that product. A concept is defined so as to specify every level of the hierarchy to which the expression belongs, for example "product: own product: product type: product name".

The other-company product dictionary 10 has the same structure, except that it defines other companies' products instead of the company's own.

The action/result dictionary 11 has a structure that associates each expression with a concept and a concept ID. For each expression, the concept is defined in the form "action or result designation: semantic content of the expression".

The concept definition dictionary 7 may be a single file or may be divided into multiple files by content. It is desirable, however, that the broad categories of concepts (own-company product dictionary 9, other-company product dictionary 10, action/result dictionary 11) remain easy to identify.

The concept extraction function 3 receives the document data acquired by the acquisition function 2 and refers to the concept definition dictionary 7. It compares each expression in the concept definition dictionary 7 with the document data and, when an expression from the dictionary is found in the document data, records its position in the document data and its concept ID.

FIG. 2 illustrates an extraction result produced by the concept extraction function 3. Each expression that is contained in the document data and registered in the concept definition dictionary 7 is extracted together with its position and concept ID.
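The dictionary lookup described above might look like the following Python sketch. The dictionary entries, the concept strings, and the concept IDs (`P001` and so on) are invented for illustration, including which products are labelled as the company's own; the actual contents of FIG. 2 are not reproduced here.

```python
# Sample sentence: "B snack sold out at the tasting event. A snack information.
# Selling at 120% relative to C snack."
TEXT = "Ｂスナックは試食会で完売。Ａスナック情報。対Ｃスナック比１２０％で売れています。"

# Hypothetical excerpt of the concept definition dictionary:
# expression -> (concept, concept ID).
CONCEPT_DICTIONARY = {
    "Ａスナック": ("product: own product: snack: A snack", "P001"),
    "Ｂスナック": ("product: own product: snack: B snack", "P002"),
    "Ｃスナック": ("product: other company's product: snack: C snack", "Q001"),
    "試食会": ("action: tasting event", "A001"),
    "完売": ("result: sold out", "R001"),
    "売れています": ("result: selling well", "R002"),
}


def extract_concepts(text, dictionary):
    """Return (position, expression, concept ID) for every dictionary hit in text."""
    hits = []
    for expression, (_concept, concept_id) in dictionary.items():
        position = text.find(expression)
        while position != -1:
            hits.append((position, expression, concept_id))
            position = text.find(expression, position + 1)
    return sorted(hits)  # ordered by position in the document


for hit in extract_concepts(TEXT, CONCEPT_DICTIONARY):
    print(hit)
```

Positions are character offsets into the text, which is what the later determination processes need in order to reason about ranges and distances.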

The analysis function 4 extracts composite concepts by combining expressions, or concepts such as products, actions, and results, according to a plurality of classification axes. The classification axes that make up a composite concept, such as "product - action", "product - result", and "product - action - result", are defined in advance.

To determine whether a combination should be made, the analysis function 4 uses whichever of the AND search process 4a, document separation process 4b, dependency analysis process 4c, and association analysis process 4d has been selected by the selection function 5.

Taking into account the analysis performance the user requires, the selection function 5 instructs the analysis function 4 to perform the analysis using the process, among processes 4a to 4d, that the user has set.

For example, the selection function 5 presents the user with options such as "give top priority to recall", "give equal priority to precision and recall", and "give top priority to precision".

When the user wants to prioritize recall, the selection function 5 selects the AND search process 4a. When the user wants to give equal priority to precision and recall, it selects the document separation process 4b. When the user wants to prioritize precision, it selects the dependency analysis process 4c or the association analysis process 4d.

As another example, the selection function 5 may present the user with a screen for selecting among the determination processes 4a to 4d. The selection function 5 may set one determination process for all concepts collectively or a separate one for each composite concept. Likewise, it may set a determination process per classification axis or the same determination process for all classification axes.

The selection function 5 may present the user with a list of the determination processes 4a to 4d to choose from, or it may present a more abstract choice, such as a scale running from "extract everything" to "raise the accuracy rate", and let the user specify the required analysis performance.

In this embodiment the determination processes 4a to 4d can be freely selected through the selection function 5, but a predetermined determination process may instead be used without any selection by the user.

The display function 6 presents the composite concepts extracted by the analysis function 4 to the user in ranking-table form, together with the document elements for each composite concept.

The association table 8 is referred to by the association analysis process 4d. Product names in a comparison relationship (for example, a competitive relationship), such as "A snack", "B snack", and "C snack", are registered in association with one another in the association table 8.

The determination processes 4a to 4d are described in detail below.

The AND search process 4a combines all the extracted expressions or concepts.
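A minimal sketch of the AND search, assuming the extracted expressions have already been grouped by category, is shown below. The grouping, the function name `and_search`, and the category labels are assumptions of this example.

```python
from itertools import product

# Expressions extracted from the sample sentence, grouped by category
# (the grouping itself is an assumption of this sketch).
EXTRACTED = {
    "product": ["Ｂスナック", "Ａスナック", "Ｃスナック"],  # B, A, C snack
    "action": ["試食会"],                                   # tasting event
    "result": ["完売", "売れています"],                      # sold out, selling well
}


def and_search(extracted, axis):
    """Combine every extracted expression along one classification axis."""
    return list(product(*(extracted[category] for category in axis)))


pairs = and_search(EXTRACTED, ("product", "action")) + and_search(
    EXTRACTED, ("product", "result")
)
print(len(pairs))  # 9
```

Because every product is paired with every action and result, no correct pair is missed, but pairs such as ("Ｃスナック", "完売") are produced as well.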

The document separation process 4b divides the document data at each "。" and performs an AND search within each divided range. If a concept on the classification axis is missing from a divided range but is present in the immediately preceding range, the concept from the preceding range may be copied and used.
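The document separation process can be sketched as follows. The hit list, the function names, and the `carry_over` flag (which models the optional copying from the preceding range) are assumptions of this illustration rather than the patent's implementation.

```python
from itertools import product

TEXT = "Ｂスナックは試食会で完売。Ａスナック情報。対Ｃスナック比１２０％で売れています。"

# (position, category, expression) tuples, as a concept extraction step might give.
HITS = [
    (0, "product", "Ｂスナック"), (6, "action", "試食会"), (10, "result", "完売"),
    (13, "product", "Ａスナック"), (22, "product", "Ｃスナック"),
    (33, "result", "売れています"),
]


def split_ranges(text, delimiter="。"):
    """Return (start, end) index pairs, one per delimiter-terminated range."""
    ranges, start = [], 0
    for index, char in enumerate(text):
        if char == delimiter:
            ranges.append((start, index + 1))
            start = index + 1
    if start < len(text):
        ranges.append((start, len(text)))
    return ranges


def separated_search(text, hits, axis, carry_over=False):
    """AND search inside each range; optionally borrow a missing category."""
    results, previous = [], {}
    for start, end in split_ranges(text):
        current = {}
        for position, category, expression in hits:
            if start <= position < end:
                current.setdefault(category, []).append(expression)
        candidates = [
            current.get(c) or (previous.get(c, []) if carry_over else [])
            for c in axis
        ]
        results.extend(product(*candidates))
        previous = current
    return results


print(separated_search(TEXT, HITS, ("product", "result")))
print(separated_search(TEXT, HITS, ("product", "result"), carry_over=True))
```

In this sketch the second range contains "Ａスナック" but no result, so it contributes a pair only when `carry_over` is enabled.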

係り受け分析処理４ｃは、アクション又は概念の軸に対して、対応する商品を特定する場合に、「。」で区切られた範囲内又はその範囲より前において、アクション又は概念の軸に最も近い位置にあり、「〜より」「〜と比べて」「〜比」などの除外接続表現が続かない商品を探す。そして、係り受け分析処理４ｃは、探し出した商品をアクション又は概念の軸に対して、対応する商品と判断し、組み合わせを行う。 In thedependency analysis process 4c, when the corresponding product is identified with respect to the action or concept axis, the position closest to the action or concept axis is within or before the range delimited by “.”. Look for products that are not followed by excluded connection expressions such as "~ more", "compared to", and "~ ratio". Then, thedependency analysis process 4c determines that the found product is a product corresponding to the action or concept axis, and performs a combination.

対応付け処理４ｄは、自社と他社の同種の商品（競合関係にある商品）を登録している対応付けテーブル８を参照し、係り受け分析処理４ｃを実行しても対応商品が判断できない場合、除外接続表現のついた他社商品に対応する自社商品を対応商品と判断し、組み合わせを行う。 Theassociation process 4d refers to the association table 8 in which the same type of products of the company and other companies (competitive products) are registered, and if the corresponding product cannot be determined even if thedependency analysis process 4c is executed, The company's products that correspond to products of other companies with an excluded connection expression are judged as compatible products and combined.

For the document data above, "Ｂスナックは試食会で完売。Ａスナック情報。対Ｃスナック比１２０％で売れています。" ("B Snack sold out at the tasting party. A Snack information. Selling at 120% relative to C Snack."), the correct combinations of "product" and "action/result" are the following three: "B Snack - tasting party", "B Snack - sold out", and "A Snack - selling".

Evaluating the analysis accuracy of the determination processes 4a to 4d in terms of precision and recall gives the following results. The classification axes used for the combinations are "product - action" and "product - result".

In the AND search process 4a, all of the expressions or concepts extracted as shown in FIG. 2 are combined according to the classification axes. The composite concepts extracted by the AND search process 4a are therefore the following nine: "B Snack - tasting party", "B Snack - sold out", "B Snack - selling", "A Snack - tasting party", "A Snack - sold out", "A Snack - selling", "C Snack - tasting party", "C Snack - sold out", and "C Snack - selling". The precision of this result is about 33%, and the recall is 100%.
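The figures quoted for the AND search process 4a can be checked with a few lines of Python. This is a minimal sketch, not the patented implementation: the function names `and_search` and `precision_recall` are hypothetical, and the expressions are written in English for readability.

```python
from itertools import product

# Expressions extracted from the example document, grouped by classification axis.
products = ["B Snack", "A Snack", "C Snack"]
actions_results = ["tasting party", "sold out", "selling"]

# Correct combinations stated in the description.
correct = {
    ("B Snack", "tasting party"),
    ("B Snack", "sold out"),
    ("A Snack", "selling"),
}

def and_search(left, right):
    """Combine every expression on one axis with every expression on the other."""
    return set(product(left, right))

def precision_recall(extracted, expected):
    """Precision = hits / extracted, recall = hits / expected."""
    hits = len(extracted & expected)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall

pairs = and_search(products, actions_results)
precision, recall = precision_recall(pairs, correct)
print(len(pairs), round(precision, 2), recall)
```

Nine combinations are produced, of which three are correct, which gives a precision of 3/9 and a recall of 3/3.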

In the document delimiting process 4b, the document data is split at each "。" and an AND search is performed within each range. The composite concepts extracted by the document delimiting process 4b are therefore the following three: "B Snack - tasting party", "B Snack - sold out", and "C Snack - selling". Two of the three extracted combinations are correct and two of the three correct combinations are found, so the precision is about 66% and the recall is about 66%.
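The sentence-level AND search can be sketched as follows. This is an illustrative sketch that uses simple substring matching in place of the concept extraction function 3; the names `PRODUCTS`, `ACTIONS_RESULTS`, and `sentence_and_search` are assumptions made for this example.

```python
# Hypothetical dictionaries for the example; the patent keeps these in definition dictionaries.
PRODUCTS = ["Ａスナック", "Ｂスナック", "Ｃスナック"]
ACTIONS_RESULTS = ["試食会", "完売", "売れています"]

def sentence_and_search(text):
    """Split at the Japanese full stop and pair products with actions/results per segment."""
    pairs = []
    for segment in text.split("。"):
        found_products = [p for p in PRODUCTS if p in segment]
        found_others = [a for a in ACTIONS_RESULTS if a in segment]
        pairs.extend((p, a) for p in found_products for a in found_others)
    return pairs

text = "Ｂスナックは試食会で完売。Ａスナック情報。対Ｃスナック比１２０％で売れています。"
pairs = sentence_and_search(text)

correct = {("Ｂスナック", "試食会"), ("Ｂスナック", "完売"), ("Ａスナック", "売れています")}
hits = len(set(pairs) & correct)
precision = hits / len(pairs)
recall = hits / len(correct)
print(pairs, precision, recall)
```

The second segment contains a product but no action or result, and the third segment pairs the result with the competitor product, which is why both measures drop to 2/3.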

When identifying the product that corresponds to an action or concept axis, the dependency analysis process 4c searches, within the range delimited by "。" or before that range, for the product expression that is closest to the concept in question and is not followed by an excluded connection expression, and combines the two. The composite concepts extracted by the dependency analysis process 4c are therefore the following three: "B Snack - tasting party", "B Snack - sold out", and "A Snack - selling". The precision of this result is 100%, and the recall is 100%.
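A simplified version of this nearest-product rule is sketched below. It is an approximation under stated assumptions: only products that appear before the action or result are considered, matching is by plain substring search, and the names `find_all` and `dependency_pairs` are hypothetical.

```python
import re

PRODUCTS = ["Ａスナック", "Ｂスナック", "Ｃスナック"]
ACTIONS_RESULTS = ["試食会", "完売", "売れています"]
# Excluded connection expressions that may directly follow a product name.
EXCLUSION_SUFFIXES = ("より", "と比べて", "比")

def find_all(text, expressions):
    """Return (position, expression) for every occurrence, sorted by position."""
    hits = []
    for expr in expressions:
        for match in re.finditer(re.escape(expr), text):
            hits.append((match.start(), expr))
    return sorted(hits)

def dependency_pairs(text):
    """Pair each action/result with the nearest preceding product that is not excluded."""
    product_hits = find_all(text, PRODUCTS)
    pairs = []
    for pos, action in find_all(text, ACTIONS_RESULTS):
        candidates = [
            name
            for p_pos, name in product_hits
            if p_pos < pos
            and not text.startswith(EXCLUSION_SUFFIXES, p_pos + len(name))
        ]
        if candidates:
            pairs.append((candidates[-1], action))  # last candidate is the nearest one
    return pairs

text = "Ｂスナックは試食会で完売。Ａスナック情報。対Ｃスナック比１２０％で売れています。"
pairs = dependency_pairs(text)
print(pairs)
```

Because "Ｃスナック" is followed by "比", it is skipped and "売れています" falls back to "Ａスナック" from the preceding sentence, which reproduces the three correct combinations.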

When the corresponding product cannot be determined by the dependency analysis process 4c, the association analysis process 4d looks up, in the association table 8, the company's own product that corresponds to the other company's product carrying an excluded connection expression, and combines it as the corresponding product. The composite concepts created by the association analysis process 4d are therefore the following three: "B Snack - tasting party", "B Snack - sold out", and "A Snack - selling". The precision of this result is 100%, and the recall is 100%.
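The fallback to the association table can be illustrated as follows. This is a sketch under assumptions: the table is modeled as a dictionary from a competitor product to the company's own product, the specific mapping from "Ｃスナック" to "Ａスナック" is hypothetical, and `resolve_product` looks at a single segment only.

```python
PRODUCTS = ["Ａスナック", "Ｂスナック", "Ｃスナック"]
EXCLUSION_SUFFIXES = ("より", "と比べて", "比")

# Hypothetical contents of the association table: competitor product -> own product.
ASSOCIATION_TABLE = {"Ｃスナック": "Ａスナック"}

def resolve_product(segment):
    """Return a non-excluded product if one exists, else the associated own product."""
    excluded = []
    for name in PRODUCTS:
        idx = segment.find(name)
        if idx < 0:
            continue
        if segment.startswith(EXCLUSION_SUFFIXES, idx + len(name)):
            excluded.append(name)  # e.g. "Ｃスナック比" marks a comparison target
        else:
            return name
    for name in excluded:
        if name in ASSOCIATION_TABLE:
            return ASSOCIATION_TABLE[name]
    return None

from_table = resolve_product("対Ｃスナック比１２０％で売れています")
direct = resolve_product("Ｂスナックは試食会で完売")
missing = resolve_product("好評です")
print(from_table, direct, missing)
```

In the first call the only product in the segment is an excluded competitor product, so the own product registered against it is returned instead.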

As described above, when the document data analysis program 1 according to the present embodiment is used, it is possible to distinguish, even for the same result such as "selling", whether it is the company's own product that is selling, another company's product that is selling, or a product that is selling "at the tasting party". The analysis accuracy for document data can therefore be improved.

In addition, in the document data analysis program 1 according to the present embodiment, the analysis function 4 that extracts composite concepts can execute a plurality of determination processes 4a to 4d, and these determination processes 4a to 4d can be selected freely. It is therefore possible to provide flexible analysis performance that matches the quality of the document data to be analyzed and the needs of the user.

(Second Embodiment)
In the present embodiment, a modification of the document data analysis program 1 according to the first embodiment is described.

FIG. 3 is a block diagram showing the functions that the document data analysis program 1 according to the present embodiment implements on a computer.

The analysis function 4 in the present embodiment uses a proper noun delimiting process 4e, a proper noun determination process 4f, a result delimiting process 4g, and a result determination process 4h.

In the action/result dictionary 11 of the present embodiment, it is registered that the concept of "tasting party" (試食会) is "action" and that the concept of "sold out" (完売), "so-so" (今一つ), and "well received" (好評) is "result". The contents of the action/result dictionary 11 are determined, according to the document data to be analyzed, from the user's needs and from an analysis of the characteristics of the documents.

The selection function 5 takes the analysis performance required by the user into account and instructs the analysis function 4 to perform the analysis using whichever of the processes 4e to 4h the user has set.

The processes 4e to 4h are described in detail below.

The proper noun delimiting process 4e divides the document data to be analyzed according to a document division rule in which one section runs from an expression associated with a concept indicating a proper noun up to just before the next expression associated with the same concept. It then performs an AND search, within each delimited range, on the expressions or concepts extracted by the concept extraction function 3.

The proper noun determination process 4f divides the document data to be analyzed according to a document division rule that contains two rules. The first rule makes one section run from an expression associated with a concept indicating a proper noun up to just before the next expression associated with the same concept. The second rule states that an expression is not used as a delimiting reference, even if it is associated with a concept indicating a proper noun, when it carries an excluded connection expression such as "〜より", "〜と比べて", or "〜比". The proper noun determination process 4f then removes, from the expressions or concepts extracted by the concept extraction function 3, those that carry such an excluded connection expression, and performs an AND search within each delimited range.

The result delimiting process 4g divides the document data to be analyzed according to a document division rule that splits after each expression associated with a concept indicating a result, and performs an AND search within each delimited range.

The result determination process 4h divides the document data to be analyzed according to a document division rule that splits after each expression associated with a concept indicating a result. The result determination process 4h then removes, from the expressions or concepts extracted by the concept extraction function 3, those that carry an excluded connection expression such as "〜より", "〜と比べて", or "〜比", and performs an AND search within each delimited range.

Evaluating the analysis accuracy of the processes 4e to 4g in terms of precision and recall gives the following results. The classification axes used for the combinations are "product - result" and "product - action - result".

In the proper noun delimiting process 4e, the document data to be analyzed is divided into sections, each of which begins with an expression that is a proper noun, such as a product name, associated with a concept belonging to a classification axis, and ends immediately before the next occurrence of an expression associated with the same concept. An AND search is then performed, within each delimited range, on the expressions or concepts extracted by the concept extraction function 3.

For example, suppose that the following document data from a manufacturer's daily sales report is analyzed: "Ｂスナックは試食会で完売、Ｃスナックは今一つだそうです。Ａスナックも試食会開催。好評です。" ("B Snack sold out at the tasting party, and C Snack is reportedly so-so. A tasting party was also held for A Snack. It was well received.")

The concept extraction function 3 compares each expression in the concept definition dictionary 7 with the document data. When an expression identical to one in the concept definition dictionary 7 is found in the document data, it records the position in the document data and the concept ID.

FIG. 4 illustrates an extraction result produced by the concept extraction function 3. The expressions that are contained in the document data and registered in the concept definition dictionary 7 are extracted together with their positions and concept IDs.

The concept of "B Snack", "C Snack", and "A Snack" is product, the concept of "tasting party" is action, and the concept of "sold out", "so-so", and "well received" is result.

In the proper noun delimiting process 4e, the document data to be analyzed is divided according to the document division rule in which one section runs from an expression associated with a concept indicating a proper noun up to just before the next expression associated with the same concept.

FIG. 5 illustrates the result of dividing the document data according to the document division rule that splits immediately before each proper noun. The expressions associated with a concept indicating a proper noun are "B Snack", "C Snack", and "A Snack". In FIG. 5, the document data is therefore split immediately before "B Snack", "C Snack", and "A Snack".

An AND search is then performed, within each delimited range, on all expressions or concepts that belong to each classification axis.

The correct composite concepts for "Ｂスナックは試食会で完売、Ｃスナックは今一つだそうです。Ａスナックも試食会開催。好評です。" are the following three: "B Snack - tasting party - sold out", "C Snack - so-so", and "A Snack - tasting party - well received".

The composite concepts obtained by the proper noun delimiting process 4e are "B Snack - tasting party - sold out", "C Snack - so-so", and "A Snack - tasting party - well received".

As a result, the precision is 100% and the recall is 100%.
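The proper noun delimiting process 4e can be sketched as below. This is an illustrative sketch, not the patented implementation: the dictionary `CONCEPTS` stands in for the concept definition dictionary 7, and the helper names `extract`, `split_before_products`, and `combine` are assumptions.

```python
from itertools import product as cartesian

# Hypothetical stand-in for the concept definition dictionary: expression -> concept.
CONCEPTS = {
    "Ａスナック": "product", "Ｂスナック": "product", "Ｃスナック": "product",
    "試食会": "action",
    "完売": "result", "今一つ": "result", "好評": "result",
}

def extract(text):
    """Record (position, expression, concept) for every dictionary expression in the text."""
    hits = []
    for expr, concept in CONCEPTS.items():
        start = text.find(expr)
        while start != -1:
            hits.append((start, expr, concept))
            start = text.find(expr, start + 1)
    return sorted(hits)

def split_before_products(hits):
    """Start a new section immediately before each product expression."""
    sections = []
    for hit in hits:
        if hit[2] == "product" or not sections:
            sections.append([])
        sections[-1].append(hit)
    return sections

def combine(section):
    """AND search on the axes product-action-result and product-result."""
    by = {"product": [], "action": [], "result": []}
    for _, expr, concept in section:
        by[concept].append(expr)
    if by["action"]:
        return list(cartesian(by["product"], by["action"], by["result"]))
    return list(cartesian(by["product"], by["result"]))

text = "Ｂスナックは試食会で完売、Ｃスナックは今一つだそうです。Ａスナックも試食会開催。好評です。"
composites = [c for s in split_before_products(extract(text)) for c in combine(s)]
print(composites)
```

Each section contains exactly one product, so the AND search inside a section cannot pair a result with the wrong product in this example.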

In the proper noun determination process 4f, the expressions that are associated with a concept indicating a proper noun and that do not carry an excluded connection expression such as "〜より", "〜と比べて", or "〜比" are identified in the document data to be analyzed, and the document data is split immediately before each of these expressions. An AND search is then performed, within each delimited range, on those expressions or concepts extracted by the concept extraction function 3 that do not carry an excluded connection expression.

For example, suppose that the document data "ＡスナックはＢスナックと比べて売れています。" ("A Snack is selling well compared with B Snack.") is analyzed.

From this document data, the concept extraction function 3 extracts "A Snack", "B Snack", and "selling".

If this document data is divided using the document division rule that simply splits before each proper noun, it is divided into "Ａスナックは" and "Ｂスナックと比べて売れています。". In this case, the composite concept obtained by the proper noun delimiting process 4e is "B Snack - selling", and both the precision and the recall are 0%.

In the proper noun determination process 4f, by contrast, an expression that carries an excluded connection expression is not used as a delimiting reference. In the document data above, "B Snack" is followed by "と比べて" ("compared with"), so the expression "B Snack" is not used as a delimiting reference. The expression "B Snack", which carries the excluded connection expression, is also removed from the candidates for combination when the composite concept is formed.

As a result, the composite concept obtained by the proper noun determination process 4f is "A Snack - selling", and the precision and recall are both 100%.
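The difference between the processes 4e and 4f on this sentence can be reproduced with a small sketch. The flag `honor_exclusions` and the helper names are hypothetical; the sketch handles only the "product - result" axis and uses substring matching instead of a real dictionary lookup.

```python
# Hypothetical dictionary for this example: expression -> concept.
CONCEPTS = {"Ａスナック": "product", "Ｂスナック": "product", "売れています": "result"}
EXCLUSION_SUFFIXES = ("より", "と比べて", "比")

def extract(text):
    """Record (position, expression, concept) for every dictionary expression in the text."""
    hits = []
    for expr, concept in CONCEPTS.items():
        start = text.find(expr)
        while start != -1:
            hits.append((start, expr, concept))
            start = text.find(expr, start + 1)
    return sorted(hits)

def product_result_pairs(text, honor_exclusions):
    """Split before each product; optionally ignore products with an exclusion suffix."""
    hits = extract(text)
    if honor_exclusions:
        hits = [
            (pos, expr, concept)
            for pos, expr, concept in hits
            if not (concept == "product"
                    and text.startswith(EXCLUSION_SUFFIXES, pos + len(expr)))
        ]
    current_product, pairs = None, []
    for _, expr, concept in hits:
        if concept == "product":
            current_product = expr  # a product opens a new section
        elif current_product is not None:
            pairs.append((current_product, expr))
    return pairs

text = "ＡスナックはＢスナックと比べて売れています。"
naive = product_result_pairs(text, honor_exclusions=False)   # behaves like process 4e
checked = product_result_pairs(text, honor_exclusions=True)  # behaves like process 4f
print(naive, checked)
```

With the exclusion check disabled, the result is attached to "Ｂスナック"; with it enabled, "Ｂスナック" neither opens a section nor takes part in the combination.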

When the proper noun determination process 4f is applied to the document data above, "Ｂスナックは試食会で完売、Ｃスナックは今一つだそうです。Ａスナックも試食会開催。好評です。", the composite concepts "B Snack - tasting party - sold out", "C Snack - so-so", and "A Snack - tasting party - well received" are likewise obtained, and the precision and recall are both 100%.

In the result delimiting process 4g, the document data to be analyzed is divided into sections, each of which begins at the start of the text or immediately after an expression associated with a concept that belongs to a classification axis and indicates a result, and ends immediately after the next expression associated with a concept indicating a result. An AND search is then performed, within each delimited range, on the expressions or concepts extracted by the concept extraction function 3.

The proper noun delimiting process 4e and the proper noun determination process 4f use expressions associated with a concept indicating a proper noun, such as product names, as the delimiting reference. However, product names change frequently, so when maintenance of the concept definition dictionary 7 falls behind, the concept extraction function 3 may fail to extract some product names.

For example, if the expression "C Snack" is not extracted from the document data "Ｂスナックは試食会で完売、Ｃスナックは今一つだそうです。Ａスナックも試食会開催。好評です。", the composite concepts obtained by the proper noun delimiting process 4e and the proper noun determination process 4f are the three concepts "B Snack - tasting party - sold out", "B Snack - tasting party - so-so", and "A Snack - tasting party - well received", and the recall and precision are both about 66%.

Expressions associated with a concept indicating a result or an action, by contrast, change only slowly and can be extracted stably from the document data to be analyzed.

When the result delimiting process 4g divides the document data "Ｂスナックは試食会で完売、Ｃスナックは今一つだそうです。Ａスナックも試食会開催。好評です。" at the expressions associated with a concept indicating a result, the resulting sections are "Ｂスナックは試食会で完売", "、Ｃスナックは今一つ", and "だそうです。Ａスナックも試食会開催。好評です。".

Therefore, even if "C Snack" is not registered in the concept definition dictionary 7, the composite concepts obtained by the result delimiting process 4g are "B Snack - tasting party - sold out" and "A Snack - tasting party - well received", so the precision is 100% and the recall is about 66%.
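The result delimiting process 4g with "Ｃスナック" missing from the dictionary can be sketched as follows. This is a minimal sketch under assumptions: `CONCEPTS` is a hypothetical dictionary that deliberately omits "Ｃスナック", and the helper names are illustrative.

```python
from itertools import product as cartesian

# Hypothetical dictionary in which "Ｃスナック" has not been registered yet.
CONCEPTS = {
    "Ａスナック": "product", "Ｂスナック": "product",
    "試食会": "action",
    "完売": "result", "今一つ": "result", "好評": "result",
}

def extract(text):
    """Record (position, expression, concept) for every dictionary expression in the text."""
    hits = []
    for expr, concept in CONCEPTS.items():
        start = text.find(expr)
        while start != -1:
            hits.append((start, expr, concept))
            start = text.find(expr, start + 1)
    return sorted(hits)

def split_after_results(hits):
    """Close the current section immediately after each result expression."""
    sections, current = [], []
    for hit in hits:
        current.append(hit)
        if hit[2] == "result":
            sections.append(current)
            current = []
    if current:
        sections.append(current)
    return sections

def combine(section):
    """AND search on the axes product-action-result and product-result."""
    by = {"product": [], "action": [], "result": []}
    for _, expr, concept in section:
        by[concept].append(expr)
    if by["action"]:
        return list(cartesian(by["product"], by["action"], by["result"]))
    return list(cartesian(by["product"], by["result"]))

text = "Ｂスナックは試食会で完売、Ｃスナックは今一つだそうです。Ａスナックも試食会開催。好評です。"
composites = [c for s in split_after_results(extract(text)) for c in combine(s)]
print(composites)
```

The section that ends at "今一つ" contains no registered product, so it produces no combination instead of a wrong one. Two correct concepts out of two extracted gives a precision of 100%, and two found out of three expected gives a recall of about 66%.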

In the result determination process 4h, the document data to be analyzed is divided into sections, each of which begins at the start of the text or immediately after an expression associated with a concept that belongs to a classification axis and indicates a result, and ends immediately after the next expression associated with a concept indicating a result. An AND search is then performed, within each delimited range, on those expressions or concepts extracted by the concept extraction function 3 that do not carry an excluded connection expression.

As a result, the precision and recall can be improved even when the document data to be analyzed is, for example, "ＡスナックはＢスナックと比べて売れています。" ("A Snack is selling well compared with B Snack.").

By making the processes 4e to 4g described above available, analysis can be performed in a manner that matches the quality of the document data to be analyzed and the needs of the user.

(Third Embodiment)
In the present embodiment, a document data analysis program is described that distinguishes between a dictionary for registering expressions (terms) that are replaced frequently, such as product names, and a dictionary for registering expressions that can be used for a long period, such as "tasting party", "sold out", and "selling".

FIG. 6 is a block diagram showing the functions that the document data analysis program according to the present embodiment implements on a computer.

The mask concept extraction function 14 executed by the document data analysis program 13 refers to the mask term definition dictionary 15.

The mask term definition dictionary 15 has a structure in which a mask term (mask data) that can be used for a long period, a concept, and a concept ID are assigned to each expression whose period of use is shorter than a predetermined reference. In this example, specific product names have a period of use shorter than the predetermined reference (they are updated frequently), so the mask term "own product" (自社商品) or "other company's product" (他社商品) is assigned to them. For an expression that is updated frequently, a term indicating the nature of that expression can be used as the mask term.

The mask concept extraction function 14 extracts, from the document data to be analyzed, the expressions registered in the mask term definition dictionary 15 and replaces each of them with the mask term corresponding to that document element.

The processing performed by the document data analysis program 13 according to the present embodiment is described below.

For example, the document data to be analyzed, "Ｂスナックは試食会で完売。Ａスナック情報。対Ｃスナック比１２０％で売れています", carries the meaning that "A Snack", the company's own product, is selling at 120% relative to "C Snack", another company's product.

If expressions that contain a frequently changing product name of another company, such as "対Ｃスナック比" ("relative to C Snack"), are registered in a dictionary so that this meaning can be determined, the following problems arise.

First, entries such as "対Ｂスナック比" and "対Ｃスナック比" must be registered in the concept definition dictionary for each of the other companies' products, which increases the burden of building the dictionary.

Second, if frequently changing product names are registered in the concept definition dictionary, expressions that could otherwise be used for a long time have to be maintained in step with the product cycle.

Therefore, as shown in FIG. 7, the mask concept extraction function 14 extracts the expressions that are registered in the mask term definition dictionary 15 and contained in the document data to be analyzed, and records their positions and the concept IDs of their mask terms.

Next, as shown in FIG. 8, the mask concept extraction function 14 converts each extracted expression into the mask term defined in the mask term definition dictionary 15.

Then, as shown in FIG. 9, the concept extraction function 3 extracts the expressions registered in the concept definition dictionary 16 from the document data changed by the mask concept extraction function 14, and obtains their concepts and meanings. This makes it possible to determine what meaning the original document data contains. In this example, it can be determined that the original document data contains the meaning "outperforming another company's product".
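The two-stage lookup (mask first, then extract concepts) can be sketched as follows. This is an illustrative sketch: the dictionary contents are hypothetical stand-ins for the mask term definition dictionary 15 and the concept definition dictionary 16, and the meaning string is an assumption based on the example in the text.

```python
# Hypothetical mask term definition dictionary: short-lived expression -> mask term.
MASK_TERMS = {"Ａスナック": "自社商品", "Ｃスナック": "他社商品"}

# Hypothetical concept definition dictionary keyed on masked, long-lived expressions.
CONCEPT_DICTIONARY = {"対他社商品比": "comparison against another company's product"}

def apply_masks(text):
    """Replace every registered short-lived expression with its mask term."""
    for expr in sorted(MASK_TERMS, key=len, reverse=True):
        text = text.replace(expr, MASK_TERMS[expr])
    return text

def extract_concepts(masked_text):
    """Return (position, expression, meaning) for each dictionary expression found."""
    return [
        (masked_text.find(expr), expr, meaning)
        for expr, meaning in CONCEPT_DICTIONARY.items()
        if expr in masked_text
    ]

text = "Ｂスナックは試食会で完売。Ａスナック情報。対Ｃスナック比１２０％で売れています"
masked = apply_masks(text)
concepts = extract_concepts(masked)
print(masked)
print(concepts)
```

A single entry "対他社商品比" in the long-lived dictionary covers every competitor product registered in the mask dictionary, which is the maintenance benefit the description argues for.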

In the present embodiment, the analysis function 4 and the display function 6 perform the same processing as in the first embodiment.

As described above, when the document data analysis program 13 according to the present embodiment is used, the mask term definition dictionary 15, which defines proper names, is kept separate from the concept definition dictionary 16, which can be used for a long time. This allows the concept definition dictionary 16 to be kept compact. In addition, even expressions that contain the company's own product names or other companies' product names can be used effectively for concept extraction. Furthermore, because the dictionaries are separated according to how often they need updating, dictionary maintenance becomes easier.

(Fourth Embodiment)
In the present embodiment, the document data analysis program according to the third embodiment is described for the case in which mask terms are set hierarchically.

For example, instead of a single mask term such as "own product" or "other company's product", the mask term is defined so that all of the levels to which an expression belongs can be identified, such as "商品：自社製品：スナック：商品名" ("product: own product: snack: product name") or "商品：他社商品：スナック：商品名" ("product: other company's product: snack: product name").

The mask term replacement is then performed at whichever of the levels, such as "product", "own product", or "snack", meets the user's needs.

This improves the ability to classify the document data to be analyzed.
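Level-selectable masking can be sketched as below. This is a minimal sketch assuming each expression stores its hierarchy as a list ordered from the most general level to the most specific; the names `MASK_HIERARCHY` and `mask_at_level` and the sample sentence are hypothetical.

```python
# Hypothetical hierarchical mask terms, ordered from general to specific.
MASK_HIERARCHY = {
    "Ａスナック": ["商品", "自社製品", "スナック"],
    "Ｃスナック": ["商品", "他社商品", "スナック"],
}

def mask_at_level(text, level):
    """Replace each registered expression with the mask term at the chosen level."""
    for expr, path in MASK_HIERARCHY.items():
        text = text.replace(expr, path[level])
    return text

text = "ＡスナックはＣスナックと比べて売れています。"
coarse = mask_at_level(text, 0)    # every product collapses to "商品"
by_maker = mask_at_level(text, 1)  # own and competitor products stay distinguishable
print(coarse)
print(by_maker)
```

Choosing level 1 keeps the own/competitor distinction needed for comparison expressions, while level 0 or 2 groups documents more coarsely.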

(Fifth Embodiment)
In the present embodiment, a document data analysis program is described that implements a function of displaying document data when an expression or concept specified by the user has not been extracted from that document data.

FIG. 10 is a block diagram showing the functions that the document data analysis program according to the present embodiment implements on a computer.

The display function 18 implemented by the document data analysis program 17 displays together the document data from which no own-product name specified by the user was extracted, or the document data from which no other-company product name specified by the user was extracted.

If the content displayed by the display function 18 includes a product name that the user wants to newly register in a dictionary, the user enters that expression and the items to be registered (concept, mask term, and so on) into the dictionary registration function 19. As a way of entering the items to be registered, a list may be presented to the user for selection.

The dictionary registration function 19 registers the content entered by the user in the mask term definition dictionary 15. The dictionary registration function 19 can also register content entered by the user in the concept definition dictionary 16.

When the document data analysis program 17 according to the present embodiment described above is used, the user can update dictionaries that are easy to maintain, such as those for proper names. The user can therefore also maintain the dictionaries effectively.

For example, when document data from which no own product was extracted contains an abbreviation of an own product, that abbreviation can also be treated as an own product.
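The review-and-register loop can be sketched as follows. This is an illustrative sketch: the abbreviation "Ａスナ", the sample documents, and the function names are hypothetical, and the dictionary is reduced to a plain mapping from expression to mask term.

```python
# Hypothetical mask term definition dictionary restricted to own products.
own_product_terms = {"Ａスナック": "自社商品"}

def unmatched_documents(documents, terms):
    """Return the documents in which no registered own-product expression appears."""
    return [doc for doc in documents if not any(expr in doc for expr in terms)]

def register_term(terms, expression, mask_term):
    """Add a user-supplied expression and its mask term to the dictionary."""
    terms[expression] = mask_term

documents = [
    "Ａスナックは好評です。",
    "Ａスナも試食会で完売。",   # uses an abbreviation that is not registered yet
    "Ｃスナックは今一つ。",
]

before = unmatched_documents(documents, own_product_terms)
register_term(own_product_terms, "Ａスナ", "自社商品")
after = unmatched_documents(documents, own_product_terms)
print(before)
print(after)
```

After the abbreviation is registered, the second document is no longer listed as unmatched, so later analyses treat it as mentioning an own product.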

The inventions described in the above embodiments can be combined freely. The arrangement of the functions and elements described in the above embodiments may be changed as long as the same operation and function can be achieved, and the functions and elements may be freely combined or divided.

The document data analysis programs 1, 13, and 17 described in the above embodiments can be written to a recording medium 12 such as a magnetic disk (flexible disk, hard disk, etc.), an optical disc (CD-ROM, DVD, etc.), or a semiconductor memory and applied to a computer. The programs can also be transmitted over a communication medium and applied to a computer.

The computer reads the document data analysis programs 1, 13, and 17 recorded on the recording medium 12, and its operation is controlled by the programs, whereby the functions described above are implemented.

(Sixth Embodiment)
In the present embodiment, ways of using the document data analysis programs 1, 13, and 17 according to the above embodiments are described.

例えば、企業の日報データ、月報データ、営業報告データ等の文書データの数は膨大になる。 For example, the number of document data such as company daily report data, monthly report data, and business report data becomes enormous.

When the user wants to retrieve and consult document data with the desired content from this enormous volume of document data, or wants summary data for the document data, the user starts the document data analysis programs 1, 13, and 17 on a computer, has them read the document data, and has them analyze its content.

As a result, the user can create and output summary data by combining expressions contained in the document data in a predetermined format.

The user can also output only the document data whose content is, for example, "the company's products are selling well".
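
The usage described above (mask, extract, combine into summary data, then output only matching documents) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the implementation of the embodiments: substring matching stands in for the actual text analysis, the handling of comparison expressions is omitted, and all names (`MASK_TERMS`, `CONCEPTS`, `CLASSIFICATION_AXIS`, `mask`, `extract`, `summarize`) and all dictionary entries are hypothetical.

```python
# Hypothetical dictionaries; the entries are illustrative only.
MASK_TERMS = {  # short-lived terms -> mask data
    "SuperWidget 3000": "<OWN_PRODUCT>",
    "RivalGadget": "<COMPETITOR_PRODUCT>",
}
CONCEPTS = {  # long-lived terms (and mask data) -> attribute data "kind:value"
    "<OWN_PRODUCT>": "product:own",
    "<COMPETITOR_PRODUCT>": "product:competitor",
    "selling well": "sales:good",
    "not selling": "sales:poor",
}
CLASSIFICATION_AXIS = ("product", "sales")  # attribute kinds to combine


def mask(text):
    """Replace each registered short-lived term with its mask data."""
    for term in sorted(MASK_TERMS, key=len, reverse=True):
        text = text.replace(term, MASK_TERMS[term])
    return text


def extract(text):
    """Return {attribute kind: value} for the concept terms found in the text."""
    found = {}
    for term, attribute in CONCEPTS.items():
        if term in text:
            kind, value = attribute.split(":")
            found.setdefault(kind, value)
    return found


def summarize(text):
    """Combine attributes along the classification axis, or None if any is missing."""
    attrs = extract(mask(text))
    if all(kind in attrs for kind in CLASSIFICATION_AXIS):
        return tuple(attrs[kind] for kind in CLASSIFICATION_AXIS)
    return None


reports = [
    "SuperWidget 3000 is selling well in the north region.",
    "RivalGadget is not selling this month.",
    "Visited three customers today.",
]
summaries = [summarize(r) for r in reports]

# Output only the reports whose summary means "the company's products are selling well".
own_selling = [r for r, s in zip(reports, summaries) if s == ("own", "good")]
```

Because the concept dictionary refers to the mask data rather than to individual product names, only `MASK_TERMS` needs updating when product names change in this sketch.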

FIG. 11 is a block diagram illustrating a form in which an ASP (application service provider) provides the services implemented by the document data analysis programs 1, 13, and 17 according to the above embodiments.

The user 20 can easily analyze document data by using the document data analysis programs 1, 13, and 17 managed by the ASP 23 from the user's own terminal 21 via the network 22.

Furthermore, by receiving the services of the ASP 23, the user 20 can use the analysis service efficiently without having to maintain and operate the document data analysis programs 1, 13, and 17 personally.

DESCRIPTION OF SYMBOLS: 1, 13, 17 ... document data analysis program; 2 ... acquisition function; 3 ... concept extraction function; 4 ... analysis function; 4a ... AND search function; 4b ... document delimiting processing; 4c ... dependency analysis processing; 4d ... association analysis processing; 4e ... proper noun delimiting processing; 4f ... proper noun determination processing; 4g ... result delimiting processing; 4h ... result determination processing; 5 ... selection function; 6, 18 ... display function; 7, 16 ... concept definition dictionary; 8 ... association table; 14 ... mask concept extraction function; 15 ... mask term definition dictionary; 19 ... dictionary registration function.

Claims

A document data analysis program for causing a computer to realize:
a mask function that refers to first dictionary information, stored in a database, which associates document elements whose usage period is shorter than a predetermined reference, among the document elements constituting a document, with mask data indicating the nature of those document elements, and that converts each document element contained in document data to be analyzed acquired by the computer and included in the first dictionary information into the mask data associated with that document element; and
an extraction function that refers to second dictionary information, stored in the database, which associates document elements whose usage period is longer than a predetermined reference with their attribute data, and that extracts a plurality of document elements contained in the document data after conversion by the mask function and included in the second dictionary information, together with their attribute data.

The document data analysis program according to claim 1,
further causing the computer to realize an analysis function that, based on a classification axis defining a plurality of attribute data to be combined and on the plurality of document elements and their attribute data extracted by the extraction function, creates summary data combining, in accordance with the classification axis, a first document element or its first attribute data extracted by the extraction function with a second document element or its second attribute data.

The document data analysis program according to claim 2,
wherein the analysis function
excludes, from among the plurality of document elements and their attribute data extracted by the extraction function, any document element and its attribute data that is followed, in the document data after conversion by the mask function, by an exclusion connective expression signifying comparison, and combines, in accordance with the classification axis, the first document element or the first attribute data not followed by the exclusion connective expression with the second document element or the second attribute data not followed by the exclusion connective expression.

The document data analysis program according to any one of claims 1 to 3,
wherein the mask data hierarchically specifies the nature of the document elements whose usage period is shorter than the predetermined reference.

The document data analysis program according to any one of claims 1 to 4,
further causing the computer to realize a function of displaying a portion of the document data to be analyzed that is not included in the second dictionary information.

The document data analysis program according to any one of claims 1 to 5,
further causing the computer to realize a registration function that updates the second dictionary information in accordance with content specified by a user.

A document data analysis method performed by a computer, comprising:
a conversion step in which the computer refers to first dictionary information, stored in a database, which associates document elements whose usage period is shorter than a predetermined reference, among the document elements constituting a document, with mask data indicating the nature of those document elements, and converts each document element contained in document data to be analyzed acquired by the computer and included in the first dictionary information into the mask data associated with that document element; and
an extraction step in which the computer refers to second dictionary information, stored in the database, which associates document elements whose usage period is longer than a predetermined reference with their attribute data, and extracts a plurality of document elements contained in the document data after conversion by the conversion step and included in the second dictionary information, together with their attribute data.

The document data analysis method according to claim 7,
further comprising an analysis step in which the computer, based on a classification axis defining a plurality of attribute data to be combined and on the plurality of document elements and their attribute data extracted by the extraction step, creates summary data combining, in accordance with the classification axis, a first document element or its first attribute data extracted by the extraction step with a second document element or its second attribute data.

A document data analysis system comprising:
mask means for referring to first dictionary information, stored in a database, which associates document elements whose usage period is shorter than a predetermined reference, among the document elements constituting a document, with mask data indicating the nature of those document elements, and for converting each document element contained in document data to be analyzed acquired by a computer and included in the first dictionary information into the mask data associated with that document element; and
extraction means for referring to second dictionary information, stored in the database, which associates document elements whose usage period is longer than a predetermined reference with their attribute data, and for extracting a plurality of document elements contained in the document data after conversion by the mask means and included in the second dictionary information, together with their attribute data.

The document data analysis system according to claim 9,
further comprising analysis means for creating, based on a classification axis defining a plurality of attribute data to be combined and on the plurality of document elements and their attribute data extracted by the extraction means, summary data combining, in accordance with the classification axis, a first document element or its first attribute data extracted by the extraction means with a second document element or its second attribute data.