JP4010058B2

Info

Publication number: JP4010058B2
Application number: JP22293498A
Authority: JP
Inventors: 賢一沼田
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1998-08-06
Filing date: 1998-08-06
Publication date: 2007-11-21
Anticipated expiration: 2018-08-06
Also published as: JP2000057152A

Description

【０００１】
【発明の属する技術分野】
本発明は文書関連付け装置、文書閲覧装置、文書関連付けプログラムを記録したコンピュータ読み取り可能な記録媒体、及び文書閲覧プログラムを記録したコンピュータ読み取り可能な記録媒体に関し、特に文書中のあるキーワードとそのキーワードに関連する他の文書の内容を関連付ける文書関連付け装置、文書中のあるキーワードとそのキーワードに関連する他の文書の内容とが関連付けられた文書群中の文書を閲覧する文書閲覧装置、前記文書関連付け装置をコンピュータ上で実現するための文書関連付けプログラムを記録したコンピュータ読み取り可能な記録媒体、及び前記文書閲覧装置をコンピュータ上で実現するための文書閲覧プログラムを記録したコンピュータ読み取り可能な記録媒体に関する。
【０００２】
【従来の技術】
ネットワーク上に散在する電子文書群をリンクによって関連付けることが可能な、いわゆるハイパーテキストシステムが、World Wide Web(WWW) の普及により、一般に広く利用されるようになってきている。ハイパーテキストシステムでは、ある文書中のキーワードに対して、より詳しい情報を持つ他の文書の内容へのハイパーリンクを付与しておく。これによって、利用者がその文書を閲覧していて、ハイパーリンクが付与された記述に関してより詳しく知りたいと思ったときには、そのハイパーリンクを辿ることによって関連情報を知ることができる。
【０００３】
ところが、一般的にこのようなハイパーテキスト文書を作成するためには、文書の作成者が手作業でキーワードと他の文書との関連付けを行ってハイパーリンクを作成する必要があり、多大の労力と時間を要する。そこで、この問題を解決するために、文書中のキーワードを自動抽出して、他の文書から同一または同義のキーワードを含むものを検索することによって、文書間の関連付けすなわちハイパーリンクを自動的に作成することが考えられている。
【０００４】
このとき単純に同一または同義のキーワードを手がかりとして文書を関連付けるだけでは、ハイパーリンクを辿ることによって、より詳しい説明が得られるという保証がない。なぜならば、関連付けられた文書のいずれにおいても同一または同義のキーワードが一言参照されているだけでそのキーワードの説明に当たる記述がない場合が往々にしてあり得るからである。
【０００５】
この問題を解決する１つの方法として、特開平５−２０３６２号公報に開示された「文書テキスト間の連鎖自動作成システム」がある。この公報に開示された方法では、まず、文書テキストから重要キーワードを抽出し、抽出したキーワードの文書における重要度を算出する。その上で、同一のキーワードを共有する文書どうしで、キーワードの重要度の低い方の文書からキーワードの重要度の高い方の文書への、単方向の関連付けを自動生成する。この方法では、同一のキーワードを手がかりとして文書を関連付けているが、同一キーワードの文書における重要度の高い文書のほうが、重要度の低い文書よりも、そのキーワードに関してより詳しく説明されているものと仮定している。これによって、文書中のあるキーワードから、より詳しい説明が記述された他の文書に対するハイパーリンクが自動的に生成される。以下のこの方法を第１の従来技術とする。
【０００６】
また、上記問題を解決する別の方法として特開平７−３２５８２７号公報に開示された「ハイパーテキスト自動生成装置」がある。この公報には、同一または同義のキーワードを持つ文書どうしを関連付ける際に、一方の文書のキーワードから、他の文書の同一または同義のキーワードを持つ章や節の見出しに対してハイパーリンクを生成する方法が示されている。この方法では、あるキーワードが見出しに含まれる場合、見出し以下の内容において、そのキーワードについて詳しく説明されている可能性が高いと仮定している。これによって、文書中のあるキーワードから、より詳しい説明に対するハイパーリンクが自動的に生成される。以下のこの方法を第２の従来技術とする。
【０００７】
【発明が解決しようとする課題】
しかし、いずれの従来技術においても、以下のような問題点があった。
第１の従来技術では、関連付けの対象はある文書中のキーワードと他の文書全体である。そのため、関連付けられる他の文書の記述量が多い場合には、たとえ関連付けられたキーワードに対する詳しい説明が文書中に記述されていたとしても、文書中で関連する記述を見つけ出すことが困難である。
【０００８】
第２の従来技術では、ある文書中のキーワードに対して、同一または同義のキーワードが含まれる他の文書が複数存在する場合には、予め与えられた戦略に従って候補をいずれか１つに絞るようになっている。そのため、利用者が実際に知りたい情報が関連付けの対象から洩れてしまうおそれがある。なお、この問題については、例えば関連付けの対象となる候補が複数存在する場合にその候補全てを関連付けてしまうことによって洩れを防ぐことができる。しかし、この場合には、利用者が複数の関連付けられた記述を順次閲覧し、必要な情報を探すという手間がかかる。
【０００９】
さらに、上記２つの従来技術のいずれにおいても、関連付けの対象となるキーワードを自動抽出するために、文書全体に対して形態素解析を行う必要がある。形態素解析を高精度に行うには、かなり複雑な処理を行わなければならない。そのため、従来の技術を用いて大量の文書間のハイパーリンクを自動作成するには、処理に非常に時間がかかってしまうという問題点があった。
【００１０】
本発明はこのような点に鑑みてなされたものであり、文書中のキーワードを他の文書中の最小限の関連記述に関連付ける処理を高速に行うことができる文書関連付け装置を提供することを目的とする。
【００１１】
また、本発明の第２の目的は、文書中のキーワードを他の文書中の最小限の関連記述に関連付けられた文書群内の文書を閲覧するための文書閲覧装置を提供することである。
【００１２】
また、本発明の第３の目的は、文書中のキーワードを他の文書中の最小限の関連記述に関連付ける処理をコンピュータに高速に行わせることができる文書関連付けプログラムを記録したコンピュータ読み取り可能な記録媒体を提供することである。
【００１３】
また、本発明の第４の目的は、文書中のキーワードを他の文書中の最小限の関連記述に関連付けられた文書群内の文書をコンピュータを用いて閲覧するための文書閲覧プログラムを記録したコンピュータ読み取り可能な記録媒体を提供することである。
【００１４】
【課題を解決するための手段】
本発明では上記課題を解決するために、文書間の関連付けを行う文書関連付け装置において、属性が設定された複数の要素で構成され、文書識別子が設定された文書を複数格納する文書蓄積手段と、前記文書蓄積手段に格納されている複数の文書を被関連付け対象文書として順次選択し、選択した当該被関連付け対象文書中の要素のタグに対して、当該要素を一意に識別するための要素識別子を設定する識別子設定手段と、前記識別子設定手段によって前記要素識別子が設定された前記被関連付け対象文書中の特定の属性を有する処理対象要素に含まれる内容からキーワードを抽出するキーワード抽出手段と、前記キーワード抽出手段により抽出されたキーワードごとに、当該キーワードを含む、当該キーワードの抽出元となる前記被関連付け対象文書以外の文書を、前記文書蓄積手段内より検索する文書内容検索手段と、前記文書内容検索手段により検索されたキーワードを含む文書ごとに、当該文書中の当該キーワードに対応付けられ、当該キーワードに関連する他の文書中の要素を示す属性値が複数登録可能なタグに、当該キーワードの抽出元となる前記被関連付け対象文書の文書識別子と、当該被関連付け対象文書内の当該キーワードを含む前記処理対象要素の要素識別子との組からなる属性値を複数設定するキーワード関連付け手段と、を有することを特徴とする文書関連付け装置が提供される。
【００１５】
このような文書関連付け装置によれば、階層構造関連付け手段により、前記文書蓄積手段に格納されている文書が被関連付け対象文書とされ、被関連付け対象文書中の要素のタグに対して、その要素を一意に識別するための要素識別子が設定される。また、キーワード抽出手段により、被関連付け対象文書中の特定の属性を有する処理対象要素に含まれる内容からキーワードが抽出される。すると、文書内容検索手段により、キーワード抽出手段が抽出したキーワードを含む文書が文書蓄積手段内から検索される。そして、キーワード関連付け手段により、文書内容検索手段により検索された文書中のキーワードの複数登録可能なタグに、キーワードの抽出元となる被関連付け対象文書の文書識別子と、被関連付け対象文書内のキーワードを含む処理対象要素の要素識別子との組からなる属性値が複数設定される。
【００１６】
また上記課題を解決するために、構造化文書の内容を閲覧する文書閲覧装置において、属性が設定された複数の要素で構成され、文書識別子が設定された文書を複数格納する文書蓄積手段と、前記文書蓄積手段に格納されている複数の文書を被関連付け対象文書として順次選択し、選択した当該被関連付け対象文書中の要素のタグに対して、当該要素を一意に識別するための要素識別子を設定する識別子設定手段と、前記識別子設定手段によって前記要素識別子が設定された前記被関連付け対象文書中の特定の属性を有する処理対象要素に含まれる内容からキーワードを抽出するキーワード抽出手段と、前記キーワード抽出手段により抽出されたキーワードごとに、当該キーワードを含む、当該キーワードの抽出元となる前記被関連付け対象文書以外の文書を、前記文書蓄積手段内より検索する文書内容検索手段と、前記文書内容検索手段により検索されたキーワードを含む文書ごとに、当該文書中の当該キーワードに対応付けられ、当該キーワードに関連する他の文書中の要素を示す属性値が複数登録可能なタグに、当該キーワードの抽出元となる前記被関連付け対象文書の文書識別子と、当該被関連付け対象文書内の当該キーワードを含む前記処理対象要素の要素識別子との組からなる属性値を複数設定するキーワード関連付け手段と、文書閲覧要求に応じて、当該文書閲覧要求で指定された文書を前記文書蓄積手段から抽出し、表示する文書表示手段と、操作入力により、前記文書表示手段にて表示された文書中で、他の文書中の要素を示す１つ以上の属性値が登録されたタグを有する前記キーワードが選択されると、当該キーワードのタグに設定された属性値に基づいて、当該キーワードに関連する前記被関連付け対象文書中の処理対象要素を抽出する要素抽出手段と、前記要素抽出手段により抽出された前記処理対象要素の内容を表示する内容表示手段と、を有することを特徴とする文書閲覧装置が提供される。
【００１７】
このような文書閲覧装置によれば、階層構造関連付け手段により、前記文書蓄積手段に格納されている文書が被関連付け対象文書とされ、被関連付け対象文書中の要素のタグに対して、その要素を一意に識別するための要素識別子が設定される。また、キーワード抽出手段により、被関連付け対象文書中の特定の属性を有する処理対象要素に含まれる内容からキーワードが抽出される。すると、文書内容検索手段により、キーワード抽出手段が抽出したキーワードを含む文書が文書蓄積手段内から検索される。そして、キーワード関連付け手段により、文書内容検索手段により検索された文書中のキーワードの複数登録可能なタグに、キーワードの抽出元となる被関連付け対象文書の文書識別子と、被関連付け対象文書内のキーワードを含む処理対象要素の要素識別子との組からなる属性値が複数設定される。さらに、文書閲覧要求が入力されると、文書表示手段により、文書閲覧要求に応じた文書が文書蓄積手段から抽出される。この文書表示手段にて抽出された文書中で、キーワード関連付け手段により関連付けられたキーワードが選択されると、要素抽出手段により、キーワードに対して関連付けられた１又は複数の被関連付け対象文書中の関連要素が抽出される。
【００１８】
さらに、内容表示手段により、前記要素抽出手段により抽出された前記関連要素の内容が抽出され、表示される。
また上記課題を解決するために、文書間の関連付けを行うための文書関連付けプログラムを記録したコンピュータ読み取り可能な記録媒体において、属性が設定された複数の要素で構成され、文書識別子が設定された文書を複数格納する文書蓄積手段、前記文書蓄積手段に格納されている複数の文書を被関連付け対象文書として順次選択し、選択した当該被関連付け対象文書中の要素のタグに対して、当該要素を一意に識別するための要素識別子を設定する識別子設定手段、前記識別子設定手段によって前記要素識別子が設定された前記被関連付け対象文書中の特定の属性を有する処理対象要素に含まれる内容からキーワードを抽出するキーワード抽出手段、前記キーワード抽出手段により抽出されたキーワードごとに、当該キーワードを含む、当該キーワードの抽出元となる前記被関連付け対象文書以外の文書を、前記文書蓄積手段内より検索する文書内容検索手段、前記文書内容検索手段により検索されたキーワードを含む文書ごとに、当該文書中の当該キーワードに対応付けられ、当該キーワードに関連する他の文書中の要素を示す属性値が複数登録可能なタグに、当該キーワードの抽出元となる前記被関連付け対象文書の文書識別子と、当該被関連付け対象文書内の当該キーワードを含む前記処理対象要素の要素識別子との組からなる属性値を複数設定するキーワード関連付け手段、としてコンピュータを機能させることを特徴とする文書関連付けプログラムを記録したコンピュータ読み取り可能な記録媒体が提供される。
【００１９】
この記録媒体に記録された文書関連付けプログラムをコンピュータに実行させれば、上記本発明に係る文書関連付け装置の機能がコンピュータ上に構築される。
【００２０】
また上記課題を解決するために、構造化文書の内容を閲覧するための文書閲覧プログラムを記録したコンピュータ読み取り可能な記録媒体において、コンピュータを、属性が設定された複数の要素で構成され、文書識別子が設定された文書を複数格納する文書蓄積手段、前記文書蓄積手段に格納されている複数の文書を被関連付け対象文書として順次選択し、選択した当該被関連付け対象文書中の要素のタグに対して、当該要素を一意に識別するための要素識別子を設定する識別子設定手段、前記識別子設定手段によって前記要素識別子が設定された前記被関連付け対象文書中の特定の属性を有する処理対象要素に含まれる内容からキーワードを抽出するキーワード抽出手段、前記キーワード抽出手段により抽出されたキーワードごとに、当該キーワードを含む、当該キーワードの抽出元となる前記被関連付け対象文書以外の文書を、前記文書蓄積手段内より検索する文書内容検索手段、前記文書内容検索手段により検索されたキーワードを含む文書ごとに、当該文書中の当該キーワードに対応付けられ、当該キーワードに関連する他の文書中の要素を示す属性値が複数登録可能なタグに、当該キーワードの抽出元となる前記被関連付け対象文書の文書識別子と、当該被関連付け対象文書内の当該キーワードを含む前記処理対象要素の要素識別子との組からなる属性値を複数設定するキーワード関連付け手段、文書閲覧要求に応じて、当該文書閲覧要求で指定された文書を前記文書蓄積手段から抽出し、表示する文書表示手段、操作入力により、前記文書表示手段にて表示された文書中で、他の文書中の要素を示す１つ以上の属性値が登録されたタグを有する前記キーワードが選択されると、当該キーワードのタグに設定された属性値に基づいて、当該キーワードに関連する前記被関連付け対象文書中の処理対象要素を抽出する要素抽出手段、前記要素抽出手段により抽出された前記処理対象要素の内容を表示する内容表示手段、として機能させることを特徴とする文書閲覧プログラムを記録したコンピュータ読み取り可能な記録媒体が提供される。
【００２１】
この記録媒体に記録された文書閲覧プログラムをコンピュータに実行させれば、上記本発明に係る文書閲覧装置の機能がコンピュータ上に構築される。
【００２２】
【発明の実施の形態】
以下、本発明の実施の形態を図面を参照して説明する。
図１は、本発明の原理構成図である。本発明の文書関連付け装置は、以下の要素で構成される。
【００２３】
文書蓄積手段１は、階層的な論理構造の文書群を蓄積する。構造化された文書としては、ＳＧＭＬの規定に従って作成された文書などがある。
階層構造関連付け手段２は、文書蓄積手段１から被関連付け対象文書２ａを読み込み、読み込んだ被関連付け対象文書２ａを構成する各要素の上位構造と下位構造とを関連付ける。例えば、各要素に識別子を与える。そして、各要素に対して、その要素の下位構造となる要素の識別子の情報を持たせる。要素間の関連付けを行った被関連付け対象文書２ａは、文書蓄積手段１に戻す。
【００２４】
キーワード抽出手段３は、被関連付け対象文書２ａ中の特定の属性を有する処理対象要素に含まれる内容からキーワードを抽出する。例えば、表題としての属性を有する要素と、見出しとしての属性を有する要素とを、処理対象要素とする。すると、キーワード抽出手段３は、抽出元の要素の識別子と、その要素から抽出されたキーワードの集合とを対応づけたキーワード対応表３ａを内部で生成する。そして、被関連付け対象文書２ａに関するキーワード対応表３ａを文書内容検索手段４に渡す。
【００２５】
文書内容検索手段４は、キーワード抽出手段３により抽出されたキーワードに基づいて、文書蓄積手段１に蓄積されている他の文書の内容を検索する。見つけ出した文書４ａは、キーワード関連付け手段５に渡す。
【００２６】
キーワード関連付け手段５は、文書内容検索手段４により検出された文書４ａの内容中のキーワードと、キーワードの抽出元となる被関連付け対象文書２ａの処理対象要素とを関連付ける。被関連付け対象文書２ａの特定の要素への関連付けを行った文書５ａは、文書蓄積手段１に格納する。
【００２７】
このような文書関連付け装置によれば、階層構造関連付け手段２に読み込まれた被関連付け対象文書２ａは、各要素の上位構造と下位構造との関連付けが行われ、文書蓄積手段１に戻される。このとき、キーワード抽出手段３により、各要素の内容の中からキーワードが抽出される。すると、文書内容検索手段４により、抽出されたキーワードに基づいて文書蓄積手段１内の文書が検索される。検出された文書４ａはキーワード関連付け手段５に渡され、文書４ａの内容中のキーワードと、キーワードの抽出元となる被関連付け対象文書２ａの処理対象要素とが関連付けられる。そして、処理対象要素との関連付けが行われた文書５ａは、文書蓄積手段１に戻される。
【００２８】
このような処理を、文書蓄積手段１に格納されている全ての文書を被関連付け対象文書２ａとして実行すれば、ある文書中のキーワードが他の文書中の特定の要素（表題や見出し）に関連付けられ、さらに、その要素から下位構造に関連付けられる。そのため、文書蓄積手段１内の文書を閲覧する場合には、文書中のキーワードから他の文書中の必要最小限の関連付けられた内容を参照することができる。
【００２９】
しかも、関連付けに際して文書中の表題もしくは見出しなどの特定の要素だけを対象としてキーワード抽出処理を行うので、形態素解析のようなキーワード抽出に必要な煩雑な処理を文書全体に対して施す必要がなくなる。その結果、関連付けの処理効率が向上する。
【００３０】
次に、本発明の文書関連付け装置によって文書間の関連付けを行い、それらの文書を閲覧することができる文書閲覧装置を第１の実施の形態として以下に説明する。
【００３１】
図２は、本発明を適用した文書閲覧装置の構成を示す図である。この文書閲覧装置は、文書蓄積部１１、階層構造関連付け部１２、キーワード抽出部１３、文書内容検索部１４、キーワード関連付け部１５、文書抽出部１６、見出し抽出部１７、見出し選択部１８、内容抽出部１９、表示部２０、及び入力部２１から構成されている。
【００３２】
文書蓄積部１１は、表題、章の見出し、節の見出し、段落等の論理構造を有する文書群を蓄積する。
階層構造関連付け部１２は、文書蓄積部１１に蓄積された文書を読み込み、表題、見出しの階層( 章見出し、節見出しなど) 、見出しに対応する内容( 例えばある節の段落の並び) を関連付ける。
【００３３】
キーワード抽出部１３は、階層構造関連付け部１２にて関連付けられた表題および見出しの階層からキーワードを抽出する。
文書内容検索部１４は、キーワード抽出部１３にて抽出されたキーワードを用いて、文書蓄積部１１に蓄積された文書群を対象に、与えられたキーワードを内容に持つ文書を検索する。
【００３４】
キーワード関連付け部１５は、文書内容検索部１４にて検索された文書中のキーワードと、該キーワードを抽出した表題および見出しの階層を関連付ける。
文書抽出部１６は、文書蓄積部１１に蓄積された文書群から、入力部２１で利用者が入力した要求に応じて文書を抽出し、表示部２０に表示する。
【００３５】
見出し抽出部１７は、文書抽出部１６により抽出され、表示部２０に表示された文書中で、利用者が入力部２１によりキーワードを指定した場合に、指定されたキーワードで関連付けられている他の文書の表題もしくは見出しを文書蓄積部１１から抽出し、表示部２０に表示する。また、抽出された前記表題もしくは見出しのさらに下位の見出しを文書蓄積部１１から抽出し、表示部２０に表示する。
【００３６】
見出し選択部１８は、入力部２１で利用者が入力した要求に応じて、見出し抽出部１７により表題もしくは見出しが複数抽出された場合にはそのうちの１つの表題もしくは見出しを選択し、前記表題もしくは見出しに下位の見出しが複数存在する場合にはそのうちの１つの見出しを選択する。
【００３７】
内容抽出部１９は、見出し抽出部１７により抽出された表題、見出しもしくは順次抽出された下位の見出しが、その見出しに対応する内容と関連付けられている場合に、文書蓄積部１１からその内容を抽出し、表示部２０に表示する。
【００３８】
表示部２０は、文書抽出部１６により抽出された文書、見出し抽出部１７により抽出された他の文書の表題もしくは見出し、および内容抽出部１９により抽出された他の文書の内容を、画面上に表示する。
【００３９】
入力部２１は、文書抽出部１６により抽出する文書の指定、文書抽出部１６により抽出された文書中でのキーワードの選択、見出し抽出部１７により抽出された表題もしくは見出しが複数存在する場合の選択の指示等を行う。
【００４０】
次に、このような構成の文書閲覧装置により、文書蓄積部１１に格納されている文書群に対して文書間の関連付けを行う手順について説明する。
図３は、文書間の関連付けを行う手順を示すフローチャートである。以下の処理をステップ番号に沿って説明する。
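[S1] The hierarchical structure association unit 12 reads one unprocessed document from the document storage unit 11.
[S2] The hierarchical structure association unit 12 analyzes the structure of the document it has read.
[S3] The hierarchical structure association unit 12 associates the title, the headings, and the content with one another.
[S4] The keyword extraction unit 13 extracts keywords from the text of the title and headings.
[S5] The document content search unit 14 searches the document storage unit 11 for documents that contain a keyword extracted by the keyword extraction unit 13.
[S6] The keyword association unit 15 associates each portion of a retrieved document that matches a keyword with the title or heading from which that keyword was extracted.
[S7] The keyword association unit 15 stores the documents whose keyword association is complete in the document storage unit 11.
[S8] The hierarchical structure association unit 12 determines whether every document stored in the document storage unit 11 has been processed. If so, the inter-document association process ends; otherwise the process returns to step S1 and handles an unprocessed document.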
【００４１】
このような処理を行うことにより、各文書の内容に含まれるキーワードから、そのキーワードを表題もしくは見出しとして含む文書の該当する表題若しくは見出しへリンクを張ることができる。
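As an illustration only, the loop of steps S1 to S8 might be sketched as follows in Python. The in-memory `store` layout, the `extract_keywords` stand-in (a plain word split rather than morphological analysis), the stop-word set, and the sample texts are all assumptions made for this sketch. Structure analysis (S2, S3) and write-back to storage (S7) are omitted.

```python
import re

# Hypothetical stop words; the embodiment registers unsuitable words in advance.
STOP_WORDS = {"a", "and", "by", "of", "the", "to"}

def extract_keywords(text):
    """Stand-in for morphological analysis: split the text into words."""
    words = re.findall(r"\w+", text)
    return {w for w in words if w.lower() not in STOP_WORDS}

def associate(store):
    """store maps doc_id -> {"heads": {elem_id: text}, "paras": {elem_id: text}}.

    Returns (linking_doc, linking_para, keyword, "target_doc.target_elem") tuples.
    """
    links = []
    for doc_id, doc in store.items():                   # S1, S8: visit every document
        for elem_id, text in doc["heads"].items():      # S4: titles and headings only
            for kw in sorted(extract_keywords(text)):
                for other_id, other in store.items():   # S5: search the other documents
                    if other_id == doc_id:
                        continue
                    for para_id, para in other["paras"].items():
                        if kw in para:                  # S6: keyword -> source heading
                            links.append((other_id, para_id, kw, f"{doc_id}.{elem_id}"))
    return links

store = {
    "d1": {"heads": {"t1": "Electronic publishing by SGML"},
           "paras": {"p1": "Markup separates structure from layout."}},
    "d2": {"heads": {"t1": "Conversion workflow"},
           "paras": {"p1": "Convert the manuscript to SGML."}},
}
links = associate(store)
print(links)
```

Only the title and heading text passes through `extract_keywords`; the body paragraphs are touched only by the cheaper substring search, which is the cost saving the embodiment relies on.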
【００４２】
以下に、具体例を用いて処理内容の詳細を説明する。なお、以下の例では、表題、見出し等の論理構造を有する文書の一例として、国際規格であるＳＧＭＬ(Standard Generalized Markup Language; ISO8879) に基づく表現を用いているが、表題、見出し、見出しに対応する内容が表現できる体系であればＳＧＭＬでなくともよい。
【００４３】
まず、階層構造関連付け部１２が、文書蓄積部１１に蓄積された文書を１つ読み込む（ステップＳ１）。ここで、以下のような文書を読み込んだものとする。図４は、関連付けの対象となるキーワードを見出しに含む文書の第１の例を示す図である。この文書３１は、以下のような構造定義に従って作成されている。
【００４４】
文書中の各要素は、その開始と終了を示すタグによって囲まれている。ある要素Ａについて、開始タグは＜Ａ＞、終了タグは＜／Ａ＞で示される。文書は、文書の開始を示すタグ＜doc ＞と、文書の終了を示すタグ＜／doc ＞によって囲まれている。文書要素(doc) は表題を示す要素(title) と章を示す要素(sect1) の並びとを包含している。章要素(sect1) は見出しを示す要素(head)と段落を示す要素(para)の並びとを包含しているか、もしくは、見出し要素(head)と節を示す要素(sect2) の並びを包含している。節要素(sect2) は見出し要素(head)と段落要素(para)の並びを包含している。また、表題要素(title) 、見出し要素(head)、段落要素(para)は、その内容としてテキスト（文字列）を持つ。
【００４５】
なお、本実施の形態で例示する文書では、要素の名前としてdoc 、title 、sect1 、sect2 、head、paraを用いているが、文書中で表題、見出し、本文が特定できれば、名前はなんでもよい。また、章や節の構造はさらに深く入れ子になっていてもよい。例えば、節要素(sect2) がさらに下位の節要素(sect3) を含むようになっていてもよい。
【００４６】
このような文書３１を読み込んだ階層構造関連付け部１２は、読み込んだ文書の表題、見出し、段落等の文書構造を解析し、文書中の各要素に一意な識別子を付与する（ステップＳ２）。
【００４７】
図５は、各要素に一意な識別子を付与した文書を示す図である。この図では、各要素に属性名「id」の値として識別子を付与している。この文書３２では、文書要素(doc) に「d1」という識別子を付与している。文書要素の識別子が、文書３２自身の識別子となる。そのため、文書要素の識別子は、文書蓄積部１１に格納されている文書の中で一意に識別できるような記号が用いられる。
【００４８】
文書３２中の文書要素以外の要素に関しては、文書３２内において一意に識別できればよい。ここでは、表題要素(title) に「t1」という識別子を付与し、章要素(sect1) にそれぞれ「s1」、「s2」、「s3」という識別子を付与し、見出し要素(head)にそれぞれ「h1」、「h2」、「h3」という識別子を付与し、段落要素(para)にそれぞれ「p1」、「p2」、「p3」、「p4」という識別子を付与している。
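The identifier assignment of step S2 could be sketched roughly as below. This is a minimal sketch under assumptions: Python's standard library has no SGML parser, so an XML rendering of the document is used as a stand-in. The `PREFIX` table, the English heading texts, and the "..." paragraph bodies are placeholders chosen to reproduce identifiers of the form d1, t1, s1, h1, p1.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SOURCE = """<doc><title>Electronic Publishing with SGML</title>
<sect1><head>Introduction</head><para>...</para></sect1>
<sect1><head>History of Electronic Publishing</head><para>...</para></sect1>
<sect1><head>Related Tools</head><para>...</para><para>...</para></sect1>
</doc>"""

# Hypothetical prefix per element name; sect1 and sect2 share one counter.
PREFIX = {"title": "t", "sect1": "s", "sect2": "s", "head": "h", "para": "p"}

def assign_ids(root, doc_id):
    """Give the doc element a store-wide id and every other element a local id."""
    root.set("id", doc_id)
    counters = Counter()
    for elem in root.iter():              # depth-first, i.e. document order
        if elem is root:
            continue
        prefix = PREFIX[elem.tag]
        counters[prefix] += 1
        elem.set("id", f"{prefix}{counters[prefix]}")
    return root

root = assign_ids(ET.fromstring(SOURCE), "d1")
ids = [elem.get("id") for elem in root.iter()]
print(ids)
```

Only the document identifier has to be unique across the store; the other identifiers need to be unique only within their own document.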
【００４９】
次に、階層構造関連付け部１２は文書３２の表題、見出し、もしあれば下位の見出し、見出しに対応する段落の並びを関連付ける( ステップＳ３) 。本実施の形態では、文書の表題から見出しへの関連付けを、表題要素(title) の属性として見出しの識別子の並びを設定することによって表現する。また、見出しから下位の見出しへの関連付けもしくは見出しから対応する内容への関連付けは、見出し要素(head)の属性として下位の見出し要素の識別子もしくは内容となる段落要素(para)の識別子の並びを設定することによって表現する。
【００５０】
図６は、表題、見出し、内容を関連付けた文書の例を示す図である。この文書３３は、図５に示す文書３２の表題要素および見出し要素に、関連付ける見出し要素もしくは段落要素の識別子の並びを属性名「ref 」の値として付与したものである。この例では、識別子の並びを空白文字によって区切っている。例えば、表題要素(title) の下位には３つの見出し要素(head)があるため、表題要素(title) の属性名「ref 」の値は、「h1 h2 h3」となる。
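The "ref" attributes of step S3 might be derived as in the following sketch. It assumes an XML stand-in for the SGML document with identifiers already assigned; the function names `link_section` and `link_hierarchy` and the sample texts are hypothetical.

```python
import xml.etree.ElementTree as ET

SOURCE = """<doc id="d1"><title id="t1">Electronic Publishing with SGML</title>
<sect1 id="s1"><head id="h1">Introduction</head><para id="p1">...</para></sect1>
<sect1 id="s2"><head id="h2">History of Electronic Publishing</head><para id="p2">...</para></sect1>
<sect1 id="s3"><head id="h3">Related Tools</head><para id="p3">...</para><para id="p4">...</para></sect1>
</doc>"""

def link_section(section):
    """Point the section's head at its sub-headings, or else at its paragraphs."""
    head = section.find("head")
    subsections = [child for child in section if child.tag.startswith("sect")]
    if subsections:
        targets = [link_section(sub) for sub in subsections]
    else:
        targets = [para.get("id") for para in section.findall("para")]
    head.set("ref", " ".join(targets))    # identifiers separated by spaces
    return head.get("id")

def link_hierarchy(root):
    """Point the title at the headings of the top-level sections."""
    top_heads = [link_section(sect) for sect in root.findall("sect1")]
    root.find("title").set("ref", " ".join(top_heads))
    return root

root = link_hierarchy(ET.fromstring(SOURCE))
title_ref = root.find("title").get("ref")
h3_ref = root.find(".//head[@id='h3']").get("ref")
print(title_ref, "|", h3_ref)
```

Because `link_section` recurses into nested sections, the same code would cover deeper nesting such as sect2 or sect3 without change.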
【００５１】
次に、キーワード抽出部１３が階層構造関連付け部１２によって関連付けられた表題もしくは見出しからキーワードを抽出する（ステップＳ４）。キーワードの抽出方法としては、従来の形態素解析などの手法を利用すればよい。本実施の形態では、形態素解析の結果から名詞と判定された単語をキーワードとして利用する。また、ひらがな語など、キーワードになりにくいものは、予めストップワードとして登録しておき、キーワードの抽出対象から外す。キーワード抽出部１３は、要素と、その要素に含まれるキーワードとの対応関係を示すキーワード対応表を作成し、一時的に保持する。
【００５２】
図７は、キーワード対応表の例を示す図である。これは、図６に示した文書３３の表題要素(title) および見出し要素(head)と、そこから抽出したキーワードとの対応関係を示すキーワード対応表４１である。キーワード対応表４１には、「要素の種類」、「識別子」、および「キーワード」の項目が設けられている。「要素の種類」の項目には、キーワードの抽出を行った要素の種類が設定される。この例は、「表題」か「見出し」のいずれかである。「識別子」の項目には、キーワードの抽出を行った要素の識別子が設定される。「キーワード」の項目には、キーワードの抽出を行った要素に含まれていたキーワードの集合が設定される。
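A keyword correspondence table of this shape could be built as sketched below. The English renderings of the title and headings, the stop-word set, and the word-split `extract_keywords` are assumptions standing in for the morphological analysis of the embodiment. The inverted index is an extra convenience for this sketch and does not appear in the figure.

```python
import re

STOP_WORDS = {"with", "of"}   # hypothetical stop-word list

def extract_keywords(text):
    """Stand-in for noun extraction by morphological analysis."""
    return [w for w in re.findall(r"[A-Za-z]+", text) if w.lower() not in STOP_WORDS]

# (element kind, identifier, text) for the title and headings of one document
ELEMENTS = [
    ("title", "t1", "Electronic Publishing with SGML"),
    ("heading", "h1", "Introduction"),
    ("heading", "h2", "History of Electronic Publishing"),
    ("heading", "h3", "Related Tools"),
]

def build_table(elements):
    """One row per element: kind, identifier, and the keywords found in it."""
    return [{"kind": kind, "id": elem_id, "keywords": extract_keywords(text)}
            for kind, elem_id, text in elements]

def invert(table, doc_id):
    """Map each keyword to the 'doc.element' targets it was extracted from."""
    index = {}
    for row in table:
        for kw in row["keywords"]:
            index.setdefault(kw, []).append(f"{doc_id}.{row['id']}")
    return index

table = build_table(ELEMENTS)
index = invert(table, "d1")
print(index["SGML"])
```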
【００５３】
このように、文書中の表題要素および見出し要素のみに対して形態素解析処理を行うので、文書全体に対して形態素解析処理を行う必要はない。一般に文書の表題や見出しに含まれるテキストの量は、文書全体のテキスト量に比して非常に少ないので、形態素解析の処理コストを大幅に削減することができる。
【００５４】
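Next, the document content search unit 14 uses the keywords extracted by the keyword extraction unit 13 to search the content of the other documents stored in the document storage unit 11 (step S5). For example, when the documents in the document storage unit 11 are searched with the keyword "SGML" extracted from the title element (title), a document such as the following is found.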
【００５５】
図８は、関連付けの対象となるキーワードを本文中に含む文書の例を示す図である。この文書５１は、段落要素(para)の内容に含まれるテキスト「...SGML へ変換する。... 」の「SGML」が一致したことにより、検出される。なお、この文書５１は、図４に示した文書３１と同様の構造定義に従って作成された文書である。
【００５６】
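When a document 51 such as that of Fig. 8 is found, the keyword association unit 15 associates the content of the document 51 that matches the keyword with the title or heading that contains the keyword (step S6). Specifically, "SGML" in the text "...SGML へ変換する。... " ("... convert to SGML. ...") is tagged as a referencing element, and the identifier of the title element (title) of the document 33 shown in Fig. 6 is set as an attribute of that referencing element.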
【００５７】
図９は、キーワードと表題との関連付けが行われた文書の例を示す図である。この文書５２では、キーワード「SGML」は関連付けを示す要素(link)の開始タグと終了タグによって囲まれ、link要素の属性「ref 」の値として文書「d1」の表題「t1」への関連付けが設定されている。ここで属性「ref 」の値として、文書要素の識別子「d1」と表題要素の識別子「t1」を「. 」によって接続しているのは、識別子「t1」が他の文書のある要素においてたまたま使われている場合に、関連付けの対象を一意に決定できなくなることを防ぐためである。
【００５８】
なお、本実施の形態では文書要素の識別子と表題要素もしくは見出し要素とを接続するために「. 」を用いているので、要素に識別子を付与する際には識別子自身に「. 」を含めないようにする。
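The "document.element" reference format and the rule that identifiers must not contain "." could be handled as in this sketch. The helper names `make_ref`, `parse_ref`, and `tag_keyword` are hypothetical. Separating multiple references with spaces is also an assumption, borrowed from the way identifier lists are written in the "ref" attribute of title and heading elements.

```python
def make_ref(targets):
    """Join (doc_id, elem_id) pairs into a ref value such as 'd1.t1 d3.h2'."""
    items = []
    for doc_id, elem_id in targets:
        if "." in doc_id or "." in elem_id:
            raise ValueError("identifiers must not contain '.'")
        items.append(f"{doc_id}.{elem_id}")
    return " ".join(items)

def parse_ref(ref):
    """Split a ref value back into (doc_id, elem_id) pairs."""
    pairs = []
    for item in ref.split():
        doc_id, sep, elem_id = item.partition(".")
        if not (doc_id and sep and elem_id) or "." in elem_id:
            raise ValueError(f"malformed reference: {item!r}")
        pairs.append((doc_id, elem_id))
    return pairs

def tag_keyword(text, keyword, targets):
    """Wrap every occurrence of keyword in a link element carrying the ref."""
    link = f'<link ref="{make_ref(targets)}">{keyword}</link>'
    return text.replace(keyword, link)

tagged = tag_keyword("... convert to SGML. ...", "SGML", [("d1", "t1")])
print(tagged)
```

Note that `str.replace` tags every occurrence, including ones inside text that was tagged earlier, so a fuller implementation would need to skip existing link elements.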
【００５９】
また、本実施の形態では、文書要素(doc) の識別子が、文書蓄積部１１に蓄積されている文書を一意に識別できるように付与されているため、この文書要素を用いて文書を識別しているが、文書を識別するための識別子を文書全体に対して付与して、それを関連付けの識別子として用いてもよい。このような識別子としては、文書の実体がファイルである場合にはファイル名を用いたり、文書がＷＷＷ(World Wide Web)上で公開される場合にはＵＲＬ(Uniform Resource Locator)を用いたりすることができる。
【００６０】
ステップＳ４にて抽出された全てのキーワードに対して他の文書内容を検索し、ステップＳ６にてキーワードの関連付けが終了したら、関連付けされた文書は文書蓄積部１１に格納される（ステップＳ７）。このとき、関連付けの対象となった元の文書の内容は上書きされる。
【００６１】
そして、文書蓄積部１１に蓄積された全ての文書について、上記ステップＳ１〜ステップＳ７の処理が行われたかどうかを調べ（ステップＳ８）、まだ処理されていない文書があればステップＳ１へ戻って処理を継続し、全ての文書について処理が終了していれば、文書間の関連付けの処理を終了する。
【００６２】
以上の処理が行われることにより、図９に示した文書５２に対しても、階層構造の関連付けが行われる。
図１０は、図９の文書に対して階層構造の関連付けを行った結果を示す図である。この文書５３は、文書要素(doc) の識別子として「d2」が付与されている。
【００６３】
次に、本発明に基づく文書関連付け装置により、関連付けを利用して、文書中のあるキーワードから、そのキーワードに対する説明記述を参照する手順について説明する。
【００６４】
図１１は、関連付けの利用手順を示すフローチャートである。このフローチャートをステップ番号に沿って簡単に説明する。
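[S11] When the user enters a document display request through the input unit 21, the document extraction unit 16 extracts the requested document from the document storage unit 11. The content of the extracted document is displayed on the screen of the display unit 20.
[S12] The user selects a keyword through the input unit 21.
[S13] The heading extraction unit 17 refers to the association information of the keyword selected in step S12, that is, the identifiers in the "ref" attribute of its link element, and extracts from the document storage unit 11 the title or heading of the document having the corresponding identifier. Alternatively, it extracts from the document storage unit 11 the lower-level headings of the title or heading selected by the heading selection unit 18 in steps S14 and S15, described below. It then displays the extracted titles or headings on the display unit 20.
[S14] The heading selection unit 18 determines whether the heading extraction unit 17 extracted more than one heading. If so, the process advances to step S15; if only one was extracted, that title or heading is selected and the process advances to step S16.
[S15] The heading selection unit 18 selects one of the extracted titles or headings in accordance with the request the user enters through the input unit 21.
[S16] The heading selection unit 18 determines whether the selected title or heading has lower-level headings. In this embodiment, it identifies the elements whose identifiers are set as the value of the "ref" attribute of the title element (title) or heading element (head) extracted in step S13, and checks whether those elements are heading elements (head). If lower-level headings exist, the process returns to step S13; otherwise it advances to step S17.
[S17] The content extraction unit 19 extracts the elements corresponding to the content associated with the title or heading element selected in step S14 or S15, and displays them on the screen of the display unit 20.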
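The descent from a linked title through its headings to the content (steps S13 to S17) might look like the following sketch. The dictionary layout of `DOC`, the placeholder paragraph texts, and the `choose` callback that stands in for the user's selection are assumptions. The sketch starts from a single link target and does not model the case where the link itself names several targets.

```python
# One document as element_id -> record; "ref" lists the associated lower elements.
DOC = {
    "t1": {"kind": "title", "text": "Electronic Publishing with SGML", "ref": ["h1", "h2", "h3"]},
    "h1": {"kind": "head", "text": "Introduction", "ref": ["p1"]},
    "h2": {"kind": "head", "text": "History of Electronic Publishing", "ref": ["p2"]},
    "h3": {"kind": "head", "text": "Related Tools", "ref": ["p3", "p4"]},
    "p1": {"kind": "para", "text": "(paragraph p1)", "ref": []},
    "p2": {"kind": "para", "text": "(paragraph p2)", "ref": []},
    "p3": {"kind": "para", "text": "(paragraph p3)", "ref": []},
    "p4": {"kind": "para", "text": "(paragraph p4)", "ref": []},
}

def browse(doc, start_id, choose):
    """Follow ref attributes downward until the targets are content, not headings."""
    current = start_id                               # S13: element named by the link
    while True:
        targets = doc[current]["ref"]
        has_subheadings = bool(targets) and all(
            doc[t]["kind"] == "head" for t in targets)   # S16
        if not has_subheadings:
            return [doc[t]["text"] for t in targets]     # S17: show the content
        # S13 again: the lower headings are listed; S14/S15: pick one of them
        current = targets[0] if len(targets) == 1 else choose(targets)

def pick_related_tools(heading_ids):
    """Hypothetical user choice: always select heading h3."""
    return "h3"

content = browse(DOC, "t1", pick_related_tools)
print(content)
```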
【００６５】
以下に、関連付けの利用に関する処理を具体例を用いて説明する。
まず利用者が図１０に示した文書５３の表示要求を入力部２１により指示したものとする。すると、文書５３の内容が表示部２０の画面に表示される。
【００６６】
図１２は、文書の内容を表示した際の表示画面の例を示す図である。この表示画面６１では、文書中のタグにより表題、見出し、段落、関連付けられたキーワードなどを識別し、それぞれに対して適切なレイアウトを定めて画面表示を行っている。例えば表題は大きめのフォントでセンタリングして表示し、見出しは大きめのフォントで番号を付与して表示し、他の文書の見出し等に関連付けられたキーワードは下線を付与して強調している。
【００６７】
次に、利用者が、表示部２０に表示された文書を参照し、関連付けの付与された「SGML」の表示箇所をマウスでクリックするなどの方法で選択したものとする（ステップＳ１２）。すると、見出し抽出部１７が、選択されたキーワード「SGML」の関連付け情報すなわちlink要素の属性「ref 」の識別子を参照し、文書蓄積部１１から該当する識別子「d1」を持つ文書３３内の該当する表題「t1」を抽出し、表示部２０に表示する（ステップＳ１３）。
【００６８】
図１３は、見出しを表示した際の表示画面の例を示す図である。前述の関連付けの処理によりキーワード「SGML」は関連付けを示すlink要素によってタグ付けされており、その属性「ref 」の値として「d1.t1 」が設定されているので、図６に示した文書３３の表題要素( 識別子は「t1」) が見出し抽出部１７により抽出され、表題要素の内容「SGMLによる電子出版」を含む表示画面６２が、表示部２０により表示される。
【００６９】
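At this point the heading selection unit 18 determines whether more than one title was extracted (step S14); in this example only one title or heading was extracted. The heading selection unit 18 therefore determines whether the extracted title has lower-level headings associated with it (step S16). In this example, three elements, "h1 h2 h3", are associated as the value of the "ref" attribute of the title element with identifier "t1", and all of them are heading elements. The process therefore returns to step S13 and the headings are extracted.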
【００７０】
図１４は、下位の見出しを表示した際の表示画面の例を示す図である。これは、図１３に示した表示画面６２の例から、「SGMLによる電子出版」を内容に持つ表題要素に関連付けられている下位の見出しを表示部２０に表示したときの表示画面６３の例を示したものである。すなわち、図６に示した文書３３において、識別子「t1」を持つ表題要素の属性「ref 」の値として設定されている３つの見出し要素( 識別子はh1、h2、h3) の内容「はじめに」「電子出版の歴史」「関連ツール」を抽出し、表示部２０の画面に表示している。
【００７１】
ここで、再び見出し選択部１８が、抽出された見出しが複数であるか否かの判断を行う（ステップＳ１４）。ここでは、３つの見出しが抽出されているので、利用者は表示部２０に表示されている複数の表題もしくは見出しから入力部２１により１つを選択する（ステップＳ１５）。この例では、図１４に表示されている３つの見出しの内容のうち「関連ツール」をマウス等で選択したものとする。
【００７２】
すると、見出し選択部１８が、選択された見出し「関連ツール」に関連付けられた下位の見出しが存在するかどうかを判定する（ステップＳ１６）。図６に示した文書３３において、「関連ツール」を内容に持つ見出し要素( 識別子は「h3」) の属性「ref 」の値として設定されている識別子p3、p4、．．．の要素はいずれも見出しではない。したがって、内容抽出部１９が、内容の抽出を行う（ステップＳ１７）。
【００７３】
図１５は、内容を表示した際の表示画面の例を示す図である。これは、図１４に示した表示画面６３の例から、「関連ツール」を内容に持つ見出し要素に関連付けられている内容を表示部２０に表示したときの表示画面６４の例である。すなわち、図６に示した文書３３において、識別子「h3」を持つ見出し要素の属性「ref 」の値として設定されている段落要素（識別子p3、p4、．．．）の内容を抽出し、表示部２０に表示する。
【００７４】
このように、関連する内容の候補が複数存在する場合にも、見出しを表示して選択することにより必要最小限の関連付けられた内容を参照することができる。また、表示部２０に表示される表題もしくは見出しから、利用者が内容を参照する必要がないと判断した場合は、内容の参照を行う前に処理を中断することも可能である。したがって、利用者は内容の詳細を全て読むことなく必要な情報を効率良く見つけることが可能である。
【００７５】
次に、第２の実施の形態について説明する。第２の実施の形態は、ある文書内容中のキーワードに対して、他の文書の表題もしくは見出しが複数関連付けられている場合に、関連付けられた内容をさらに効率的に抽出できるようにした文書閲覧装置である。なお、第２の実施の形態の構成要素は、図２に示した第１の実施の形態の構成要素と同じであるため、図２に示した構成を用いて第２の実施の形態を説明する。また、第２の実施の形態における文書間の関連付け処理は、第１の実施の形態と同様であるため説明を省略する。
【００７６】
そこで、第２の実施の形態による関連付け参照処理について、以下に説明する。
図１６は、第２の実施の形態における関連付け参照の処理の流れを示すフローチャートである。以下の処理をステップ番号に沿って説明する。
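[S21] When the user specifies, through the input unit 21, a document to extract from the document group stored in the document storage unit 11, the document extraction unit 16 extracts the specified document and displays it on the display unit 20.
[S22] Referring to the document displayed on the display unit 20, the user selects the displayed location of a keyword that has an association, for example by clicking it with a mouse through the input unit 21.
[S23] The heading extraction unit 17 refers to the association information of the keyword selected in step S22, that is, the identifiers in the "ref" attribute of its link element, and extracts from the document storage unit 11 the titles or headings of the documents having the corresponding identifiers.
[S24] The heading extraction unit 17 determines whether one or more than one title or heading was extracted in step S23. If more than one, the process advances to step S25; if only one, it advances to step S29.
[S25] The heading extraction unit 17 groups the extracted titles or headings by document.
[S26] The heading extraction unit 17 reorders the per-document groups formed in step S25 according to an importance calculated from the number of associations into the same document and the hierarchical depth of the associated titles or headings.
[S27] Within each group formed in step S25, the heading extraction unit 17 reorders the associations according to an importance calculated from the hierarchical depth of the associated title or heading.
[S28] The user selects, through the input unit 21, one of the titles or headings displayed on the display unit 20.
[S29] When only one title or heading was extracted in step S23, or when a heading has been selected in step S28, the heading extraction unit 17 determines whether that title or heading has lower-level headings associated with it. If it does, the process returns to step S23 to extract them; if not, it advances to step S30.
[S30] The content extraction unit 19 extracts the elements corresponding to the content associated with the selected title or heading element, and displays them on the screen of the display unit 20.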
【００７７】
このようにして、ある文書内容中のキーワードに対して、他の文書の表題もしくは見出しが複数関連付けられている場合に、関連付けられた内容を効率的に抽出することができる。以下にこの処理の詳細を、具体例を用いて説明する。
【００７８】
本実施の形態では、第１の実施の形態で示した文書以外に、関連付けの対象となるキーワード「SGML」を表題に含む次のような文書が、文書蓄積部１１に格納されているものとする。
【００７９】
図１７は、関連付けの対象となるキーワードを表題に含む文書の第２の例を示す図である。この文書７１には、文書要素(doc) に「d3」という識別子が付与されている。また、「id="t1" 」の表題要素(title) 、「id="h2" 」の見出し要素(head)、および「id="h3" 」の見出し要素(head)の内容に「SGML」のキーワードが含まれている。
【００８０】
図１８は、関連付けの対象となるキーワードを表題に含む文書の第３の例を示す図である。この文書８１には、文書要素(doc) に「d4」という識別子が付与されている。また、「id="h21"」の見出し要素(head)と「id="h22"」の見出し要素(head)との内容に「SGML」のキーワードが含まれている。
【００８１】
図４に示した文書３１に加え、図１７，図１８に示した文書７１，８１に対して関連付け処理が行われると、図８に示した文書５１は以下のように、他の文書の表題もしくは見出しに関連付けられる。
【００８２】
図１９は、キーワードと表題もしくは見出しとの関連付けを行った文書の例を示す図である。この図に示すように、文書５４は、他の複数の文書の表題もしくは見出しに関連付けられている。すなわち、図１９において、キーワード「SGML」に対してそれをタグ付けするlink要素の属性によって、文書「d1」の表題「t1」( 内容は「SGMLによる電子出版」) 、文書「d3」の表題「t1」( 内容は「SGMLへの招待」) 、見出し「h2」( 内容は「SGMLとHTML」) および見出し「h3」( 内容は「SGMLとXML 」) 、文書「d4」の見出し「h21 」( 内容は「SGML文書の検索」) および見出し「h22 」( 内容は「SGMLデータベースシステム」) の合計６個の表題もしくは見出しが関連付けられている。
【００８３】
以下、このように関連付けられている文書群を対象として、図１６に示したフローチャートに沿って関連付け参照の処理の流れを説明する。
まず利用者が文書蓄積部１１に蓄積された文書群から抽出する文書を入力部２１により指示すると、文書抽出部１６は、指示された文書を抽出し、表示部２０に表示する（ステップＳ２１）。ここで表示部２０に表示される文書は図１９に示した文書５４であるものとする。図１９に示す文書５４を表示部２０に表示した場合、link要素の属性値は画面上に表示されないので、第１の実施の形態の場合と同じく図１２に示すように表示画面６１が表示される。
【００８４】
次に、利用者が表示部２０に表示された文書５４を参照し、入力部２１より関連付けの付与されたキーワード「SGML」の表示箇所をマウスでクリックするなどの方法で選択する（ステップＳ２２）。見出し抽出部１７は、ステップＳ２２にて選択されたキーワードの関連付け情報すなわちlink要素の属性「ref 」の識別子を参照し、文書蓄積部１１から該当する識別子を持つ文書の表題もしくは見出しを抽出する（ステップＳ２３）。
【００８５】
次に、見出し抽出部１７は、ステップＳ２３にて抽出された表題もしくは見出しが１つであるか複数であるかを判定する（ステップＳ２４）。図１９に示した例では、合計６個の表題もしくは見出しが抽出されるので、ステップＳ２５へ進む。
【００８６】
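Next, when the heading extraction unit 17 has determined in step S24 that multiple titles or headings were extracted, it groups them by document (step S25). For the document 54 of Fig. 19, the title "t1" of document "d1" forms one group; the title "t1", heading "h2", and heading "h3" of document "d3" form a second group; and the headings "h21" and "h22" of document "d4" form a third group.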
【００８７】
このように、抽出された表題もしくは見出しを文書ごとにグループ化することで、同一文書内の関連する記述を連続して参照することができるようになる。
次に、見出し抽出部１７は、ステップＳ２５にてまとめられた文書ごとの関連付けのグループを、同一文書内への関連付けの数、および関連付けられる表題もしくは見出しの階層の深さから算出される重要度に応じて並べ替える（ステップＳ２６）。本実施の形態では文書ごとの重要度を次の式によって算出する。
【００８８】
【数１】
\[ \text{importance} = \sum_{i=1}^{n} 2^{-d_i} \qquad (1) \]
【００８９】
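In equation (1), n is the largest of the numbers assigned, starting from 1, to the titles and headings associated in that document. d_i is the depth in the hierarchy of the title or heading to which number i is assigned, with the title at depth 0. That is, d_i = 0 for a title, d_i = 1 for a first-level heading, d_i = 2 for a second-level heading, and so on. Computing the importance of each document according to equation (1): the document 33 of Fig. 6 has only the title "t1" associated, so its importance is 2^-0 = 1; the document 71 of Fig. 17 has three associated elements, the title "t1" and the headings "h2" and "h3", so its importance is 2^-0 + 2^-1 + 2^-1 = 2; and the document 81 of Fig. 18 has two associated elements, the headings "h21" and "h22", so its importance is 2^-2 + 2^-2 = 0.5. The groups are therefore reordered by document importance in the order document "d3", document "d1", document "d4".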
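The grouping of step S25 and the per-document importance of step S26 can be checked with a short sketch. It assumes that the importance is the sum of 2 to the power of minus the depth over the associated titles and headings, which is what the worked values 1, 2, and 0.5 imply. The tuple layout of `linked` is hypothetical; the identifiers d1, d3, and d4 are those given to the documents of Figs. 6, 17, and 18.

```python
from collections import defaultdict

def importance(depths):
    """Sum of 2**(-depth) over the associated titles and headings of one document."""
    return sum(2.0 ** -d for d in depths)

# (doc_id, elem_id, depth): title = 0, first-level heading = 1, second-level = 2
linked = [
    ("d1", "t1", 0),
    ("d3", "t1", 0), ("d3", "h2", 1), ("d3", "h3", 1),
    ("d4", "h21", 2), ("d4", "h22", 2),
]

# S25: group the associated titles and headings by document
groups = defaultdict(list)
for doc_id, elem_id, depth in linked:
    groups[doc_id].append((elem_id, depth))

# S26: score each group and order the groups by descending importance
scores = {doc_id: importance(depth for _, depth in items)
          for doc_id, items in groups.items()}
order = sorted(groups, key=lambda doc_id: scores[doc_id], reverse=True)
print(scores, order)
```

A document scores higher when more of its headings match and when the matches sit nearer the top of its hierarchy, which is the behaviour the embodiment asks of any importance measure.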
【００９０】
なお、文書ごとの重要度の算出方法は、式（１）に示したものに限定されるわけではない。関連付けられる表題もしくは見出しが多いほうが重要度がより高くなるように、また、関連付けられる表題もしくは見出しの階層の深さが浅いほうが重要度がより高くなるように重要度を決めればよい。このような重要度の決定方法は、同一文書内で関連付けられる表題もしくは見出しが多いほうが、そのキーワードが文書全体の主題に関係する可能性が高いと考えられ、また、関連付けられる表題もしくは見出しの階層の深さが浅いほうが、そのキーワードについてより包括的に説明されている可能性が高いと考えられるので、有効な方法である。
【００９１】
次に、見出し抽出部１７は、ステップＳ２５にて文書ごとにグループ化された関連付けを、関連付けられる表題もしくは見出しの階層の深さから算出される重要度に応じて各グループ内で並び替える（ステップＳ２７）。本実施の形態では、階層の深さが浅いほうが重要度が高いものとする。また、階層の深さが同一である場合には、文書中で先に出現するほうが重要度が高いものとする。あるいは、文書中での出現順序を優先した重要度を用いてもよい。
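The ordering inside one group (step S27) amounts to a two-key sort, sketched below. The record fields `depth` and `position` are hypothetical names, where `position` is the order of appearance within the document.

```python
def sort_within_group(items):
    """Shallower headings first; ties broken by order of appearance."""
    return sorted(items, key=lambda item: (item["depth"], item["position"]))

# Associations into one document, deliberately listed out of order.
group = [
    {"id": "h3", "depth": 1, "position": 5},
    {"id": "t1", "depth": 0, "position": 0},
    {"id": "h2", "depth": 1, "position": 3},
]

ordered_ids = [item["id"] for item in sort_within_group(group)]
print(ordered_ids)
```

For the alternative mentioned in the text, which gives priority to order of appearance, the key would simply become `item["position"]`.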
【００９２】
以上の処理が行われた後、抽出された表題もしくは見出しが表示部２０に表示される。
図２０は、複数の見出しを表示する表示画面の例を示す図である。これは、図１２に示した表示画面６１中でキーワード「SGML」を選択したときに表示される表示画面１０１の例を示したものである。図２０に表示されている表題もしくは見出しは、上記処理により、文書ごとにグループ化され、重要度順に並べ替えられている。
【００９３】
次に、利用者は表示部２０に表示されている複数の表題もしくは見出しから入力部２１により１つを選択する（ステップＳ２８）。すると、見出し抽出部１７は、ステップＳ２３にて抽出された表題もしくは見出しが１つである場合またはステップＳ２８にて見出しが選択された場合に、その表題もしくは見出しに関連付けられた下位の見出しが存在するかどうかを判定する（ステップＳ２９）。
【００９４】
このように、関連付けられる表題もしくは見出しが同一文書内に複数存在する場合や、関連付けられる表題もしくは見出しを持つ文書が複数存在する場合に、重要なものから優先的に参照できるので、たとえ１つのキーワードに多量の文書の表題や見出しが関連付けられている場合でも、効率的に関連付けられた内容を参照することができる。
【００９５】
なお、上記の処理機能は、コンピュータによって実現することができる。その場合、文書関連付け装置及び文書閲覧装置が有すべき機能の処理内容は、コンピュータで読み取り可能な記録媒体に記録されたプログラムに記述しておく。そして、このプログラムをコンピュータで実行することにより、上記処理がコンピュータで実現される。コンピュータで読み取り可能な記録媒体としては、磁気記録装置や半導体メモリ等がある。市場に流通させる場合には、ＣＤ−ＲＯＭ(Compact Disk Read Only Memory) やフロッピーディスク等の可搬型記録媒体にプログラムを格納して流通させたり、ネットワークを介して接続されたコンピュータの記憶装置に格納しておき、ネットワークを通じて他のコンピュータに転送することもできる。コンピュータで実行する際には、コンピュータ内のハードディスク装置等にプログラムを格納しておき、メインメモリにロードして実行する。
【００９６】
【発明の効果】
以上説明したように、本発明の文書関連付け装置では、文書中のキーワードと被関連付け対象文書の処理対象要素とを関連付けるとともに、被関連付け対象文書中の要素の上位構造と下位構造とを関連付けるようにしたため、文書中のキーワードから他の文書中の要素及びその要素の下位構造を順次辿ることができ、必要最小限の関連付けられた内容を参照することができる。しかも、特定の要素からのみキーワードの抽出を行うため、キーワード抽出に伴う複雑な処理を限られた範囲に対して実行することができ、関連付け処理を高速に行うことが可能となる。
【００９７】
また、本発明の文書閲覧装置では、文書中のキーワードと被関連付け対象文書の処理対象要素とを関連付けるとともに、被関連付け対象文書中の要素の上位構造と下位構造とを関連付けておき、文書中のキーワードが指定されると、そのキーワードの関連要素の内容とその下位構造の内容を抽出するようにしたため、キーワードを指定したユーザは、そのキーワードに関する必要最小限の関連要素の内容を参照することができる。
【００９８】
また、本発明の文書関連付けプログラムを記録したコンピュータ読み取り可能な記録媒体では、記録された文書関連付けプログラムをコンピュータに実行させることにより、文書中のキーワードと被関連付け対象文書の処理対象要素とを関連付けるとともに、被関連付け対象文書中の要素の上位構造と下位構造とを関連付ける処理を、コンピュータに高速に行わせることが可能となる。すなわち、文書中のキーワードを他の文書の最小限の関連記述に関連付ける処理を、コンピュータに高速に行わせることができる。
【００９９】
また、本発明の文書閲覧プログラムを記録したコンピュータ読み取り可能な記録媒体では、記録された文書閲覧プログラムをコンピュータに実行させることにより、文書中のキーワードと被関連付け対象文書の処理対象要素とを関連付けるとともに、被関連付け対象文書中の要素の上位構造と下位構造とを関連付けておき、文書中のキーワードが指定されると、そのキーワードの関連要素の内容とその下位構造の内容を抽出するような処理をコンピュータに行わせることが可能となる。すなわち、コンピュータに対してキーワードを指定したユーザは、そのキーワードに関する必要最小限の関連要素の内容を参照することができる。
【図面の簡単な説明】
【図１】本発明の原理構成図である。
【図２】本発明を適用した文書閲覧装置の構成を示す図である。
【図３】文書間の関連付けを行う手順を示すフローチャートである。
【図４】関連付けの対象となるキーワードを見出しに含む文書の第１の例を示す図である。
【図５】各要素に一意な識別子を付与した文書を示す図である。
【図６】表題、見出し、内容を関連付けた文書の例を示す図である。
【図７】キーワード対応表の例を示す図である。
【図８】関連付けの対象となるキーワードを本文中に含む文書の例を示す図である。
【図９】キーワードと表題との関連付けが行われた文書の例を示す図である。
【図１０】図９の文書に対して階層構造の関連付けを行った結果を示す図である。
【図１１】関連付けの利用手順を示すフローチャートである。
【図１２】文書の内容を表示した際の表示画面の例を示す図である。
【図１３】見出しを表示した際の表示画面の例を示す図である。
【図１４】下位の見出しを表示した際の表示画面の例を示す図である。
【図１５】内容を表示した際の表示画面の例を示す図である。
【図１６】第２の実施の形態における関連付け参照の処理の流れを示すフローチャートである。
【図１７】関連付けの対象となるキーワードを表題に含む文書の第２の例を示す図である。
【図１８】関連付けの対象となるキーワードを表題に含む文書の第３の例を示す図である。
【図１９】キーワードと表題もしくは見出しとの関連付けを行った文書の例を示す図である。
【図２０】複数の見出しを表示する表示画面の例を示す図である。
【符号の説明】
１文書蓄積手段
２階層構造関連付け手段
２ａ被関連付け対象文書
３キーワード抽出手段
３ａキーワード対応表
４文書内容検索手段
４ａ文書
５キーワード関連付け手段
５ａ文書
[0001]
[Technical field of the invention]
The present invention relates to a document association apparatus, a document browsing apparatus, a computer-readable recording medium recording a document association program, and a computer-readable recording medium recording a document browsing program. More particularly, it relates to a document association apparatus that associates a keyword in a document with the content of other documents related to that keyword; a document browsing apparatus for browsing documents in a document group in which keywords are associated with the related content of other documents; a computer-readable recording medium recording a document association program for realizing the document association apparatus on a computer; and a computer-readable recording medium recording a document browsing program for realizing the document browsing apparatus on a computer.
[0002]
[Prior art]
A so-called hypertext system capable of associating a group of electronic documents scattered on a network by a link has been widely used with the spread of the World Wide Web (WWW). In the hypertext system, a hyperlink to the content of another document having more detailed information is given to a keyword in a document. As a result, when the user browses the document and wants to know more about the description to which the hyperlink is added, the related information can be known by tracing the hyperlink.
[0003]
In general, however, creating such a hypertext document requires the author to create hyperlinks by manually associating keywords with other documents, which takes a great deal of labor and time. To solve this problem, it has been proposed to create associations between documents, that is, hyperlinks, automatically by extracting keywords from a document and searching the other documents for those containing the same or synonymous keywords.
[0004]
Merely associating documents on the basis of identical or synonymous keywords, however, gives no guarantee that following a hyperlink will yield a more detailed explanation. It is often the case that the associated documents only mention the keyword in passing, and none of them contains a description that actually explains it.
[0005]
One method for solving this problem is the “system for automatically creating a chain between document texts” disclosed in Japanese Patent Laid-Open No. 5-20362. In the method disclosed in this publication, important keywords are first extracted from the document text, and the importance of each extracted keyword within its document is calculated. Then, between documents that share the same keyword, a unidirectional association is automatically generated from the document in which the keyword has lower importance to the document in which it has higher importance. This method associates documents on the basis of identical keywords, and assumes that a document in which a keyword has higher importance explains that keyword in more detail than a document in which it has lower importance. As a result, a hyperlink is automatically generated from a keyword in a document to another document that contains a more detailed explanation. Hereinafter, this method is referred to as the first prior art.
[0006]
Another method for solving the above problem is the “hypertext automatic generation device” disclosed in Japanese Patent Laid-Open No. 7-325827. This publication describes a method in which, when documents having the same or synonymous keywords are associated with each other, a hyperlink is generated from the keyword in one document to a chapter or section heading in the other document that contains the same or a synonymous keyword. This method assumes that when a keyword appears in a heading, the content under that heading is likely to explain the keyword in detail. As a result, a hyperlink is automatically generated from a keyword in a document to a more detailed explanation. Hereinafter, this method is referred to as the second prior art.
[0007]
[Problems to be solved by the invention]
However, each of the conventional techniques has the following problems.
In the first prior art, the target of association is a keyword in a document and the entire other document. For this reason, when there is a large amount of description in another document associated with the document, it is difficult to find a related description in the document even if a detailed description for the associated keyword is described in the document.
[0008]
In the second prior art, when more than one other document contains a keyword identical or synonymous to a keyword in a given document, the candidates are narrowed down to a single one according to a predetermined strategy. The information the user actually wants may therefore be left out of the association targets. This omission can be avoided by, for example, associating all candidates when several exist. In that case, however, the user must go to the trouble of browsing the associated descriptions one by one to find the necessary information.
[0009]
Furthermore, in any of the above two conventional techniques, it is necessary to perform morphological analysis on the entire document in order to automatically extract keywords to be associated. In order to perform morphological analysis with high accuracy, it is necessary to perform fairly complicated processing. For this reason, there is a problem that it takes a very long time to automatically create hyperlinks between a large number of documents using the conventional technique.
[0010]
The present invention has been made in view of these points, and its object is to provide a document association apparatus that can rapidly perform the process of associating a keyword in a document with the minimum related description in another document.
[0011]
A second object of the present invention is to provide a document browsing apparatus for browsing a document in a document group in which a keyword in a document is associated with a minimum related description in another document.
[0012]
A third object of the present invention is to provide a computer-readable recording medium recording a document association program that can cause a computer to rapidly perform the process of associating a keyword in a document with the minimum related description in another document.
[0013]
A fourth object of the present invention is to provide a computer-readable recording medium recording a document browsing program for using a computer to browse documents in a document group in which keywords in documents are associated with the minimum related descriptions in other documents.
[0014]
[Means for Solving the Problems]
To solve the above problems, the present invention provides a document association apparatus for associating documents with one another, comprising: document storage means for storing a plurality of documents, each composed of a plurality of elements with attributes set and each assigned a document identifier; identifier setting means for sequentially selecting the documents stored in the document storage means as association target documents and setting, in the tag of each element in the selected association target document, an element identifier that uniquely identifies the element; keyword extraction means for extracting keywords from the content of processing target elements having a specific attribute in the association target document in which the element identifiers have been set by the identifier setting means; document content search means for searching the document storage means, for each keyword extracted by the keyword extraction means, for documents that contain the keyword, other than the association target document from which the keyword was extracted; and keyword association means for setting, for each document found by the document content search means, in a tag that is associated with the keyword in that document and that can register a plurality of attribute values each indicating an element in another document related to the keyword, a plurality of attribute values each consisting of a pair of the document identifier of the association target document from which the keyword was extracted and the element identifier of the processing target element containing the keyword in that association target document.
[0015]
According to such a document association apparatus, the hierarchical structure association means takes a document stored in the document storage means as the association target document and sets, in the tag of each element in that document, an element identifier that uniquely identifies the element. The keyword extraction means then extracts keywords from the content of the processing target elements having a specific attribute in the association target document. Next, the document content search means searches the document storage means for documents containing the keywords extracted by the keyword extraction means. Then, for the keyword in each document found by the document content search means, the keyword association means sets, in the tag capable of registering a plurality of attribute values, a plurality of attribute values each consisting of a pair of the document identifier of the association target document from which the keyword was extracted and the element identifier of the processing target element containing the keyword in that association target document.
[0016]
To solve the above problems, there is also provided a document browsing apparatus for browsing the content of structured documents, comprising: document storage means for storing a plurality of documents, each composed of a plurality of elements with attributes set and each assigned a document identifier; identifier setting means for sequentially selecting the documents stored in the document storage means as association target documents and setting, in the tag of each element in the selected association target document, an element identifier that uniquely identifies the element; keyword extraction means for extracting keywords from the content of processing target elements having a specific attribute in the association target document in which the element identifiers have been set by the identifier setting means; document content search means for searching the document storage means, for each keyword extracted by the keyword extraction means, for documents that contain the keyword, other than the association target document from which the keyword was extracted; keyword association means for setting, for each document found by the document content search means, in a tag that is associated with the keyword in that document and that can register a plurality of attribute values each indicating an element in another document related to the keyword, a plurality of attribute values each consisting of a pair of the document identifier of the association target document from which the keyword was extracted and the element identifier of the processing target element containing the keyword in that association target document; document display means for extracting from the document storage means, in response to a document browsing request, the document specified in the request and displaying it; element extraction means for extracting, when a keyword having a tag in which one or more attribute values indicating elements in other documents are registered is selected in the document displayed by the document display means, the processing target element in the association target document related to the keyword, based on the attribute values set in the tag of the keyword; and content display means for displaying the content of the processing target element extracted by the element extraction means.
[0017]
According to such a document browsing apparatus, the hierarchical structure association means takes a document stored in the document storage means as the association target document and sets, in the tag of each element in that document, an element identifier that uniquely identifies the element. The keyword extraction means then extracts keywords from the content of the processing target elements having a specific attribute in the association target document. Next, the document content search means searches the document storage means for documents containing the keywords extracted by the keyword extraction means. Then, for the keyword in each document found by the document content search means, the keyword association means sets, in the tag capable of registering a plurality of attribute values, a plurality of attribute values each consisting of a pair of the document identifier of the association target document from which the keyword was extracted and the element identifier of the processing target element containing the keyword in that association target document. Furthermore, when a document browsing request is input, the document display means extracts the document corresponding to the request from the document storage means. When a keyword associated by the keyword association means is selected in the document extracted by the document display means, the element extraction means extracts the one or more related elements, in the association target documents, that are associated with that keyword.
[0018]
In addition, the content display means displays the content of the related elements extracted by the element extraction means.
To solve the above problems, there is also provided a computer-readable recording medium recording a document association program for associating documents with one another, the program causing a computer to function as: document storage means for storing a plurality of documents, each composed of a plurality of elements with attributes set and each assigned a document identifier; identifier setting means for sequentially selecting the documents stored in the document storage means as association target documents and setting, in the tag of each element in the selected association target document, an element identifier that uniquely identifies the element; keyword extraction means for extracting keywords from the content of processing target elements having a specific attribute in the association target document in which the element identifiers have been set by the identifier setting means; document content search means for searching the document storage means, for each keyword extracted by the keyword extraction means, for documents that contain the keyword, other than the association target document from which the keyword was extracted; and keyword association means for setting, for each document found by the document content search means, in a tag that is associated with the keyword in that document and that can register a plurality of attribute values each indicating an element in another document related to the keyword, a plurality of attribute values each consisting of a pair of the document identifier of the association target document from which the keyword was extracted and the element identifier of the processing target element containing the keyword in that association target document.
[0019]
When a computer executes the document association program recorded on this recording medium, the functions of the document association apparatus according to the present invention are realized on that computer.
[0020]
To solve the above problems, there is also provided a computer-readable recording medium recording a document browsing program for browsing the content of structured documents, the program causing a computer to function as: document storage means for storing a plurality of documents, each composed of a plurality of elements with attributes set and each assigned a document identifier; identifier setting means for sequentially selecting the documents stored in the document storage means as association target documents and setting, in the tag of each element in the selected association target document, an element identifier that uniquely identifies the element; keyword extraction means for extracting keywords from the content of processing target elements having a specific attribute in the association target document in which the element identifiers have been set by the identifier setting means; document content search means for searching the document storage means, for each keyword extracted by the keyword extraction means, for documents that contain the keyword, other than the association target document from which the keyword was extracted; keyword association means for setting, for each document found by the document content search means, in a tag that is associated with the keyword in that document and that can register a plurality of attribute values each indicating an element in another document related to the keyword, a plurality of attribute values each consisting of a pair of the document identifier of the association target document from which the keyword was extracted and the element identifier of the processing target element containing the keyword in that association target document; document display means for extracting from the document storage means, in response to a document browsing request, the document specified in the request and displaying it; element extraction means for extracting, when a keyword having a tag in which two or more attribute values indicating elements in other documents are registered is selected by operation input in the document displayed by the document display means, the processing target element in the association target document related to the keyword, based on the attribute values set in the tag of the keyword; and content display means for displaying the content of the processing target element extracted by the element extraction means.
[0021]
When a computer executes the document browsing program recorded on this recording medium, the functions of the document browsing apparatus according to the present invention are realized on that computer.
[0022]
[Detailed Description of the Invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a diagram showing the principle of the present invention. The document association apparatus of the present invention comprises the following elements.
[0023]
The document storage means 1 stores a group of documents having a hierarchical logical structure. An example of such a structured document is a document created in accordance with the SGML rules.
The hierarchical structure association means 2 reads an association target document 2a from the document storage means 1 and associates the upper structure and lower structure of each element constituting that document. For example, it gives each element an identifier and also gives each element information on the identifiers of the elements that form its substructure. The association target document 2a, once these associations have been made, is returned to the document storage means 1.
[0024]
The keyword extraction means 3 extracts keywords from the content of the processing target elements having a specific attribute in the association target document 2a. For example, elements whose attribute is title and elements whose attribute is heading are treated as processing target elements. The keyword extraction means 3 internally generates a keyword correspondence table 3a that associates the identifier of each element from which keywords were extracted with the set of keywords extracted from that element. The keyword correspondence table 3a for the association target document 2a is then passed to the document content search means 4.
[0025]
The document content search means 4 searches the content of the other documents stored in the document storage means 1 based on the keywords extracted by the keyword extraction means 3. Each document 4a that is found is passed to the keyword association means 5.
[0026]
The keyword association means 5 associates the keyword in the content of the document 4a found by the document content search means 4 with the processing target element of the association target document 2a from which the keyword was extracted. The document 5a, in which the association with that specific element of the association target document 2a has been made, is stored in the document storage means 1.
[0027]
According to such a document association apparatus, the association target document 2a read into the hierarchical structure association means 2 has the upper structure and lower structure of each of its elements associated and is returned to the document storage means 1. At this time, the keyword extraction means 3 extracts keywords from the content of each element. The document content search means 4 then searches the documents in the document storage means 1 based on the extracted keywords. Each document 4a that is found is passed to the keyword association means 5, where the keyword in the content of the document 4a is associated with the processing target element of the association target document 2a from which the keyword was extracted. The document 5a, now associated with the processing target element, is returned to the document storage means 1.
[0028]
If this processing is executed with every document stored in the document storage means 1 taken in turn as the association target document 2a, a keyword in one document becomes associated with a specific element (title or heading) in another document and, through that element, with the element's substructure. Therefore, when browsing a document in the document storage means 1, the user can refer from a keyword in the document to the minimum necessary related content in another document.
[0029]
Moreover, because keyword extraction at association time is performed only on specific elements such as titles and headings, the complicated processing that keyword extraction requires, such as morphological analysis, need not be applied to the entire document. As a result, the association processing becomes more efficient.
[0030]
Next, as a first embodiment, a document browsing apparatus will be described that associates documents by means of the document association apparatus of the present invention and allows those documents to be browsed.
[0031]
FIG. 2 is a diagram showing the configuration of a document browsing apparatus to which the present invention is applied. This document browsing apparatus comprises a document storage unit 11, a hierarchical structure associating unit 12, a keyword extraction unit 13, a document content search unit 14, a keyword association unit 15, a document extraction unit 16, a headline extraction unit 17, a headline selection unit 18, a content extraction unit 19, a display unit 20, and an input unit 21.
[0032]
The document storage unit 11 stores a group of documents having a logical structure such as titles, chapter headings, section headings, and paragraphs.
The hierarchical structure associating unit 12 reads a document stored in the document storage unit 11 and associates its title, its heading hierarchy (chapter headings, section headings, and so on), and the content corresponding to each heading (for example, the paragraphs in a given section).
[0033]
The keyword extraction unit 13 extracts keywords from the title and headings associated by the hierarchical structure associating unit 12.
Using the keywords extracted by the keyword extraction unit 13, the document content search unit 14 searches the document group stored in the document storage unit 11 for documents whose content contains a given keyword.
[0034]
The keyword association unit 15 associates the keyword in each document found by the document content search unit 14 with the title or heading from which the keyword was extracted.
The document extraction unit 16 extracts a document from the document group stored in the document storage unit 11 in response to a request that the user inputs through the input unit 21, and displays the document on the display unit 20.
[0035]
When the user uses the input unit 21 to designate a keyword in the document extracted by the document extraction unit 16 and displayed on the display unit 20, the headline extraction unit 17 extracts from the document storage unit 11 the title or heading associated with the designated keyword and displays it on the display unit 20. It also extracts from the document storage unit 11 the headings subordinate to the extracted title or heading and displays them on the display unit 20.
[0036]
When the headline extraction unit 17 has extracted a plurality of titles or headings, the headline selection unit 18 selects one of them in response to a request that the user inputs through the input unit 21. Likewise, when the selected title or heading has a plurality of subordinate headings, it selects one of those.
[0037]
When the title or heading extracted by the headline extraction unit 17, or one of the successively extracted subordinate headings, is associated with corresponding content, the content extraction unit 19 extracts that content from the document storage unit 11 and displays it on the display unit 20.
[0038]
The display unit 20 displays on its screen the document extracted by the document extraction unit 16, the titles or headings of other documents extracted by the headline extraction unit 17, and the content of other documents extracted by the content extraction unit 19.
[0039]
The input unit 21 is used to designate the document to be extracted by the document extraction unit 16, to select a keyword in the document extracted by the document extraction unit 16, and to make a selection when the headline extraction unit 17 has extracted a plurality of titles or headings.
[0040]
Next, the procedure by which the document browsing apparatus configured as above associates the documents in the document group stored in the document storage unit 11 will be described.
FIG. 3 is a flowchart showing the procedure for associating documents. The processing is described below in order of step number.
[S1] The hierarchical structure associating unit 12 reads one unprocessed document from the document storage unit 11.
[S2] The hierarchical structure associating unit 12 analyzes the structure of the read document.
[S3] The hierarchical structure associating unit 12 associates the title, headings, and content with one another.
[S4] The keyword extraction unit 13 extracts keywords from the content of the title and headings.
[S5] The document content search unit 14 searches the document storage unit 11 for documents containing the keywords extracted by the keyword extraction unit 13.
[S6] The keyword association unit 15 associates the title or heading from which each keyword was extracted with the portion matching that keyword in each document found by the document content search unit 14.
[S7] The keyword association unit 15 stores the documents for which keyword association has been completed in the document storage unit 11.
[S8] The hierarchical structure associating unit 12 determines whether all the documents stored in the document storage unit 11 have been processed. If all the documents have been processed, the processing ends; if not, the process returns to step S1 to process an unprocessed document.
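The loop of steps S1 to S8 can be sketched end to end as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: documents are held as plain Python dictionaries rather than SGML, a simple word split with a hypothetical stop list stands in for morphological analysis, and the names `extract_keywords`, `associate_all`, and `store` are invented for the example.

```python
import re

# Hypothetical stop list; the embodiment registers stop words in advance.
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in"}

def extract_keywords(text):
    """Stand-in for morphological analysis: split into words, drop stop words."""
    return {w for w in re.findall(r"\w+", text) if w.lower() not in STOP_WORDS}

def associate_all(store):
    """store maps a document id to {"heads": {elem_id: text}, "paras": {elem_id: text}}.

    Returns {(doc_id, para_id, keyword): ["srcdoc.elem", ...]}, that is, for each
    keyword occurrence in a body paragraph, the titles or headings it links to.
    """
    links = {}
    for src_id, src in store.items():                  # S1 / S8: every document in turn
        for elem_id, text in src["heads"].items():     # S2-S3 assumed already done
            for kw in extract_keywords(text):          # S4: titles and headings only
                for dst_id, dst in store.items():      # S5: search the other documents
                    if dst_id == src_id:
                        continue
                    for para_id, body in dst["paras"].items():
                        if kw in body:                 # S6: record "document.element"
                            links.setdefault((dst_id, para_id, kw), []).append(
                                f"{src_id}.{elem_id}")
    return links

store = {
    "d1": {"heads": {"t1": "Introduction to SGML"},
           "paras": {"p1": "Markup describes structure."}},
    "d2": {"heads": {"t1": "Document conversion"},
           "paras": {"p1": "We convert to SGML."}},
}
links = associate_all(store)
print(links)
```

Note that only the short title and heading strings pass through `extract_keywords`; the paragraph bodies are merely searched, which mirrors the efficiency argument made for the apparatus.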
[0041]
This processing creates a link from each keyword contained in the content of a document to the corresponding title or heading of another document that contains that keyword in its title or heading.
[0042]
The processing is described in detail below using specific examples. In the following examples, a representation based on the international standard SGML (Standard Generalized Markup Language: ISO 8879) is used as an example of a document having a logical structure such as titles and headings. Any scheme other than SGML may be used as long as it can express the content to be processed.
[0043]
First, the hierarchical structure associating unit 12 reads one document stored in the document storage unit 11 (step S1). Here, it is assumed that the following document is read. FIG. 4 is a diagram illustrating a first example of a document that contains, in a heading, a keyword to be associated. This document 31 is created in accordance with the following structure definition.
[0044]
Each element in the document is enclosed by tags indicating its start and end. For an element A, the start tag is written <A> and the end tag </A>. The document as a whole is enclosed by the tag <doc>, indicating the start of the document, and the tag </doc>, indicating its end. The document element (doc) contains an element indicating the title (title) and a sequence of elements indicating chapters (sect1). A chapter element (sect1) contains either a heading element (head) and a sequence of paragraph elements (para), or a heading element (head) and a sequence of section elements (sect2). A section element (sect2) contains a heading element (head) and a sequence of paragraph elements (para). The title element (title), heading element (head), and paragraph element (para) have text (character strings) as their content.
[0045]
In the documents used as examples in the present embodiment, doc, title, sect1, sect2, head, and para are used as element names. However, any names may be used as long as the title, headings, and body text can be identified in the document. The chapter and section structure may also be nested more deeply. For example, a section element (sect2) may contain a lower-level section element (sect3).
[0046]
Having read the document 31, the hierarchical structure associating unit 12 analyzes its document structure, such as the title, headings, and paragraphs, and assigns a unique identifier to each element in the document (step S2).
[0047]
FIG. 5 is a diagram showing the document with a unique identifier assigned to each element. In this figure, each identifier is assigned as the value of the attribute “id”. In this document 32, the identifier “d1” is assigned to the document element (doc). The identifier of the document element serves as the identifier of the document 32 itself. Therefore, a symbol that is unique among the documents stored in the document storage unit 11 is used as the identifier of the document element.
[0048]
Elements other than the document element need only be uniquely identifiable within the document 32. Here, the identifier “t1” is assigned to the title element (title), the identifiers “s1”, “s2”, and “s3” to the chapter elements (sect1), the identifiers “h1”, “h2”, and “h3” to the heading elements (head), and the identifiers “p1”, “p2”, “p3”, and “p4” to the paragraph elements (para).
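The identifier assignment of step S2 might look like the following sketch. It assumes an XML-compatible rendering of the document so that Python's standard `xml.etree.ElementTree` parser can be used in place of an SGML parser; the sample text, the `PREFIX` table (including the "ss" prefix for sect2), and the function name `assign_ids` are hypothetical and are not taken from FIG. 4 or FIG. 5.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed prefix per element name; the text only shows t, s, h and p in use.
PREFIX = {"title": "t", "sect1": "s", "sect2": "ss", "head": "h", "para": "p"}

def assign_ids(root, doc_id):
    """Give the document element doc_id and number every other element in document order."""
    if "." in doc_id:
        raise ValueError("'.' is reserved as the separator in link references")
    root.set("id", doc_id)
    counters = defaultdict(int)
    for elem in root.iter():            # depth-first, document order
        if elem is root:
            continue
        prefix = PREFIX.get(elem.tag)
        if prefix is None:              # unknown element names are left untouched
            continue
        counters[prefix] += 1
        elem.set("id", f"{prefix}{counters[prefix]}")
    return root

# Hypothetical sample document following the structure definition above.
SAMPLE = """<doc><title>Introduction to SGML</title>
<sect1><head>What is SGML</head><para>First paragraph.</para></sect1>
<sect1><head>Elements and tags</head><para>Second paragraph.</para><para>Third paragraph.</para></sect1>
</doc>"""

root = assign_ids(ET.fromstring(SAMPLE), "d1")
ids = [e.get("id") for e in root.iter()]
print(ids)
```

Element identifiers here are unique only within one document, which is sufficient because the document identifier is prepended whenever an element is referenced from another document.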
[0049]
Next, the hierarchical structure associating unit 12 associates the title of the document 32 with its headings, and each heading with its subordinate headings, if any, and with the paragraphs corresponding to it (step S3). In the present embodiment, the association between the title of a document and its headings is expressed by setting the sequence of heading identifiers as an attribute of the title element (title). Likewise, the association from a heading to its subordinate headings, or from a heading to its corresponding content, is expressed by setting the identifiers of the subordinate heading elements, or of the paragraph elements (para) that form the content, as an attribute of the heading element (head).
[0050]
FIG. 6 is a diagram illustrating an example of a document in which the title, headings, and content are associated with one another. In this document 33, the title element and heading elements of the document 32 shown in FIG. 5 are each given, as the value of the attribute “ref”, a list of the identifiers of the heading elements or paragraph elements to be associated. In this example, the identifiers in the list are separated by space characters. For example, since there are three heading elements (head) below the title element (title), the value of the attribute “ref” of the title element (title) is “h1 h2 h3”.
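The “ref” attributes of step S3 could be filled in roughly as shown below. This is a sketch under assumptions: the input is an XML-compatible document whose elements already carry “id” attributes, every chapter and section has a heading element, and the sample text and the function name `link_hierarchy` are hypothetical.

```python
import xml.etree.ElementTree as ET

def link_hierarchy(root):
    """Set ref on the title (chapter headings) and on each heading (sub-headings or paragraphs)."""
    title = root.find("title")
    chapter_heads = [s.find("head") for s in root.findall("sect1")]
    title.set("ref", " ".join(h.get("id") for h in chapter_heads))
    for section in root.iter():
        if section.tag not in ("sect1", "sect2"):
            continue
        head = section.find("head")
        # A heading refers to the headings of its sub-sections, or else to its paragraphs.
        targets = [c.find("head") if c.tag == "sect2" else c
                   for c in section if c.tag in ("para", "sect2")]
        head.set("ref", " ".join(t.get("id") for t in targets))
    return root

# Hypothetical sample with identifiers already assigned.
SAMPLE = """<doc id="d1"><title id="t1">Introduction to SGML</title>
<sect1 id="s1"><head id="h1">What is SGML</head>
<para id="p1">First paragraph.</para><para id="p2">Second paragraph.</para></sect1>
<sect1 id="s2"><head id="h2">Elements</head>
<sect2 id="ss1"><head id="h3">Tags</head><para id="p3">Third paragraph.</para></sect2></sect1>
</doc>"""

root = link_hierarchy(ET.fromstring(SAMPLE))
refs = {e.get("id"): e.get("ref") for e in root.iter() if e.get("ref") is not None}
print(refs)
```

Storing the links as space-separated identifier lists keeps the hierarchy inside the document itself, so no separate index is needed to walk from a title down to the paragraphs.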
[0051]
Next, the keyword extraction unit 13 extracts keywords from the title and headings associated by the hierarchical structure associating unit 12 (step S4). A conventional technique such as morphological analysis may be used for keyword extraction. In this embodiment, words judged to be nouns in the result of morphological analysis are used as keywords. Words unlikely to serve as keywords, such as words written in hiragana, are registered in advance as stop words and excluded from keyword extraction. The keyword extraction unit 13 creates a keyword correspondence table indicating the correspondence between each element and the keywords it contains, and holds the table temporarily.
[0052]
FIG. 7 is a diagram illustrating an example of the keyword correspondence table. This keyword correspondence table 41 shows the correspondence between the title element (title) and heading elements (head) of the document 33 shown in FIG. 6 and the keywords extracted from them. The keyword correspondence table 41 has the items “element type”, “identifier”, and “keyword”. The “element type” item holds the type of the element from which the keywords were extracted; in this example it is either “Title” or “Heading”. The “identifier” item holds the identifier of the element from which the keywords were extracted. The “keyword” item holds the set of keywords contained in that element.
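A keyword correspondence table with these three items can be sketched as follows. The embodiment selects nouns by morphological analysis; this sketch substitutes a plain word split with a hypothetical English stop list, so it illustrates only the table layout, not the extraction technique. The sample document and the names `STOP_WORDS`, `ELEMENT_TYPE`, and `build_keyword_table` are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

STOP_WORDS = {"a", "an", "the", "of", "to", "and", "in"}   # hypothetical stop list
ELEMENT_TYPE = {"title": "Title", "head": "Heading"}        # processing target elements

def build_keyword_table(root):
    """Return rows of (element type, identifier, keywords) for titles and headings only."""
    table = []
    for elem in root.iter():
        if elem.tag not in ELEMENT_TYPE:
            continue                     # paragraphs are never analyzed
        words = re.findall(r"[A-Za-z0-9]+", elem.text or "")
        keywords = sorted({w for w in words if w.lower() not in STOP_WORDS})
        table.append((ELEMENT_TYPE[elem.tag], elem.get("id"), keywords))
    return table

SAMPLE = """<doc id="d1"><title id="t1">Introduction to SGML</title>
<sect1 id="s1"><head id="h1">Structure of a document</head>
<para id="p1">This paragraph is not analyzed.</para></sect1></doc>"""

table = build_keyword_table(ET.fromstring(SAMPLE))
for row in table:
    print(row)
```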
[0053]
Thus, since the morphological analysis process is performed only on the title element and the heading element in the document, it is not necessary to perform the morphological analysis process on the entire document. In general, the amount of text included in the title and heading of a document is very small compared to the amount of text in the entire document, so that the processing cost of morphological analysis can be greatly reduced.
[0054]
Next, the document content search unit 14 searches the content of the other documents stored in the document storage unit 11 using the keywords extracted by the keyword extraction unit 13 (step S5). For example, when the documents in the document storage unit 11 are searched using the keyword “SGML” extracted from the title element (title), the following document is found.
[0055]
FIG. 8 is a diagram illustrating an example of a document that contains, in its body text, a keyword to be associated. This document 51 is found because “SGML” in the text “... convert to SGML.” contained in the content of a paragraph element (para) matches the keyword. The document 51 is created according to the same structure definition as the document 31 shown in FIG. 4.
[0056]
When the document 51 shown in FIG. 8 is found, the keyword association unit 15 associates the portion of the content of the document 51 that matches the keyword with the title or heading containing the keyword (step S6). Specifically, “SGML” in the text “... convert to SGML ....” is tagged as a reference source element, and the identifier of the title element (title) of the document 33 shown in FIG. 6 is set as an attribute of that reference source element.
[0057]
FIG. 9 is a diagram illustrating an example of a document in which a keyword has been associated with a title. In this document 52, the keyword “SGML” is enclosed by the start tag and end tag of an element (link) indicating the association, and the association with the title “t1” of the document “d1” is set as the value of the attribute “ref” of the link element. Here, the value of the attribute “ref” joins the identifier “d1” of the document element and the identifier “t1” of the title element with “.”. This prevents the association target from becoming ambiguous when the identifier “t1” happens to be used for an element of another document as well.
[0058]
In this embodiment, “.” is used to join the identifier of the document element to that of the title element or heading element. Therefore, when identifiers are assigned to elements, “.” must not be included in the identifiers themselves.
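The tagging of step S6 can be illustrated with the sketch below, which wraps the first occurrence of a keyword in a link element and records the target as “document.element”. Several points are assumptions rather than details taken from the embodiment: the paragraph is parsed as XML, only the text before the first child element is searched, multiple targets for one keyword are kept as a space-separated list in “ref” (by analogy with the lists used for titles and headings), and the sample sentence, the second target “d3.h2”, and the function name `add_link` are hypothetical.

```python
import xml.etree.ElementTree as ET

def add_link(para, keyword, doc_id, elem_id):
    """Wrap keyword in <link ref="doc.elem">, or add a target to an existing link."""
    if "." in doc_id or "." in elem_id:
        raise ValueError("identifiers must not contain the separator '.'")
    target = f"{doc_id}.{elem_id}"
    for link in para.findall("link"):
        if link.text == keyword:                 # keyword already tagged: add a value
            refs = link.get("ref", "").split()
            if target not in refs:
                refs.append(target)
            link.set("ref", " ".join(refs))
            return True
    text = para.text or ""                       # simplification: leading text only
    pos = text.find(keyword)
    if pos < 0:
        return False
    link = ET.Element("link", {"ref": target})
    link.text = keyword
    link.tail = text[pos + len(keyword):]        # text following the keyword
    para.text = text[:pos]                       # text preceding the keyword
    para.insert(0, link)
    return True

para = ET.fromstring('<para id="p1">We convert the manuscript to SGML before printing.</para>')
add_link(para, "SGML", "d1", "t1")
first = ET.tostring(para, encoding="unicode")
add_link(para, "SGML", "d3", "h2")
second = ET.tostring(para, encoding="unicode")
print(first)
print(second)
```

Rejecting identifiers that contain “.” keeps every reference splittable into exactly one document part and one element part.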
[0059]
In the present embodiment, the identifier of the document element (doc) is assigned so that it uniquely identifies the document among those stored in the document storage unit 11, and so this identifier is used to identify the document. Alternatively, an identifier that identifies the document may be assigned to the document as a whole and used as the identifier for association. As such an identifier, a file name can be used when the document is a file, and a URL (Uniform Resource Locator) can be used when the document is published on the WWW (World Wide Web).
[0060]
When other document contents are searched for all the keywords extracted in step S4 and the keyword association is completed in step S6, the associated document is stored in the document storage unit 11 (step S7). At this time, the contents of the original document to be associated are overwritten.
[0061]
Then, it is checked whether the processing of steps S1 to S7 has been performed for all the documents stored in the document storage unit 11 (step S8). If any document has not yet been processed, the process returns to step S1 and that document is processed. If all the documents have been processed, the inter-document association processing ends.
[0062]
By performing the above processing, the hierarchical structure is also associated within the document 52 shown in FIG. 9.
FIG. 10 is a diagram showing the result of associating the hierarchical structure in the document of FIG. 9. This document 53 is given “d2” as the identifier of its document element (doc).
[0063]
Next, a procedure will be described for referring, from a keyword in a document, to the explanatory description of that keyword by using the associations made by the document association apparatus according to the present invention.
[0064]
FIG. 11 is a flowchart showing the procedure for using the associations. The flowchart is described briefly below in order of step number.
[S11] When the user inputs a document display request using the input unit 21, the document extraction unit 16 extracts the corresponding document from the document storage unit 11. The contents of the extracted document are displayed on the screen of the display unit 20.
[S12] The user selects a keyword using the input unit 21.
[S13] The headline extraction unit 17 refers to the association information of the keyword selected in step S12, that is, the identifier in the attribute “ref” of the link element, and extracts the title or heading of the document having the corresponding identifier from the document storage unit 11. Alternatively, it extracts from the document storage unit 11 the headings below the title or heading selected by the headline selection unit 18 in steps S14 and S15, which are described later. The extracted title or heading is then displayed on the display unit 20.
[S14] The headline selection unit 18 determines whether a plurality of titles or headings were extracted by the headline extraction unit 17. If there are a plurality, the process proceeds to step S15. If there is only one, that title or heading is selected and the process proceeds to step S16.
[S15] When a plurality of titles or headings have been extracted by the headline extraction unit 17, the headline selection unit 18 selects one of them in response to a request input by the user via the input unit 21.
[S16] The headline selection unit 18 determines whether a lower-level heading exists for the selected title or heading. In this embodiment, the elements whose identifiers are set as the value of the attribute “ref” of the title element (title) or heading element (head) extracted in step S13 are identified, and it is determined whether each such element is a heading element (head). If there is a lower-level heading, the process returns to step S13; if not, the process proceeds to step S17.
[S17] The content extraction unit 19 extracts the elements corresponding to the content associated with the selected title or heading element and displays them on the screen of the display unit 20.
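The loop formed by steps S13 to S17 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the apparatus's actual implementation: the `Element` record, the `drill_down` function, the `choose` callback standing in for the user's selection in step S15, the kind labels, the placeholder paragraph texts, and the `ref` values of h1 and h2 are all hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple


class Element(NamedTuple):
    kind: str        # "title", "head", or "p" (hypothetical kind labels)
    text: str        # content of the element
    ref: List[str]   # identifiers listed in the element's "ref" attribute


def drill_down(elements: Dict[str, Element],
               start_ids: List[str],
               choose: Callable[[List[str]], str]) -> List[str]:
    """Follow "ref" attributes from titles/headings down to content."""
    current = list(start_ids)              # S13: titles/headings to display
    while True:
        # S14/S15: a single candidate is taken as-is; otherwise the user chooses.
        selected = current[0] if len(current) == 1 else choose(current)
        children = elements[selected].ref
        # S16: are any of the associated elements lower-level headings?
        lower = [c for c in children if elements[c].kind == "head"]
        if not lower:
            # S17: no lower headings remain, so return the associated content.
            return [elements[c].text for c in children]
        current = lower                    # back to S13 with the lower headings


# Document "d1" of the worked example; paragraph texts are placeholders.
d1 = {
    "t1": Element("title", "Electronic publication by SGML", ["h1", "h2", "h3"]),
    "h1": Element("head", "Introduction", ["p1"]),
    "h2": Element("head", "History of electronic publication", ["p2"]),
    "h3": Element("head", "Related tools", ["p3", "p4"]),
    "p1": Element("p", "(text of p1)", []),
    "p2": Element("p", "(text of p2)", []),
    "p3": Element("p", "(text of p3)", []),
    "p4": Element("p", "(text of p4)", []),
}

# Simulate a user who picks the last heading offered ("Related tools").
content = drill_down(d1, ["t1"], lambda ids: ids[-1])
```

Because only the selected heading is expanded at each pass, the sketch touches one branch of the hierarchy, which mirrors the stated aim of referring only to the minimum necessary content.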
[0065]
In the following, processing relating to the use of association will be described using a specific example.
First, it is assumed that the user requests display of the document 53 shown in FIG. 10. Then, the content of the document 53 is displayed on the screen of the display unit 20.
[0066]
FIG. 12 is a diagram illustrating an example of a display screen when the contents of a document are displayed. On the display screen 61, titles, headings, paragraphs, associated keywords, and the like are identified by the tags in the document, and an appropriate layout is determined for each before the screen is displayed. For example, titles are displayed centered in a larger font, headings are displayed in a larger font size, and keywords associated with headings of other documents are underlined for emphasis.
[0067]
Next, it is assumed that the user refers to the document displayed on the display unit 20 and selects the display location of the associated keyword “SGML” by clicking with the mouse (step S12). Then, the headline extraction unit 17 refers to the association information of the selected keyword “SGML”, that is, the identifier in the attribute “ref” of the link element, extracts from the document storage unit 11 the corresponding title “t1” in the document 33 having the identifier “d1”, and displays it on the display unit 20 (step S13).
[0068]
FIG. 13 is a diagram illustrating an example of a display screen when a heading is displayed. The keyword “SGML” is tagged with the link element indicating the association by the above-described association processing, and “d1.t1” is set as the value of the attribute “ref”. Therefore, the title element (identifier “t1”) of the document 33 shown in FIG. 6 is extracted by the headline extraction unit 17, and a display screen 62 including the content of the title element, “electronic publication by SGML”, is displayed on the display unit 20.
[0069]
At this time, the headline selection unit 18 determines whether a plurality of titles or headings were extracted (step S14). In this example, only one title is extracted. Therefore, the headline selection unit 18 determines whether there is a lower-level heading associated with the extracted title (step S16). In this example, three elements “h1 h2 h3” are associated as the value of the attribute “ref” of the title element having the identifier “t1”, and all are heading elements. Accordingly, the process returns to step S13 to extract the headings.
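The two forms of “ref” value seen in this example, a document-qualified identifier such as “d1.t1” and a space-separated list of identifiers within one document such as “h1 h2 h3”, could be resolved along the following lines. This is a sketch under assumptions: the function name `parse_ref` is hypothetical, and treating an unqualified identifier as belonging to the current document is inferred from the example rather than stated in the text.

```python
from typing import List, Tuple


def parse_ref(value: str, current_doc: str) -> List[Tuple[str, str]]:
    """Split a "ref" attribute value into (document id, element id) pairs."""
    targets = []
    for token in value.split():
        doc_id, dot, elem_id = token.partition(".")
        if dot:
            targets.append((doc_id, elem_id))      # qualified, e.g. "d1.t1"
        else:
            targets.append((current_doc, token))   # unqualified, e.g. "h1"
    return targets
```

With this helper, the link element of the keyword resolves to the title of document “d1”, and the title's own “ref” resolves to the three headings of the same document.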
[0070]
FIG. 14 is a diagram illustrating an example of a display screen when lower-level headings are displayed. It shows an example of the display screen 63 that appears on the display unit 20, following the display screen 62 shown in FIG. 13, when the lower-level headings associated with the title element whose content is “electronic publication by SGML” are displayed. That is, in the document 33 shown in FIG. 6, the contents of the three heading elements (identifiers h1, h2, and h3) set as the value of the attribute “ref” of the title element having the identifier “t1”, namely “Introduction”, “History of electronic publication”, and “Related tools”, are extracted and displayed on the screen of the display unit 20.
[0071]
Here, the headline selection unit 18 again determines whether a plurality of headings were extracted (step S14). Since three headings were extracted, the user selects one of the titles or headings displayed on the display unit 20 using the input unit 21 (step S15). In this example, it is assumed that “Related tools” is selected with a mouse or the like from the three headings displayed in FIG. 14.
[0072]
Then, the headline selection unit 18 determines whether there is a lower-level heading associated with the selected heading “Related tools” (step S16). In the document 33 shown in FIG. 6, none of the elements with the identifiers p3, p4, ... set as the value of the attribute “ref” of the heading element (identifier “h3”) whose content is “Related tools” is a heading. Therefore, the content extraction unit 19 extracts the content (step S17).
[0073]
FIG. 15 is a diagram illustrating an example of a display screen when content is displayed. It shows an example of the display screen 64 that appears on the display unit 20, following the display screen 63 shown in FIG. 14, when the content associated with the heading element whose content is “Related tools” is displayed. That is, in the document 33 shown in FIG. 6, the contents of the paragraph elements (identifiers p3, p4, ...) set as the value of the attribute “ref” of the heading element having the identifier “h3” are extracted and displayed on the display unit 20.
[0074]
As described above, even when there are a plurality of related content candidates, the user can refer to only the minimum necessary associated content by having the headings displayed and selecting among them. If the user judges from the title or heading displayed on the display unit 20 that the content does not need to be referred to, the processing can be stopped before the content is displayed. Therefore, the user can efficiently find the necessary information without reading all the details of the contents.
[0075]
Next, a second embodiment will be described. The second embodiment is a document browsing apparatus that can extract the associated content more efficiently when a plurality of titles or headings of other documents are associated with a keyword in a certain document. Note that the components of the second embodiment are the same as those of the first embodiment shown in FIG. 2, and therefore the second embodiment will be described using the configuration shown in FIG. 2. In addition, the association process between documents in the second embodiment is the same as that in the first embodiment, and a description thereof will be omitted.
[0076]
Therefore, the association reference process according to the second embodiment will be described below.
FIG. 16 is a flowchart illustrating a flow of association reference processing according to the second embodiment. The following processing will be described along with step numbers.
[S21] When the user designates, via the input unit 21, a document to be extracted from the document group stored in the document storage unit 11, the document extraction unit 16 extracts the designated document and displays it on the display unit 20.
[S22] The user refers to the document displayed on the display unit 20 and selects the display location of an associated keyword via the input unit 21, with a mouse or the like.
[S23] The headline extraction unit 17 refers to the association information of the keyword selected in step S22, that is, the identifier in the attribute “ref” of the link element, and extracts the title or heading of the document having the corresponding identifier from the document storage unit 11.
[S24] The headline extraction unit 17 determines whether one title or heading, or more than one, was extracted in step S23. If a plurality were extracted, the process proceeds to step S25. If there is only one, the process proceeds to step S29.
[S25] If it is determined in step S24 that a plurality of titles or headings were extracted, the headline extraction unit 17 groups the titles or headings by document.
[S26] The headline extraction unit 17 sorts the per-document association groups formed in step S25 according to an importance calculated from the number of associations in the same document and the hierarchy depth of the associated titles or headings.
[S27] The headline extraction unit 17 rearranges the associations within each of the per-document groups formed in step S25 according to an importance calculated from the hierarchy depth of the associated title or heading.
[S28] The user selects one of the titles or headings displayed on the display unit 20 using the input unit 21.
[S29] When one title or heading was extracted in step S23, or when a heading has been selected in step S28, the headline extraction unit 17 determines whether a lower-level heading associated with that title or heading exists. If there is a lower-level heading, the process returns to step S23 to extract it. If there is none, the process proceeds to step S30.
[S30] The content extraction unit 19 extracts the elements corresponding to the content associated with the selected title or heading element and displays them on the screen of the display unit 20.
[0077]
In this way, when a plurality of titles or headings of other documents are associated with a keyword in a certain document content, the associated content can be efficiently extracted. Details of this processing will be described below using a specific example.
[0078]
In the present embodiment, in addition to the documents shown in the first embodiment, the following documents, which include the keyword “SGML” to be associated in their titles or headings, are assumed to be stored in the document storage unit 11.
[0079]
FIG. 17 is a diagram illustrating a second example of a document including a keyword to be associated in the title. In this document 71, the identifier “d3” is assigned to the document element (doc). In addition, the keyword “SGML” is included in the contents of the title element (title) with id=“t1”, the heading element (head) with id=“h2”, and the heading element (head) with id=“h3”.
[0080]
FIG. 18 is a diagram illustrating a third example of a document including a keyword to be associated in the title. In the document 81, the identifier “d4” is assigned to the document element (doc). In addition, the keyword “SGML” is included in the contents of the heading element (head) with id=“h21” and the heading element (head) with id=“h22”.
[0081]
When the association processing is performed on the documents 71 and 81 shown in FIGS. 17 and 18 in addition to the document 31 shown in FIG. 4, the keyword in the document 51 is associated with a plurality of titles or headings.
[0082]
FIG. 19 is a diagram illustrating an example of a document in which a keyword is associated with titles or headings. As shown in this figure, the keyword in the document 54 is associated with the titles or headings of other documents. That is, in FIG. 19, the attribute of the link element that tags the keyword “SGML” associates the keyword with the title “t1” of the document “d1” (content: “electronic publication by SGML”); with the title “t1” (content: “invitation to SGML”), the heading “h2” (content: “SGML and HTML”), and the heading “h3” (content: “SGML and XML”) of the document “d3”; and with the heading “h21” (content: “SGML document search”) and the heading “h22” (content: “SGML database system”) of the document “d4”.
[0083]
Hereinafter, the flow of the association reference process will be described with reference to the flowchart shown in FIG. 16 for the document group associated in this way.
First, when the user designates, via the input unit 21, a document to be extracted from the document group stored in the document storage unit 11, the document extraction unit 16 extracts the designated document and displays it on the display unit 20 (step S21). Here, it is assumed that the document displayed on the display unit 20 is the document 54 shown in FIG. 19. When the document 54 shown in FIG. 19 is displayed on the display unit 20, the attribute values of the link element are not shown on the screen, so the display screen 61 shown in FIG. 12 is displayed, as in the first embodiment.
[0084]
Next, the user refers to the document 54 displayed on the display unit 20 and selects, via the input unit 21, the display location of the associated keyword “SGML”, for example by clicking with the mouse (step S22). The headline extraction unit 17 refers to the association information of the keyword selected in step S22, that is, the identifiers in the attribute “ref” of the link element, and extracts the titles or headings of the documents having the corresponding identifiers from the document storage unit 11 (step S23).
[0085]
Next, the headline extraction unit 17 determines whether one title or heading, or more than one, was extracted in step S23 (step S24). In the example shown in FIG. 19, a total of six titles or headings are extracted, so the process proceeds to step S25.
[0086]
Next, since it is determined in step S24 that a plurality of titles or headings were extracted, the headline extraction unit 17 groups the titles or headings by document (step S25). For the document 54 of FIG. 19, the title “t1” of the document “d1” forms one group; the title “t1”, the heading “h2”, and the heading “h3” of the document “d3” form one group; and the heading “h21” and the heading “h22” of the document “d4” form one group.
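The grouping of step S25 can be illustrated with a short sketch. It assumes, hypothetically, that each association has already been resolved to a (document identifier, element identifier) pair; the function name `group_by_document` is not from the source, and the six pairs are those listed for FIG. 19.

```python
from typing import Dict, List, Tuple


def group_by_document(refs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect element ids per document, keeping first-seen document order."""
    groups: Dict[str, List[str]] = {}
    for doc_id, elem_id in refs:
        groups.setdefault(doc_id, []).append(elem_id)
    return groups


# The six associations of the keyword "SGML" described for FIG. 19.
refs = [("d1", "t1"), ("d3", "t1"), ("d3", "h2"),
        ("d3", "h3"), ("d4", "h21"), ("d4", "h22")]
groups = group_by_document(refs)
```

Keying the groups on the document identifier keeps entries from one document adjacent, which is what allows related descriptions in the same document to be referred to consecutively.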
[0087]
Thus, by grouping the extracted titles or headings for each document, related descriptions in the same document can be continuously referred to.
Next, the headline extraction unit 17 rearranges the per-document groups formed in step S25 according to an importance calculated from the number of associations in the same document and the hierarchy depth of the associated titles or headings (step S26). In the present embodiment, the importance of each document is calculated by the following equation.
[0088]
[Expression 1]
\mathrm{importance} = \sum_{i=1}^{n} 2^{-d_i} \qquad (1)
[0089]
In Expression (1), n is the largest of the numbers assigned, in order from 1, to the titles or headings associated in the document, that is, the number of associated titles or headings. d_i is the depth in the hierarchical structure of the title or heading to which the number i is assigned, with the depth of the title taken as 0. That is, d_i = 0 for the title, d_i = 1 for a first-level heading, d_i = 2 for a second-level heading, and so on. When the importance of each document is calculated according to Expression (1), the document 33 shown in FIG. 6 is associated with only the title “t1”, so importance = 2^-0 = 1. The document 71 shown in FIG. 17 is associated with the title “t1”, the heading “h2”, and the heading “h3”, so importance = 2^-0 + 2^-1 + 2^-1 = 2. The document 81 shown in FIG. 18 is associated with the two headings “h21” and “h22”, so importance = 2^-2 + 2^-2 = 0.5. Therefore, the association groups are rearranged in the order of the document “d3”, the document “d1”, and the document “d4” according to the importance of each document.
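The arithmetic of this worked example can be checked with a few lines of code. The sketch below assumes the importance is the sum of 2^(-d_i) over the associated titles or headings, as the three computed values imply, and uses the depths given in the text for the documents of FIGS. 6, 17, and 18 (identifiers d1, d3, and d4); the names `importance`, `depths_by_doc`, and `ranked` are hypothetical.

```python
from typing import Dict, List


def importance(depths: List[int]) -> float:
    """Sum 2**(-d) over the hierarchy depths of the associated elements."""
    return sum(2.0 ** -d for d in depths)


# Depths of the associated elements: title = 0, first-level heading = 1, ...
depths_by_doc: Dict[str, List[int]] = {
    "d1": [0],         # title t1
    "d3": [0, 1, 1],   # title t1, headings h2 and h3
    "d4": [2, 2],      # second-level headings h21 and h22
}

scores = {doc: importance(d) for doc, d in depths_by_doc.items()}

# Step S26: order the per-document groups from most to least important.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Each extra association adds a positive term, and each additional level of depth halves a term, so a document with many shallow associations ranks first.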
[0090]
Note that the method of calculating the importance of each document is not limited to Expression (1). Any measure may be used in which the importance increases with the number of associated titles or headings and increases as the hierarchy depth of the associated titles or headings becomes shallower. This way of determining importance is effective because the more titles or headings are associated within the same document, the more likely the keyword is related to the subject of the whole document, and the shallower the hierarchy depth of an associated title or heading, the more likely the keyword is explained comprehensively.
[0091]
Next, the headline extraction unit 17 rearranges the associations within each of the per-document groups formed in step S25 according to an importance calculated from the hierarchy depth of the associated title or heading (step S27). In the present embodiment, the importance is assumed to be higher when the hierarchy depth is shallower. When the hierarchy depth is the same, the importance is assumed to be higher for the element that appears earlier in the document. Alternatively, an importance that gives priority to the order of appearance in the document may be used.
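The within-group ordering of step S27 amounts to a two-key sort: shallower depth first, then earlier appearance. The following sketch illustrates this for the group of document “d3”; the `Entry` record, the function name, and the position numbers are assumptions introduced for the example.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    elem_id: str
    depth: int      # 0 for the title, 1 for a first-level heading, ...
    position: int   # order of appearance in the document (hypothetical values)


def sort_within_group(entries: List[Entry]) -> List[Entry]:
    """Shallower elements first; ties are broken by order of appearance."""
    return sorted(entries, key=lambda e: (e.depth, e.position))


# Group for document "d3", deliberately given out of order.
group_d3 = [Entry("h3", 1, 3), Entry("t1", 0, 1), Entry("h2", 1, 2)]
ordered = [e.elem_id for e in sort_within_group(group_d3)]
```

Swapping the tuple to `(e.position, e.depth)` would give the alternative mentioned in the text, in which the order of appearance takes priority.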
[0092]
After the above processing is performed, the extracted titles or headings are displayed on the display unit 20.
FIG. 20 is a diagram illustrating an example of a display screen that displays a plurality of headings. It shows an example of the display screen 101 displayed when the keyword “SGML” is selected on the display screen 61 shown in FIG. 12. The titles or headings displayed in FIG. 20 have been grouped by document through the above processing and are arranged in order of importance.
[0093]
Next, the user selects one of the titles or headings displayed on the display unit 20 using the input unit 21 (step S28). Then, when one title or heading was extracted in step S23, or when a heading has been selected in step S28, the headline extraction unit 17 determines whether a lower-level heading associated with that title or heading exists (step S29).
[0094]
In this way, when there are a plurality of associated titles or headings in the same document, or when there are a plurality of documents having associated titles or headings, it is possible to preferentially refer to the important ones. Even when a large number of document titles and headings are associated with each other, the associated contents can be referred to efficiently.
[0095]
The above processing functions can be realized by a computer. In this case, the processing contents of the functions that the document association apparatus and the document browsing apparatus should have are described in a program recorded on a computer-readable recording medium. By executing this program on a computer, the above processing is realized by the computer. Examples of the computer-readable recording medium include a magnetic recording device and a semiconductor memory. For distribution on the market, the program may be stored and distributed on a portable recording medium such as a CD-ROM (Compact Disc Read Only Memory) or a floppy disk, or it may be stored in a storage device of a computer connected via a network and transferred to other computers through the network. When executed by a computer, the program is stored in a hard disk device or the like in the computer, loaded into the main memory, and executed.
[0096]
[Effects of the Invention]
As described above, in the document association apparatus of the present invention, the keyword in the document is associated with the processing target element of the associated document, and the upper structure and the lower structure of the element in the associated document are associated. Therefore, the elements in other documents and the substructures of the elements can be sequentially traced from the keywords in the document, and the minimum associated contents can be referred to. Moreover, since keywords are extracted only from specific elements, complicated processing associated with keyword extraction can be performed on a limited range, and association processing can be performed at high speed.
[0097]
In the document browsing apparatus of the present invention, the keyword in the document is associated with the processing target element of the associated document, and the upper structure and the lower structure of the element in the associated document are associated with each other. When a keyword is specified, the contents of the elements related to the keyword and the contents of their substructure are extracted. Therefore, the user who specified the keyword can refer to the contents of the minimum necessary elements related to the keyword.
[0098]
In the computer-readable recording medium on which the document association program of the present invention is recorded, the recorded document association program is executed by the computer, thereby associating the keyword in the document with the processing target element of the associated document. Thus, it is possible to cause the computer to perform the process of associating the upper structure and the lower structure of the element in the associated document at high speed. That is, it is possible to cause a computer to perform a process of associating a keyword in a document with a minimum related description of another document at high speed.
[0099]
In the computer-readable recording medium on which the document browsing program of the present invention is recorded, the recorded document browsing program is executed by the computer, thereby associating the keyword in the document with the processing target element of the related target document. , The process of extracting the contents of the related elements of the keyword and the contents of the substructure when the keyword in the document is specified by associating the upper structure and the lower structure of the element in the target document. It is possible to make the computer perform. That is, a user who has specified a keyword for the computer can refer to the contents of the minimum necessary related elements related to the keyword.
[Brief description of the drawings]
FIG. 1 is a principle configuration diagram of the present invention.
FIG. 2 is a diagram illustrating a configuration of a document browsing apparatus to which the present invention is applied.
FIG. 3 is a flowchart showing a procedure for associating documents.
FIG. 4 is a diagram showing a first example of a document including a keyword to be associated in a heading.
FIG. 5 is a diagram illustrating a document in which a unique identifier is assigned to each element.
FIG. 6 is a diagram illustrating an example of a document in which titles, headings, and contents are associated with each other.
FIG. 7 is a diagram illustrating an example of a keyword correspondence table.
FIG. 8 is a diagram illustrating an example of a document including a keyword to be associated in a text.
FIG. 9 is a diagram illustrating an example of a document in which a keyword and a title are associated with each other.
FIG. 10 is a diagram showing a result of associating a hierarchical structure with the document of FIG. 9.
FIG. 11 is a flowchart showing an association usage procedure.
FIG. 12 is a diagram illustrating an example of a display screen when the contents of a document are displayed.
FIG. 13 is a diagram illustrating an example of a display screen when a heading is displayed.
FIG. 14 is a diagram showing an example of a display screen when a lower heading is displayed.
FIG. 15 is a diagram showing an example of a display screen when content is displayed.
FIG. 16 is a flowchart showing a flow of association reference processing in the second embodiment.
FIG. 17 is a diagram illustrating a second example of a document including a keyword to be associated in the title.
FIG. 18 is a diagram illustrating a third example of a document including a keyword to be associated in a title.
FIG. 19 is a diagram illustrating an example of a document in which a keyword is associated with a title or a heading.
FIG. 20 is a diagram illustrating an example of a display screen that displays a plurality of headings.
[Explanation of symbols]
1 Document storage means
2 Hierarchical structure association means
2a Target document
3 Keyword extraction means
3a Keyword correspondence table
4 Document content search means
4a Document
5 Keyword association means
5a Document

Claims


A document browsing apparatus for browsing the contents of structured documents, comprising:
document storage means for storing a plurality of documents, each composed of a plurality of elements for which attributes are set and each assigned a document identifier;
identifier setting means for sequentially selecting the plurality of documents stored in the document storage means as association target documents and setting, in the tag of each element in the selected association target document, an element identifier for uniquely identifying that element;
keyword extraction means for extracting a keyword from the content of a processing target element having a specific attribute in the association target document in which the element identifiers have been set by the identifier setting means;
document content search means for searching the document storage means, for each keyword extracted by the keyword extraction means, for documents that contain the keyword, other than the association target document from which the keyword was extracted;
keyword association means for setting, for each document containing the keyword found by the document content search means, a plurality of attribute values, each consisting of a pair of the document identifier of the association target document from which the keyword was extracted and the element identifier of the processing target element containing the keyword in that association target document, in a tag that is attached to the keyword in the document and in which a plurality of attribute values indicating elements in other documents related to the keyword can be registered;
document display means for extracting, in response to a document browsing request, the document specified in the document browsing request from the document storage means and displaying it;
element extraction means for extracting, when a keyword having a tag in which one or more attribute values indicating elements in other documents are registered is selected by an operation input in the document displayed by the document display means, the processing target elements in the association target documents related to the keyword, based on the attribute values set in the tag of the keyword; and
content display means for displaying the contents of the processing target elements extracted by the element extraction means.

The document browsing apparatus according to claim 4, wherein, when a plurality of attribute values are set in the tag of the selected keyword and a plurality of processing target elements are extracted based on those attribute values, the element extraction means accepts an operation input selecting one of the plurality of processing target elements displayed by the content display means and extracts lower-structure elements only for the selected processing target element, and, when a plurality of lower-structure elements are extracted, accepts an operation input selecting one of the plurality of lower-structure elements displayed by the content display means and extracts further lower-structure elements only for the selected lower-structure element.

The document browsing apparatus according to claim 4, wherein, when a plurality of attribute values are set in the tag of the selected keyword, the element extraction means determines, from the document identifier contained in each attribute value, the document that contains the processing target element corresponding to that attribute value, groups the processing target elements contained in the same document into the same group, and arranges the processing target elements so that those belonging to the same group are consecutive, and
the document display means displays the contents of the processing target elements in the order arranged by the element extraction means.

The document browsing apparatus according to claim 6, wherein, when the processing target elements have been grouped, the element extraction means calculates the importance of each group based on the number of processing target elements belonging to the group and the hierarchy depth, in the logical structure of the document, of the processing target elements belonging to the group, and rearranges the groups according to the importance of each group.

The document browsing apparatus according to claim 6, wherein the element extraction means rearranges, within each group, the processing target elements grouped for each document according to their hierarchy depth in the logical structure and their order of appearance in the document.

構造化文書の内容を閲覧するための文書閲覧プログラムを記録したコンピュータ読み取り可能な記録媒体において、In a computer-readable recording medium that records a document browsing program for browsing the contents of a structured document,
コンピュータを、  Computer
document storage means for storing a plurality of documents, each composed of a plurality of elements having attributes set and each assigned a document identifier;
identifier setting means for sequentially selecting the plurality of documents stored in the document storage means as association-target documents and setting, in the tag of each element in the selected association-target document, an element identifier for uniquely identifying that element;
keyword extraction means for extracting keywords from the content of a processing-target element having a specific attribute in the association-target document in which the element identifiers have been set by the identifier setting means;
document content search means for searching the document storage means, for each keyword extracted by the keyword extraction means, for documents that contain the keyword, other than the association-target document from which the keyword was extracted;
keyword association means for setting, for each document containing the keyword found by the document content search means, in a tag that is associated with the keyword in that document and in which a plurality of attribute values indicating elements in other documents related to the keyword can be registered, a plurality of attribute values each consisting of a pair of the document identifier of the association-target document from which the keyword was extracted and the element identifier of the processing-target element containing the keyword in that association-target document;
document display means for retrieving, in response to a document browsing request, the document specified in the request from the document storage means and displaying it;
element extraction means for extracting, when a keyword having a tag in which one or more attribute values indicating elements in other documents are registered is selected by an operation input in the document displayed by the document display means, the processing-target element in the association-target document related to the keyword, based on the attribute values set in the tag of the keyword; and
content display means for displaying the content of the processing-target element extracted by the element extraction means,
the computer-readable recording medium being characterized in that the recorded document browsing program causes the computer to function as the above means.
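
The flow recited in this claim (identifier setting, keyword extraction, content search, keyword association, element extraction) can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `Element`, `Document`, `associate`, and `extract_targets` are hypothetical, keyword extraction is assumed to be a simple lookup in a fixed vocabulary, and the multi-valued tag attribute is modeled as a dictionary from keyword to a list of (document identifier, element identifier) pairs.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    attr: str          # attribute of the element, e.g. "definition" (assumed label)
    text: str          # content of the element
    elem_id: str = ""  # element identifier, filled in by assign_element_ids
    # keyword -> list of (doc_id, elem_id) pairs; stands in for the tag
    # attribute in which several values can be registered (assumption)
    refs: dict = field(default_factory=dict)


@dataclass
class Document:
    doc_id: str
    elements: list


def assign_element_ids(doc):
    """Identifier setting: give each element an identifier unique in its document."""
    for n, elem in enumerate(doc.elements, start=1):
        elem.elem_id = f"e{n}"


def associate(store, target_attr, vocabulary):
    """Select each document in turn as the association-target document and
    register (doc_id, elem_id) pairs on matching keywords in the other documents."""
    for src in store.values():
        assign_element_ids(src)
        for elem in src.elements:
            if elem.attr != target_attr:  # only processing-target elements
                continue
            # keyword extraction (assumed: fixed vocabulary, substring match)
            for kw in (w for w in vocabulary if w in elem.text):
                # content search: every document except the extraction source
                for other in store.values():
                    if other.doc_id == src.doc_id:
                        continue
                    for o in other.elements:
                        if kw in o.text:
                            o.refs.setdefault(kw, []).append((src.doc_id, elem.elem_id))


def extract_targets(store, elem, keyword):
    """Element extraction: resolve the pairs registered for a selected keyword."""
    found = []
    for doc_id, elem_id in elem.refs.get(keyword, []):
        for e in store[doc_id].elements:
            if e.elem_id == elem_id:
                found.append(e.text)
    return found


store = {
    "d1": Document("d1", [Element("definition", "A hyperlink is a reference to another document.")]),
    "d2": Document("d2", [Element("paragraph", "Click the hyperlink to continue.")]),
}
associate(store, "definition", ["hyperlink"])
print(extract_targets(store, store["d2"].elements[0], "hyperlink"))
```

In this sketch the link direction follows the claim: the pair is stored in the document that merely mentions the keyword and points to the processing-target element in the document from which the keyword was extracted.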

In a computer-readable recording medium recording a document browsing program for browsing the content of structured documents,
a computer is caused to function as:
document storage means for storing a plurality of documents, each having a hierarchical logical structure formed by a plurality of elements having attributes set and each assigned a document identifier;
hierarchical structure association means for sequentially selecting the plurality of documents stored in the document storage means as association-target documents, setting, in the tag of each element in the selected association-target document, an element identifier for uniquely identifying that element, and setting, in the tag of each upper-level element in the hierarchical structure of the association-target document, the element identifiers of its lower-level elements;
keyword extraction means for extracting keywords from the content of a processing-target element having a specific attribute in the association-target document in which the element identifiers have been set by the hierarchical structure association means;
document content search means for searching the document storage means, for each keyword extracted by the keyword extraction means, for documents that contain the keyword, other than the association-target document from which the keyword was extracted;
keyword association means for setting, for each document containing the keyword found by the document content search means, in a tag that is associated with the keyword in that document and in which a plurality of attribute values indicating elements in other documents related to the keyword can be registered, a plurality of attribute values each consisting of a pair of the document identifier of the association-target document from which the keyword was extracted and the element identifier of the processing-target element containing the keyword in that association-target document;
document display means for retrieving, in response to a document browsing request, the document specified in the request from the document storage means and displaying it;
element extraction means for extracting, when a keyword having a tag in which one or more attribute values indicating elements in other documents are registered is selected by an operation input in the document displayed by the document display means, the processing-target element in the association-target document related to the keyword, based on the attribute values set in the tag of the keyword, and for sequentially extracting the lower-level elements of the processing-target element, based on the element identifiers set in the tag of the extracted processing-target element; and
content display means for displaying the content of the processing-target element extracted by the element extraction means and the content of the lower-level elements associated with the processing-target element,
the computer-readable recording medium being characterized in that the recorded document browsing program causes the computer to function as the above means.
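
The hierarchical part of this claim, in which the tag of an upper-level element carries the element identifiers of its lower-level elements and extraction then follows those identifiers in turn, can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the names `Node`, `assign_ids`, and `collect` are hypothetical, identifiers are assumed to be assigned in depth-first order, and the `child_ids` list stands in for the attribute set on the parent's tag.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    attr: str                                      # e.g. "section", "paragraph" (assumed labels)
    text: str = ""                                 # the element's own content
    children: list = field(default_factory=list)   # lower-level elements
    elem_id: str = ""                              # element identifier
    child_ids: list = field(default_factory=list)  # identifiers of lower-level elements


def assign_ids(root):
    """Set an element identifier on every node and record, on each
    upper-level node, the identifiers of its lower-level nodes.
    Returns an index from element identifier to node."""
    index = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        node.elem_id = f"e{counter}"
        index[node.elem_id] = node
        for child in node.children:
            visit(child)
        node.child_ids = [child.elem_id for child in node.children]

    visit(root)
    return index


def collect(index, elem_id):
    """Extract the content of the element with the given identifier, then
    follow the identifiers recorded on it to extract its lower-level elements."""
    node = index[elem_id]
    contents = [node.text] if node.text else []
    for child_id in node.child_ids:
        contents.extend(collect(index, child_id))
    return contents


doc = Node("document", children=[
    Node("section", "Hyperlinks", [
        Node("paragraph", "A hyperlink points to another document."),
        Node("paragraph", "Links may be one-way."),
    ]),
    Node("section", "Other"),
])
index = assign_ids(doc)
print(collect(index, doc.children[0].elem_id))
```

Storing the child identifiers on the parent means that a link to a heading-like element is enough to display the whole subsection beneath it, which matches the content display recited at the end of the claim.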

In a document browsing method for browsing the content of structured documents,
document storage means stores a plurality of documents, each composed of a plurality of elements having attributes set and each assigned a document identifier;
identifier setting means sequentially selects the plurality of documents stored in the document storage means as association-target documents and sets, in the tag of each element in the selected association-target document, an element identifier for uniquely identifying that element;
keyword extraction means extracts keywords from the content of a processing-target element having a specific attribute in the association-target document in which the element identifiers have been set by the identifier setting means;
document content search means searches the document storage means, for each keyword extracted by the keyword extraction means, for documents that contain the keyword, other than the association-target document from which the keyword was extracted;
keyword association means sets, for each document containing the keyword found by the document content search means, in a tag that is associated with the keyword in that document and in which a plurality of attribute values indicating elements in other documents related to the keyword can be registered, a plurality of attribute values each consisting of a pair of the document identifier of the association-target document from which the keyword was extracted and the element identifier of the processing-target element containing the keyword in that association-target document;
document display means retrieves, in response to a document browsing request, the document specified in the request from the document storage means and displays it;
element extraction means extracts, when a keyword having a tag in which one or more attribute values indicating elements in other documents are registered is selected by an operation input in the document displayed by the document display means, the processing-target element in the association-target document related to the keyword, based on the attribute values set in the tag of the keyword; and
content display means displays the content of the processing-target element extracted by the element extraction means,
the document browsing method being characterized by the above.

In a document browsing method for browsing the content of structured documents,
document storage means stores a plurality of documents, each having a hierarchical logical structure formed by a plurality of elements having attributes set and each assigned a document identifier;
hierarchical structure association means sequentially selects the plurality of documents stored in the document storage means as association-target documents, sets, in the tag of each element in the selected association-target document, an element identifier for uniquely identifying that element, and sets, in the tag of each upper-level element in the hierarchical structure of the association-target document, the element identifiers of its lower-level elements;
keyword extraction means extracts keywords from the content of a processing-target element having a specific attribute in the association-target document in which the element identifiers have been set by the hierarchical structure association means;
document content search means searches the document storage means, for each keyword extracted by the keyword extraction means, for documents that contain the keyword, other than the association-target document from which the keyword was extracted;
keyword association means sets, for each document containing the keyword found by the document content search means, in a tag that is associated with the keyword in that document and in which a plurality of attribute values indicating elements in other documents related to the keyword can be registered, a plurality of attribute values each consisting of a pair of the document identifier of the association-target document from which the keyword was extracted and the element identifier of the processing-target element containing the keyword in that association-target document;
document display means retrieves, in response to a document browsing request, the document specified in the request from the document storage means and displays it;
element extraction means extracts, when a keyword having a tag in which one or more attribute values indicating elements in other documents are registered is selected by an operation input in the document displayed by the document display means, the processing-target element in the association-target document related to the keyword, based on the attribute values set in the tag of the keyword, and sequentially extracts the lower-level elements of the processing-target element, based on the element identifiers set in the tag of the extracted processing-target element; and
content display means displays the content of the processing-target element extracted by the element extraction means and the content of the lower-level elements associated with the processing-target element,
the document browsing method being characterized by the above.