JP4378106B2

Info

Publication number: JP4378106B2
Application number: JP2003130785A
Authority: JP
Inventors: 哲郎長束
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2003-05-08
Filing date: 2003-05-08
Publication date: 2009-12-02
Anticipated expiration: 2023-05-08
Also published as: JP2004334602A

Description

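[0001]
[Technical field of the invention]
The present invention relates to a document search apparatus, a document search method, and a program that use information extraction and information retrieval techniques, and that are suitable for, for example, document search systems, document classification systems, and document analysis systems.
[0002]
[Prior art]
In recent years, text mining techniques for analyzing large volumes of text data, such as questionnaire data and call center data, have attracted attention.
When analyzing such large volumes of text data, extracting the characteristic concepts contained in a document set is one of the major challenges. Information extraction techniques that cover concept information are being studied as a way to discover knowledge in large amounts of document data. In analyzing text data, it is also important to know what concepts the text data contains.
[0003]
To extract characteristic concepts and relations between concepts, conventional techniques either use frequency information within the text data or prepare a concept dictionary or category dictionary in advance and use the information in it.
[0004]
With these approaches, however, only statistically significant concepts or concepts registered in a dictionary can be extracted. In text data analysis, for example when looking for new ideas, some information matters to the user even though it is neither statistically significant nor a known concept that could be registered in a dictionary. Text data analysis therefore also needs a function that lets the user freely explore the concepts contained in the text data.
[0005]
A further problem is that the concept expressions provided by such systems are not in a form that lets the user freely expand or narrow down a concept. Concepts in text data are expressed as combinations of words, but the combinations vary widely, and which concepts are useful also varies with the user's needs and viewpoint.
[0006]
What is needed, therefore, is a way of expressing the concepts contained in text data in a form that is easy for the user to understand and to operate on: a concept expression that lets the user, without any special knowledge of grammar, get an overview of the concepts in the text and search, expand, and narrow down concepts.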
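[0007]
Among technical documents filed before the present invention, there is an invention that, for a text set, extracts important sentences from each text, extracts keywords from those sentences, groups the sentences based on their dependency structure and a thesaurus dictionary, and performs statistical processing on frequency information along the keyword and sentence-group axes to extract characteristic sentences and keywords (see, for example, Patent Document 1).
[0008]
There is also a system that, for a text set, extracts keywords with categories from each text using a category dictionary, extracts combinations of keywords based on phrase dependency relations, and calculates their correlation statistically (see, for example, Patent Document 2).
[0009]
There is also an invention that, for a text set, extracts words from each text and generates syntax trees based on phrase dependency relations, then extracts frequently occurring patterns under given pattern constraints and outputs the documents whose syntax trees contain those patterns (see, for example, Patent Document 3).
[0010]
There is also an invention that, for a document set, extracts concepts from each document using a concept definition dictionary in which classification axes such as actions and results are described in advance, and classifies the documents using compound concepts that combine concepts belonging to different classifications (see, for example, Patent Document 4).
[0011]
[Patent Document 1]
JP 2000-172691 A
[Patent Document 2]
JP 2001-75966 A
[Patent Document 3]
JP 2001-84250 A
[Patent Document 4]
JP 2001-147937 A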
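[0012]
[Problems to be solved by the invention]
However, because the invention of Patent Document 1 uses frequency information, it can extract only statistically significant information. It also requires a thesaurus dictionary.
[0013]
In the system of Patent Document 2, a concept is a keyword to which a category has been assigned, so a category dictionary is required in advance. A concept is basically expressed by a single word, and relations between concepts are merely relations between two concepts (keywords), so a single concept cannot be expressed by multiple words. In addition, because the relations between characteristic categories are obtained statistically, only statistically significant information can be extracted.
[0014]
In the knowledge extraction method of Patent Document 3, knowledge is a syntax tree pattern and knowledge extraction is the extraction of frequent patterns, so again only statistically significant information can be extracted. Nor is any consideration given to letting the user freely manipulate the knowledge or its expression.
[0015]
In the system of Patent Document 4, concepts must be described in a dictionary in advance, so the user cannot freely express or specify a concept and explore the concepts contained in a text set.
[0016]
The present invention has been made in view of the above circumstances, and its object is to provide a document search apparatus, a document search method, and a program that search text data for concept expressions obtained by extending a specified concept expression.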
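[0017]
[Means for solving the problems]
To achieve this object, the present invention has the following features.
<Document search apparatus>
A document search apparatus according to the present invention comprises: text data structure storage means that manages, in association with each other on a per-phrase basis, a token, which is a word that expresses a meaning by itself and is extracted from the words contained in a phrase (bunsetsu) of the text data, and intention expression information attached to that phrase, which is obtained from specific expression patterns of the attached words contained in the phrase; concept expression specifying means that specifies concept expression information containing at least one token; and extended concept expression extraction means that identifies, from the text data structure storage means, the phrase associated with a token contained in the concept expression information specified by the concept expression specifying means, and extracts extended concept expression information formed by combining the intention expression information associated with the identified phrase with the specified concept expression information.
[0018]
<Document search method>
A document search method according to the present invention is performed by a document search apparatus having text data structure storage means that manages, in association with each other on a per-phrase basis, a token, which is a word that expresses a meaning by itself and is extracted from the words contained in a phrase of the text data, and intention expression information attached to that phrase, which is obtained from specific expression patterns of the attached words contained in the phrase. The method comprises: a concept expression specifying step of specifying concept expression information containing at least one token; and an extended concept expression extraction step of identifying, from the text data structure storage means, the phrase associated with a token contained in the concept expression information specified in the concept expression specifying step, and extracting extended concept expression information formed by combining the intention expression information associated with the identified phrase with the specified concept expression information.
[0019]
<Program>
A program according to the present invention causes a computer to execute the document search method described above.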
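[0052]
[Embodiments of the invention]
(Features of the invention)
First, the features of the document search apparatus according to the present invention are described.
The concept expression method used in the document search apparatus according to the present invention expresses the concepts contained in a text using basic units of concept expression, which are extracted from phrase information, and relations between those basic units, which are extracted from inter-phrase relation information. A basic unit basically corresponds to one phrase and is expressed as a combination of a token, which is the independent word in the phrase, and intention expressions, which are extracted by matching specific patterns of the attached words in the phrase.
[0053]
By chaining basic units, this method can express and specify a concept made up of multiple words, for example "latest ⇒ OS ⇒ install (+possible +negation)". The expression is easy for the user to understand and also easy for the user to operate on, for example to extend.
[0054]
Using this concept expression method, the document search apparatus according to the present invention searches the text data for concept expressions that extend a specified concept expression. For example, when the user is interested in the concept expression "understand (+negation)", meaning "do not understand", concept expressions that extend "understand (+negation)" can be retrieved from the text data, as in the following example.
[0055]
(Example)
Specified concept expression: "understand (+negation)"
Retrieved concept expression: "usage ⇒ understand (+negation)"
Retrieved concept expression: "operation ⇒ understand (+negation)"
Retrieved concept expression: "meaning ⇒ understand (+negation)"
Retrieved concept expression: "understand (+negation) ⇒ user"
Retrieved concept expression: "understand (+negation) ⇒ reason"
[0056]
This lets the user find out "what is not understood" and "what the 'not understood' modifies". Extending a concept expression narrows the concept semantically. This is effective for understanding the concepts contained in text data and for finding the needed concepts among a large number of concepts.
[0057]
The document search apparatus according to the present invention is described in detail below with reference to the accompanying drawings. FIG. 1 is a configuration diagram of the document search apparatus according to the present invention. FIG. 2 is an example of the text data structure. FIG. 3 is an example of the information managed by each component of the text data structure. FIG. 4 is an example of a word list. FIG. 5 is an example of phrase management information. FIG. 6 is a flowchart of the concept expression search method.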
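[0058]
(Concept expression)
First, the concept expression used in this embodiment is described.
[0059]
The concept expression in the present invention is based on phrases and inter-phrase relation information obtained by linguistic analysis of the text. Morphological analysis and phrase dependency analysis, for example, can be used as the linguistic analysis. Morphological analysis analyzes the words contained in the text. Dependency analysis analyzes the phrases contained in the text and identifies, as the relation between phrases, which phrase depends on which head phrase.
[0060]
For example, for the text 「ソフトウェアのインストールが正常に実行できない」 ("the software installation cannot be executed normally"), linguistic analysis yields the information shown in (Example 1) below.
[0061]
[Table 1]

[0062]
In (Example 1), 「自」 marks an independent word and 「付」 marks an attached word. Independent words are words whose part of speech is verb, adjective, noun, and so on; attached words are words such as particles and auxiliary verbs. A phrase normally consists of one independent word and zero or more attached words. Some analysis methods produce phrases containing more than one independent word, but this embodiment uses an analysis method that always generates phrases containing exactly one independent word.
[0063]
A concept expression is expressed by basic units and by expressions of the relations between basic units. A basic unit is expressed using a token and intention expressions.
[0064]
A token is a word that expresses a meaning by itself; independent words can be used as tokens. In (Example 1), "software", "install", "normal", and "execute" are the tokens. A token may be written in its surface form or converted to a representative form.
[0065]
An intention expression represents meaning added to a phrase by its attached words; the intention attached to a phrase is determined by extracting specific expression patterns of the attached words. For example, the expressions 「〜ない」 (auxiliary verb) and 「〜ず」 (auxiliary verb) can be taken to add the meaning "negation" to the phrase, 「〜できる」 (subsidiary verb) the meaning "possible", and 「〜たい」 (auxiliary verb) the meaning "desire". From the phrase 「実行できない」 ("cannot execute") in (Example 1), the intention expressions "possible" and "negation" are extracted.
[0066]
Intention expressions can be written, for example, as "(+negation)" or "(+possible -negation)", where "+XX" means that the intention expression is attached and "-XX" means that it is not attached.
[0067]
A basic unit is expressed by a token alone, by intention expressions alone, or by a combination of a token and intention expressions, for example as follows.
[0068]
Examples of basic units: "purchase", "(+possible)", "execute (+possible +negation)"
[0069]
A combination of a token and an intention expression means that a phrase contains the specified token and that the specified intention expression is attached to that same phrase.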
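[0070]
A relation between basic units indicates that the units are strongly related semantically, which basically means that they are contained in phrases that stand in a dependency relation. If the relation between basic units is written "⇒", the concept expression "information ⇒ search" means that, of two phrases in a dependency relation, the dependent phrase contains "information" and the head phrase contains "search" (as in 「情報を検索する」, "search for information").
[0071]
Using phrase dependency as the relation between basic units makes it possible to specify that the basic units appear in the text with a strong semantic relation, rather than merely specifying co-occurrence within the text, as a Boolean word query such as "software & install" in document search does.
[0072]
In a phrase dependency relation, a dependent phrase has only one head phrase, but several dependent phrases can depend on the same head phrase (in (Example 1), phrase 4 is the head of both phrase 2 and phrase 3). The relations between basic units in a concept expression can therefore be expressed in two ways: (1) without expressing the inter-phrase relation in which a head phrase has several dependents, or (2) expressing that relation.
[0073]
(1) When the inter-phrase relation of a head phrase with several dependents is not expressed, the concept expression is a simple one-dimensional list of basic units.
[0074]
[Table 2]

[0075]
(2) When the inter-phrase relation of a head phrase with several dependents is expressed, the concept expression is a tree of basic units.
[0076]
[Table 3]

[0077]
In case (1), the concept expression is simple and easy for the user to understand and to operate on, for example to extend, but it cannot express complex dependency structures. In case (2), complex dependency structures can be expressed, but the expression is likely to be complicated, hard for the user to understand, and hard to operate on. Both (1) and (2) can be used, but this embodiment is described using method (1), which is easier for the user to understand and operate on.
[0078]
Examples of concept expressions that can be generated from the text of (Example 1) are shown in (Example 2) below.
[0079]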
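(Example 2)
Concept expression 1: software
Concept expression 2: install
Concept expression 3: normal
Concept expression 4: execute
Concept expression 5: execute (+possible)
Concept expression 6: execute (+negation)
Concept expression 7: execute (+possible +negation)
Concept expression 8: software ⇒ install
Concept expression 9: install ⇒ execute
Concept expression 10: install ⇒ execute (+possible)
Concept expression 11: install ⇒ execute (+negation)
Concept expression 12: install ⇒ execute (+possible +negation)
Concept expression 13: normal ⇒ execute
Concept expression 14: normal ⇒ execute (+possible)
Concept expression 15: normal ⇒ execute (+negation)
Concept expression 16: normal ⇒ execute (+possible +negation)
Concept expression 17: software ⇒ install ⇒ execute
Concept expression 18: software ⇒ install ⇒ execute (+possible)
Concept expression 19: software ⇒ install ⇒ execute (+negation)
Concept expression 20: software ⇒ install ⇒ execute (+possible +negation)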
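As a rough illustration of how the twenty concept expressions of (Example 2) follow from the four phrases of (Example 1), the sketch below walks every dependency chain and combines each token with every subset of its intention expressions. The data layout, the English token glosses, and all function names are assumptions made for this illustration; the patent does not prescribe an enumeration procedure.

```python
from itertools import combinations, product

# Example 1 with tokens glossed in English (hypothetical layout):
# phrase id -> (token, intention expressions, head phrase id)
PHRASES = {
    1: ("software", (), 2),
    2: ("install", (), 4),
    3: ("normal", (), 4),
    4: ("execute", ("possible", "negation"), None),
}

def chains(phrases):
    """Yield every dependency chain: start anywhere, then follow head links."""
    for start in phrases:
        chain = [start]
        yield tuple(chain)
        head = phrases[start][2]
        while head is not None:
            chain.append(head)
            yield tuple(chain)
            head = phrases[head][2]

def render(token, intentions):
    """Write one basic unit, e.g. execute(+possible+negation)."""
    suffix = "(" + "".join("+" + i for i in intentions) + ")" if intentions else ""
    return token + suffix

def unit_variants(token, intentions):
    """The token alone plus the token with every subset of its intentions."""
    return [render(token, subset)
            for r in range(len(intentions) + 1)
            for subset in combinations(intentions, r)]

def concept_expressions(phrases):
    """Collect every list-style concept expression the sentence can produce."""
    found = set()
    for chain in chains(phrases):
        options = [unit_variants(phrases[p][0], phrases[p][1]) for p in chain]
        for combo in product(*options):
            found.add("⇒".join(combo))
    return found

expressions = concept_expressions(PHRASES)
```

Because chains only follow head links, "software ⇒ execute" is not produced: those two phrases are not in a direct dependency relation.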
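[0080]
(Extended search of concept expressions)
Using the concept expressions described above, the document search apparatus according to the present invention employs a concept search method that extracts from the text concept expressions that extend a specified concept expression. For (Example 1), when the concept expression "install ⇒ execute" is specified, the concept expressions of (Example 2) that contain "install ⇒ execute" are extracted as extended concepts, as shown in (Example 3) below.
[0081]
(Example 3)
Extended concept expression 1: install ⇒ execute (+possible)
Extended concept expression 2: install ⇒ execute (+negation)
Extended concept expression 3: install ⇒ execute (+possible +negation)
Extended concept expression 4: software ⇒ install ⇒ execute
Extended concept expression 5: software ⇒ install ⇒ execute (+possible)
Extended concept expression 6: software ⇒ install ⇒ execute (+negation)
Extended concept expression 7: software ⇒ install ⇒ execute (+possible +negation)
[0082]
Next, the processing performed by the document search apparatus according to the present invention is described.
As shown in FIG. 1, the document search apparatus comprises text input means 101, linguistic analysis means 102, token extraction means 103, intention expression extraction means 104, text data structure storage means 105, concept expression specifying means 106, condition specifying means 107, extended concept expression extraction means 108, and extended concept expression storage means 109.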
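[0083]
(Text input)
The text input means 101 inputs text. Text that has already been recorded can also be used as input. One text or several may be input; the description below assumes that several texts have been input.
[0084]
(Linguistic analysis)
The linguistic analysis means 102 performs morphological analysis and dependency analysis on each input text. Morphological analysis analyzes the words contained in the text. Dependency analysis analyzes the sentences and phrases contained in the text and identifies which phrases stand in a dependent-head relation. An example result for the sentence 「ソフトウェアのインストールが正常に実行できない」 ("the software installation cannot be executed normally") is shown below. Word boundaries are marked with "/", and 「自」 above a word marks an independent word and 「付」 an attached word.
[0085]
[Table 4]

[0086]
A phrase normally contains one independent word. Some methods produce phrases containing several independent words, but this embodiment uses a method in which every phrase contains exactly one independent word.
[0087]
(Token extraction)
The token extraction means 103 extracts a token from each phrase analyzed by the linguistic analysis means 102: from the word information of the phrase, the word whose part of speech is that of an independent word is extracted as the token.
[0088]
For the (text example) above, the result is as follows.
Phrase 1 token: software
Phrase 2 token: install
Phrase 3 token: normal
Phrase 4 token: execute
[0089]
(Intention expression extraction)
The intention expression extraction means 104 extracts intention expressions from each phrase analyzed by the linguistic analysis means 102: it extracts specific expression patterns from the word information of the phrase and generates intention expression information. For example, the intention expressions "negation", "desire", "question", and "possible" can be extracted when the phrase contains words or expression patterns such as the following.
[0090]
Intention expression "negation": auxiliary verb 「ない」, auxiliary verb 「ず」, auxiliary verb 「まい」, auxiliary 「にくい」 (補助助動詞)
Intention expression "desire": auxiliary verb 「たい」
Intention expression "question": sentence-final particle 「か」, sentence-final particle 「か」 + sentence-final particle 「な」, symbol 「？」
Intention expression "possible": subsidiary verb 「できる」, auxiliary verb 「れる」, auxiliary verb 「られる」
[0091]
For the (text example) above, the following intention expression is extracted.
[0092]
Phrase 4 intention expression: +possible +negation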
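The pattern list above can be read as a lookup from attached words to intention expressions. The following is a minimal sketch under that reading; it assumes the morphological analyzer supplies a lemma and a part-of-speech label for each attached word, and the table layout, label strings, and function name are hypothetical. The two-particle pattern 「か」+「な」 is covered here only through its first particle.

```python
# (lemma, part of speech) -> intention expression, following the pattern list above.
# The part-of-speech strings are assumed to match whatever the analyzer emits.
INTENTION_PATTERNS = {
    ("ない", "助動詞"): "negation",
    ("ず", "助動詞"): "negation",
    ("まい", "助動詞"): "negation",
    ("にくい", "補助助動詞"): "negation",
    ("たい", "助動詞"): "desire",
    ("か", "終助詞"): "question",
    ("？", "記号"): "question",
    ("できる", "補助動詞"): "possible",
    ("れる", "助動詞"): "possible",
    ("られる", "助動詞"): "possible",
}

def extract_intentions(attached_words):
    """Return the intention expressions of one phrase, in order of appearance.

    attached_words is a list of (lemma, part of speech) pairs for the
    attached words of the phrase; each intention is reported once.
    """
    intentions = []
    for lemma, pos in attached_words:
        intention = INTENTION_PATTERNS.get((lemma, pos))
        if intention is not None and intention not in intentions:
            intentions.append(intention)
    return intentions

# Phrase 4 of the text example: 実行 / でき / ない
phrase4 = extract_intentions([("できる", "補助動詞"), ("ない", "助動詞")])
```

A real implementation would probably need context-sensitive patterns as well, since 「れる」 and 「られる」 also mark the passive.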
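[0093]
(Text data structure storage)
The text data structure storage means 105 stores the structure of the text analyzed by the linguistic analysis means 102 together with the tokens and intention expressions extracted by the token extraction means 103 and the intention expression extraction means 104.
[0094]
Text is stored in the data structure shown in FIG. 2, and each component in FIG. 2 holds the information shown in FIG. 3. In this embodiment, as shown in FIG. 4, a word list that assigns a unique identifier to each word contained in the texts is generated and used to manage the words. Part-of-speech information and the overall frequency of occurrence may also be computed at that time.
[0095]
FIG. 3 shows the information held by each component of FIG. 2. Each text is assigned a unique ID and manages the list of IDs of the sentences it contains. Each sentence manages its own sentence ID and the list of phrases it contains. Each phrase manages its own phrase ID, the list of IDs of the words it contains, a dependent phrase ID list, and a head phrase ID. A word ID is the ID in the word list shown in FIG. 4. The dependent phrase ID list holds the IDs of the phrases that depend on the phrase in question.
[0096]
As the (text example) shows, several phrases can depend on one head phrase, so dependents are managed as a list. The head phrase ID is the ID of the phrase on which the phrase in question depends; a dependent phrase can have only one head. Each phrase also manages the token and intention expressions extracted from it.
[0097]
The type of dependency relation, for example adnominal or adverbial modification, can also be held as phrase information.
[0098]
FIG. 5 shows an example of the data held by a phrase.
A synonym dictionary can also be provided so that words with synonyms carry representative-form information. As shown in FIG. 4, this can be done by giving the word list a representative synonym form field.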
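The text, sentence, and phrase records described above might be modeled as follows. This is only a sketch: FIG. 2 to FIG. 5 are not reproduced in this text, so the field names, the word ID values, and the `link_dependents` helper are assumptions rather than the patent's actual layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Phrase:
    phrase_id: int
    word_ids: List[int]                 # IDs into the word list (FIG. 4)
    token: str                          # the single independent word
    intentions: List[str] = field(default_factory=list)
    head_id: Optional[int] = None       # a phrase has at most one head
    dependent_ids: List[int] = field(default_factory=list)

@dataclass
class Sentence:
    sentence_id: int
    phrases: List[Phrase]

    def link_dependents(self) -> None:
        """Rebuild every dependent-ID list from the head IDs."""
        by_id = {p.phrase_id: p for p in self.phrases}
        for p in self.phrases:
            p.dependent_ids = []
        for p in self.phrases:
            if p.head_id is not None:
                by_id[p.head_id].dependent_ids.append(p.phrase_id)

@dataclass
class Text:
    text_id: int
    sentences: List[Sentence]

# The text example, with tokens glossed in English and made-up word IDs.
sentence = Sentence(1, [
    Phrase(1, [1, 2], "software", head_id=2),
    Phrase(2, [3, 4], "install", head_id=4),
    Phrase(3, [5, 6], "normal", head_id=4),
    Phrase(4, [7, 8, 9], "execute", ["possible", "negation"]),
])
sentence.link_dependents()
text = Text(1, [sentence])
```

Storing both the head ID and the dependent ID list is redundant, but it lets a search walk the dependency structure in either direction, which the forward and backward extensions described later rely on.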
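[0099]
(Concept expression specification)
The concept expression specifying means 106 specifies the concept expression to be extended. The user can write it directly with an editor or the like, or, if there is display means that lists the concept expressions contained in the text, select one on that display.
[0100]
(Condition specification)
The condition specifying means 107 specifies the conditions for the extended search.
In this embodiment, for extension by intention expression, the user specifies which basic unit of the specified concept expression is to be extended.
[0101]
For extension by intention expression, the user also specifies the kinds of intention expression to extend with, for example by choosing among "possible", "negation", "desire", and "question".
[0102]
For extension by adding basic units, the user specifies the number of basic units to add.
[0103]
For extension by adding basic units, the user specifies the direction (forward or backward) in which basic units are added.
[0104]
For extension by adding basic units, the user specifies the part of speech of the token of the basic units to be added.
[0105]
For extension by adding basic units, the user specifies the inter-phrase relation of the basic units to be added, for example by choosing among "adnominal modification", "adverbial modification", "case modification", "parallel relation", and "compound relation".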
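[0106]
(Extended concept expression extraction)
The extended concept expression extraction means 108 extracts, based on the information stored in the text data structure storage means 105, concept expressions that extend the concept expression specified by the concept expression specifying means 106.
[0107]
A concept expression can be extended in two ways: (1) by adding intention expressions and (2) by adding basic units. Both are described below.
[0108]
(1) Extension by adding intention expressions
First, extension by adding intention expressions is described.
Extension by intention expression extends the specified concept expression by adding intention expressions to the basic units it contains. For the (text example) above, when "install ⇒ execute" is specified, the following concept expressions are extracted as extensions by intention expression.
[0109]
[Table 5]

[0110]
Extension by intention expression proceeds as follows.
[0111]
① (Search for text data structures that match the specified concept expression)
The data structures stored in the text data structure storage means 105 are searched for structures that match the specified concept expression. FIG. 6 is a flowchart of this search. The flowchart shows the processing for one text; when several texts are targeted, the processing is performed for each text. The processing of FIG. 6 is as follows.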
[0112]
First, the indices i, j, and n of the sentence ID Si, the phrase ID Kj, and the specified concept expression CEn (n = 1 to N) are set to 1 (step S1), and the following processing is performed.
[0113]
It is determined whether phrase Kj contains basic unit CEn (step S2). If it does not (step S2/NO), it is determined whether Kj is the last phrase of sentence Si (step S8).
[0114]
If phrase Kj contains basic unit CEn (step S2/YES), Kj is assigned to the variable Kx (Kx = Kj) (step S3). It is then determined whether n has reached N, that is, whether CEn is the last basic unit of the specified concept expression (step S4). If it is (step S4/YES), concept expression extraction is performed (step S5).
[0115]
If n has not reached N (step S4/NO), n is incremented (n = n + 1) (step S6), and it is determined whether the head phrase Ky of phrase Kx contains basic unit CEn (step S7). If it does (step S7/YES), processing returns to step S4 and repeats. If it does not (step S7/NO), it is determined whether Kj is the last phrase of sentence Si (step S8).
[0116]
If Kj is not the last phrase of Si (step S8/NO), j is incremented (j = j + 1), n is reset to 1, and processing returns to step S2. If Kj is the last phrase of Si (step S8/YES), it is determined whether Si is the last sentence of the text (step S9). If it is (step S9/YES), processing ends. If it is not (step S9/NO), i is incremented (i = i + 1), j and n are reset to 1, and processing returns to step S2.
[0117]
With the processing of FIG. 6, when "install ⇒ execute" is specified for the (text example) above, the text data structure "phrase 2 ⇒ phrase 4" matches.
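The search of FIG. 6 can be sketched for a single sentence as below. The sketch follows one reading of the flowchart, in which a successful match at step S7 advances the search to the head phrase before the next basic unit is tested; the description does not state that update explicitly, so treat it as an assumption. The data layout, English token glosses, and function names are likewise hypothetical.

```python
# One sentence: phrase id -> (token, intention expressions, head phrase id)
SENTENCE = {
    1: ("software", (), 2),
    2: ("install", (), 4),
    3: ("normal", (), 4),
    4: ("execute", ("possible", "negation"), None),
}

def unit_matches(unit, phrase):
    """A basic unit (token, intentions) matches a phrase that has the same
    token and carries at least the unit's intention expressions."""
    token, intentions = unit
    phrase_token, phrase_intentions, _head = phrase
    return token == phrase_token and set(intentions) <= set(phrase_intentions)

def find_matches(sentence, units):
    """Return the phrase-ID chains that match the specified concept expression."""
    matches = []
    for start in sentence:                       # loop over Kj (steps S2 and S8)
        current, path = start, []
        for unit in units:                       # loop over CE1..CEN
            if current is None or not unit_matches(unit, sentence[current]):
                break                            # S2/NO or S7/NO: try the next Kj
            path.append(current)                 # Kx = matched phrase (S3)
            current = sentence[current][2]       # move on to the head phrase Ky
        else:
            matches.append(tuple(path))          # all N units matched (S5)
    return matches

hits = find_matches(SENTENCE, [("install", ()), ("execute", ())])
```

For several sentences or texts, the same function would simply be called once per sentence, mirroring the outer loops over Si and over texts.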
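[0118]
② (Extraction of extended concept expressions based on the retrieved text data structures)
For every retrieved text data structure, extended concept expressions are extracted from its phrase information by extending with intention expressions. The intention expression information of the retrieved "phrase 2" and "phrase 4" is consulted, and extended concept expressions are generated by adding intention expressions not contained in the specified concept expression. Phrase 2 has no intention expression information and phrase 4 has "+possible +negation", so the combinations of intention expressions yield the following three extended concept expressions.
[0119]
Extended concept expression 1: "install ⇒ execute (+possible)"
Extended concept expression 2: "install ⇒ execute (+negation)"
Extended concept expression 3: "install ⇒ execute (+possible +negation)"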
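One way to produce these combinations is to take, for each matched phrase, every subset of the intention expressions that the specified unit does not already carry, and to drop the single combination that adds nothing. The sketch below assumes the matched phrases are already aligned with the specified basic units; the names and the tuple layout are illustrative only.

```python
from itertools import combinations, product

def render(token, intentions):
    """Write one basic unit, e.g. execute(+possible+negation)."""
    suffix = "(" + "".join("+" + i for i in intentions) + ")" if intentions else ""
    return token + suffix

def subsets(items):
    """All subsets of items, smallest first, keeping the original order."""
    for r in range(len(items) + 1):
        yield from combinations(items, r)

def expand_by_intention(units, matched_phrases):
    """units: [(token, specified intentions)];
    matched_phrases: [(token, intentions attached in the text)], aligned with units."""
    per_unit = []
    for (_token, specified), (_ptoken, available) in zip(units, matched_phrases):
        extra = [i for i in available if i not in specified]
        per_unit.append([tuple(specified) + s for s in subsets(extra)])
    expansions = []
    for combo in product(*per_unit):
        if all(len(c) == len(u[1]) for c, u in zip(combo, units)):
            continue  # nothing was added: this is the specified expression itself
        expansions.append("⇒".join(render(u[0], c) for u, c in zip(units, combo)))
    return expansions

result = expand_by_intention(
    [("install", ()), ("execute", ())],
    [("install", ()), ("execute", ("possible", "negation"))],
)
```

Restricting the extension to one basic unit or to chosen intention kinds, as the conditions described later allow, would amount to filtering `extra` before the subsets are formed.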
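[0120]
③ (Recording the extracted extended concept expressions)
The extracted extended concept expressions are stored in the extended concept expression storage means 109. At that time, their frequency of occurrence and the number of texts in which they occur are counted and managed.
[0121]
In this embodiment, the user can specify which basic unit of the specified concept expression is to be extended. In that case, in step ② above, extended concept expressions are extracted based only on the phrase information corresponding to the specified basic unit.
[0122]
The user can also specify the kinds of intention expression to extend with. In that case, in step ② above, extended concept expressions are extracted only for the specified intention expressions. For example, if "+possible" is specified for the (text example) above, only "install ⇒ execute (+possible)" is extracted as an extended concept expression.
[0123]
(2) Extension by adding basic units
Next, extension by adding basic units is described.
Extension by adding basic units extracts extended concept expressions in which new basic units have been added to the specified concept expression. For the (text example) above, when "install" is specified, the following concept expressions are extracted as extensions by added basic units.
[0124]
[Table 6]

[0125]
Any number of basic units may be added, but since extension can be applied repeatedly, it is usually sufficient to extract extended concept expressions with one added basic unit.
[0126]
When the concept expression "install ⇒ execute" is specified and one basic unit is added, the following three patterns are possible.
Pattern 1: XXX ⇒ install ⇒ execute
Pattern 2: install ⇒ execute ⇒ XXX
Pattern 3: install ⇒ XXX ⇒ execute
Patterns 1 and 2 simply add a basic unit before or after the specified concept expression. Pattern 3, however, adds a new basic unit between the basic units of the specified concept expression and so changes the specified concept expression itself, which may change its meaning. An implementation may omit pattern 3; if it performs pattern 3, it must take into account that the specified concept expression itself is changed. Pattern 3 can also be implemented so that it is applied only when the relation to the added basic unit is a particular dependency relation (for example, a compound relation).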
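[0127]
Extension by adding basic units proceeds as follows.
[0128]
① (Generation of extension patterns for the specified concept expression)
Extended concept expression patterns are generated in which new basic units are added to the specified concept expression (before it, after it, or between its basic units). For example, when "install" is specified and one basic unit is added, the following extended concept expression patterns are generated.
Extended concept expression pattern 1: "XXX ⇒ install"
Extended concept expression pattern 2: "install ⇒ XXX"
[0129]
When two basic units are added, the following extended concept expression patterns are generated.
[0130]
Extended concept expression pattern 1: "XXX ⇒ YYY ⇒ install"
Extended concept expression pattern 2: "install ⇒ XXX ⇒ YYY"
[0131]
When, for example, "install ⇒ execute" is specified and one basic unit is added, the following extended concept expression patterns are generated.
[0132]
Extended concept expression pattern 1: "XXX ⇒ install ⇒ execute"
Extended concept expression pattern 2: "install ⇒ execute ⇒ XXX"
Extended concept expression pattern 3: "install ⇒ XXX ⇒ execute"
[0133]
When two basic units are added, the following extended concept expression patterns are generated.
[0134]
Extended concept expression pattern 1: "XXX ⇒ YYY ⇒ install ⇒ execute"
Extended concept expression pattern 2: "XXX ⇒ install ⇒ YYY ⇒ execute"
Extended concept expression pattern 3: "install ⇒ XXX ⇒ YYY ⇒ execute"
Extended concept expression pattern 4: "install ⇒ XXX ⇒ execute ⇒ YYY"
Extended concept expression pattern 5: "install ⇒ execute ⇒ XXX ⇒ YYY"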
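For the case of one added basic unit, pattern generation reduces to inserting a wildcard at each position of the specified expression. The sketch below covers only that case, because the two-unit lists above do not include every possible placement and the rule behind that selection is not spelled out. The function name, the `allow_inner` switch for pattern 3, and the list representation are assumptions.

```python
WILDCARD = "XXX"  # placeholder that may match any phrase

def one_unit_patterns(units, allow_inner=True):
    """Return every pattern obtained by adding one wildcard unit.

    units is the specified concept expression as a list of tokens.
    With allow_inner=False, insertions between existing units (pattern 3
    in the description) are skipped, leaving only front and back additions.
    """
    patterns = []
    for pos in range(len(units) + 1):
        is_inner = 0 < pos < len(units)
        if is_inner and not allow_inner:
            continue
        patterns.append(units[:pos] + [WILDCARD] + units[pos:])
    return patterns

single = one_unit_patterns(["install"])
double = one_unit_patterns(["install", "execute"])
outer_only = one_unit_patterns(["install", "execute"], allow_inner=False)
```

The patterns come out in positional order (front, inner, back), which differs from the numbering used in the description but covers the same three cases.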
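[0135]
② (Search for text data structures that match the extended concept expression patterns)
The data structures stored in the text data structure storage means 105 are searched for structures that match the extended concept expression patterns generated in step ① above. The search follows the flowchart of FIG. 6, except that the extension parts of the patterns ("XXX", "YYY") are treated as matching any phrase. The flowchart of FIG. 6 shows the processing for one text; when several texts are targeted, the processing is performed for each text.
[0136]
For the (text example) above, with "install" as the specified concept expression and one basic unit to add, the following extended concept expression patterns are generated.
[0137]
Extended concept expression pattern 1: "XXX ⇒ install"
Extended concept expression pattern 2: "install ⇒ XXX"
[0138]
The following text data structures match.
[0139]
Extended concept expression pattern 1: "phrase 1 ⇒ phrase 2"
Extended concept expression pattern 2: "phrase 2 ⇒ phrase 4"
[0140]
③ (Extraction of extended concept expressions based on the retrieved text data structures)
For every retrieved text data structure, extended concept expressions are extracted based on the tokens and intention expression information of its phrases.
[0141]
Text data structure "phrase 1 ⇒ phrase 2"
Extended concept expression 1: "software ⇒ install"
[0142]
Text data structure "phrase 2 ⇒ phrase 4"
Extended concept expression 2: "install ⇒ execute"
Extended concept expression 3: "install ⇒ execute (+possible)"
Extended concept expression 4: "install ⇒ execute (+negation)"
Extended concept expression 5: "install ⇒ execute (+possible +negation)"
[0143]
All combinations of intention expressions may be generated when extracting extended concept expressions, but this increases the number of kinds of extended concept expression extracted. An implementation may therefore extend by adding basic units using token information only (extracting only extended concept expressions 1 and 2) and, when intention expression information is wanted, apply "extension by intention expression" to the extracted extended concept expressions.
[0144]
④ (Recording the extracted extended concept expressions)
The extracted extended concept expressions are stored in the extended concept expression storage means 109. At that time, their frequency of occurrence and the number of texts in which they occur are counted and managed.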
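[0145]
In this embodiment, the user can specify the number of basic units to add when extracting extended concepts. In that case, step ① above generates the extended concept expression patterns with the specified number of basic units added.
[0146]
The user can also specify the direction (forward or backward) in which basic units are added. In that case, step ① above generates the extended concept expression patterns with basic units added in the specified direction.
[0147]
The user can also specify the part of speech of the token of the basic units to be added. In that case, when step ② above searches for the extended concept expression patterns, the specified part-of-speech information is used as the matching condition for the extension parts ("XXX", "YYY").
[0148]
The user can also select the basic units to be added by specifying the inter-phrase relation, that is, the relation between basic units. In that case, when step ② above searches for the extended concept expression patterns, the specified inter-phrase relation information is used as the matching condition for the extension parts ("XXX", "YYY").
[0149]
(Extended concept expression storage)
The extended concept expression storage means 109 stores the extended concept expressions extracted by the extended concept expression extraction means 108. Their frequency of occurrence and the number of texts in which they occur can also be stored.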
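Counting both the frequency of occurrence and the number of texts could look like the following. This is a sketch only: the class name, method name, and use of text IDs as keys are assumptions, and the patent does not describe how the storage means is implemented.

```python
from collections import Counter

class ExtendedExpressionStore:
    """Keeps, per extended concept expression, how often it occurred in total
    and in how many distinct texts it occurred."""

    def __init__(self):
        self.frequency = Counter()    # total occurrences
        self.text_count = Counter()   # number of distinct texts
        self._seen = set()            # (text_id, expression) pairs already counted

    def record(self, text_id, expression):
        """Register one occurrence of expression found in the text text_id."""
        self.frequency[expression] += 1
        if (text_id, expression) not in self._seen:
            self._seen.add((text_id, expression))
            self.text_count[expression] += 1

store = ExtendedExpressionStore()
store.record(1, "install⇒execute(+negation)")
store.record(1, "install⇒execute(+negation)")
store.record(2, "install⇒execute(+negation)")
store.record(2, "software⇒install")
```

Keeping the two counts separate distinguishes an expression repeated many times in one text from one that appears across many texts.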
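[0150]
The embodiment described above is a preferred embodiment of the present invention, and various modifications can be made without departing from the gist of the invention. For example, a device that performs the same processing as the document search apparatus can be built by implementing the series of processing operations of the document search apparatus as a program and running it on an information processing device.
[0151]
[Effects of the invention]
As is clear from the above description, the present invention lets the user search for concepts that semantically narrow down a specified concept, so that the concepts contained in text can be understood and explored efficiently.
[Brief description of the drawings]
[FIG. 1] Block diagram showing the configuration of the document search apparatus according to the present invention.
[FIG. 2] Diagram showing an example of the text data structure in this embodiment.
[FIG. 3] Diagram showing an example of the information managed by each component of the text data structure in this embodiment.
[FIG. 4] Diagram showing an example of the word list in this embodiment.
[FIG. 5] Diagram showing an example of the phrase management information in this embodiment.
[FIG. 6] Flowchart showing the concept expression search method in this embodiment.
[Explanation of reference numerals]
101 Text input means
102 Linguistic analysis means
103 Token extraction means
104 Intention expression extraction means
105 Text data structure storage means
106 Concept expression specifying means
107 Condition specifying means
108 Extended concept expression extraction means
109 Extended concept expression storage means
[0001]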
BACKGROUND OF THE INVENTION
The present invention relates to a document search apparatus, a document search method, and a program that use information extraction and information retrieval techniques, and that are suitable for, for example, a document search system, a document classification system, or a document analysis system.
[0002]
[Prior art]
In recent years, text mining techniques aimed at analyzing a large amount of text data such as questionnaire data and call center data have attracted attention.
When analyzing such a large amount of text data, extracting a characteristic concept included in a document set is one of the major issues. Research on information extraction technology including conceptual information is progressing as a method for finding some knowledge from a large amount of document data. In the analysis of text data, it is important to know what concepts are included in text data.
[0003]
To extract characteristic concepts and relations between concepts, conventional techniques either use frequency information within the text data or prepare a concept dictionary or category dictionary in advance and use the information in it.
[0004]
However, this can extract only a statistically significant concept or a concept registered in the dictionary. In the analysis of text data, there is information that is important for the user even if it is not a statistically significant concept such as discovery of an idea or a known concept that can be registered in a dictionary. Therefore, in the analysis of text data, a function that allows the user to freely search for a concept included in the text data is also required.
[0005]
Another problem is that the concept expression provided by the system is not a concept expression that allows the user to freely expand or narrow down the concept. The concept in the text data is expressed by a combination of words, but variations of the combination are various, and a concept useful for the user also varies depending on the user's request and viewpoint.
[0006]
What is needed, therefore, is a way of expressing the concepts contained in text data in a form that is easy for the user to understand and to operate on: a concept expression that lets the user, without special knowledge of grammar, get an overview of the concepts in the text and search, expand, and narrow down concepts.
[0007]
Among technical documents filed before the present invention, there is an invention that, for a text set, extracts important sentences from each text, extracts keywords from those sentences, groups the sentences based on their dependency structure and a thesaurus dictionary, and performs statistical processing using frequency information along the keyword and sentence-group axes to extract characteristic sentences and keywords (see, for example, Patent Document 1).
[0008]
There is also a system that, for a text set, extracts keywords with categories from each text using a category dictionary, extracts combinations of keywords based on phrase dependency relations, and calculates their correlation statistically (see, for example, Patent Document 2).
[0009]
In addition, for the text set, words are extracted from each text, and a syntax tree is generated based on the phrase dependency relationship. There is an invention in which a frequently occurring pattern is extracted based on a given pattern restriction, and a document having a syntax tree including the pattern is output (for example, see Patent Document 3).
[0010]
Further, a concept is extracted from each document using a concept definition dictionary in which classification axes such as actions and results are described in advance for the document set. There is an invention that classifies documents using a composite concept in which concepts belonging to different classifications are combined (see, for example, Patent Document 4).
[0011]
[Patent Document 1]
JP 2000-172691 A
[Patent Document 2]
JP 2001-75966 A
[Patent Document 3]
JP 2001-84250 A
[Patent Document 4]
JP 2001-147937 A
[0012]
[Problems to be solved by the invention]
However, since the invention of Patent Document 1 uses frequency information, only statistically significant information can be extracted. A thesaurus dictionary is also required.
[0013]
In the system of Patent Document 2, a concept is a keyword to which a category has been assigned, so a category dictionary is required in advance. In addition, a concept is basically expressed by one word, and the relationship between concepts is merely the relationship between two concepts (keywords), so it is impossible to express one concept by a plurality of words. Moreover, since the relationship between the characteristic categories is obtained statistically, only statistically significant information can be extracted.
[0014]
In the knowledge extraction method of Patent Document 3, knowledge is a pattern of a syntax tree, and knowledge extraction is the extraction of patterns that appear frequently. For this reason, only statistically significant information can be extracted. Further, no consideration is given to letting the user freely manipulate the knowledge or its expression.
[0015]
In the system of Patent Document 4, concepts need to be described in advance as a dictionary. For this reason, it is impossible for the user to express or specify a concept freely and search for the concepts included in the text set.
[0016]
The present invention has been made in view of the above circumstances, and its object is to provide a document search apparatus, a document search method, and a program for searching text data for concept expressions obtained by extending a specified concept expression.
[0017]
[Means for Solving the Problems]
  In order to achieve this object, the present invention has the following features.
  <Document search device>
The document search apparatus according to the present invention is based on a token that is a word representing one meaning by itself extracted from a word included in a phrase constituting text data, and a specific expression pattern of an attached word included in the phrase. Text data structure storage means for managing the intention expression information added to the obtained phrase in association with each phrase, concept expression specifying means for specifying the concept expression information including at least one token, and the concept expression The phrase associated with the token included in the concept expression information designated by the designation means is identified from the text data structure storage means, the intention expression information associated with the identified phrase, and the designated concept expression And extended concept expression extraction means for extracting extended concept expression information configured by combining the information.
[0018]
  <Document search method>
The document search method according to the present invention is performed by a document search apparatus having text data structure storage means that manages, in association with each other for each phrase, a token, which is a word that represents one meaning by itself and is extracted from the words included in a phrase constituting text data, and intention expression information added to the phrase, which is obtained from a specific expression pattern of the attached words included in the phrase. The method comprises: a concept expression specifying step of specifying concept expression information including at least one token; and an extended concept expression extraction step of identifying, from the text data structure storage means, the phrase associated with the token included in the concept expression information specified in the concept expression specifying step, and extracting extended concept expression information configured by combining the intention expression information associated with the identified phrase and the specified concept expression information.
[0019]
  <Program>
A program according to the present invention causes a computer to execute the document search method described above.
[0052]
DETAILED DESCRIPTION OF THE INVENTION
(Characteristics of the invention)
First, features of the document search apparatus according to the present invention will be described.
The concept expression method in the document search apparatus according to the present invention expresses the concepts included in a text using concept expression basic units, which are extracted based on phrase information, and the relations between those basic units, which are extracted based on inter-phrase relationship information. A concept expression basic unit basically corresponds to a phrase and is expressed by a combination of a token, which is the independent word in the phrase, and intention expressions, which are extracted by matching specific patterns of the attached words in the phrase.
[0053]
This concept expression method can express and specify a concept made up of a plurality of words by connecting concept expression basic units in sequence, for example "latest ⇒ OS ⇒ install (+possible +negation)". The expression is not only easy for the user to understand but also easy for the user to operate on, for example to extend.
[0054]
Using the above concept expression method, the document search apparatus according to the present invention searches the text data for concept expressions that extend a specified concept expression. For example, when the user is interested in the concept expression "understand (+negation)", meaning "do not understand", concept expressions that extend "understand (+negation)" can be retrieved from the text data. An example is given below.
[0055]
(Example)
Specified concept expression: "understand (+negation)"
Retrieved concept expression: "usage ⇒ understand (+negation)"
Retrieved concept expression: "operation ⇒ understand (+negation)"
Retrieved concept expression: "meaning ⇒ understand (+negation)"
Retrieved concept expression: "understand (+negation) ⇒ user"
Retrieved concept expression: "understand (+negation) ⇒ reason"
[0056]
As a result, the user can find out "what is not understood" and "what the 'not understood' modifies". Extending a concept expression narrows the concept semantically. This is effective for understanding the concepts contained in the text data and for finding the necessary concepts among a large number of concepts.
[0057]
Hereinafter, a document search apparatus according to the present invention will be described in detail with reference to the accompanying drawings. FIG. 1 is a configuration diagram of a document search apparatus according to the present invention. FIG. 2 is a structural example of a text data structure. FIG. 3 is an example of information managed by each component of the text data structure. FIG. 4 is an example of a word list. FIG. 5 is an example of phrase management information. FIG. 6 is a flowchart of the concept expression search method.
[0058]
(Conceptual expression)
First, the concept expression used in the present embodiment will be described.
[0059]
The concept expression in the present invention is based on phrases and inter-phrase relation information obtained by linguistic analysis of the text. Morphological analysis and phrase dependency analysis, for example, can be used as the linguistic analysis. Morphological analysis analyzes the words included in the text. Dependency analysis analyzes the phrases included in the text and identifies, as the relation between phrases, which phrase depends on which head phrase.
[0060]
For example, in the case of the text “Software installation cannot be executed normally”, the following information (example 1) can be obtained as a result of language analysis.
[0061]
[Table 1]

[0062]
In (Example 1) above, the mark 「自」 indicates an independent word and the mark 「付」 indicates an attached word. Independent words are words whose part of speech is verb, adjective, noun, and so on, and attached words are words such as particles and auxiliary verbs. A phrase normally consists of one independent word and zero or more attached words. Some analysis methods produce results in which a single phrase includes a plurality of independent words, but this embodiment uses an analysis method that always generates phrases containing exactly one independent word.
[0063]
The concept expression is expressed by the basic unit of the concept expression and the relationship expression between the basic units. The basic unit of concept expression is expressed using tokens and intention expressions.
[0064]
A token is a word that expresses one meaning by itself, and an independent word can be used as a token. For example, in (Example 1) above, "software", "install", "normal", and "execute" are tokens. A token may be written in its surface form or converted to a representative form.
[0065]
An intention expression represents meaning added to a phrase by its attached words, and the intention added to the phrase is determined by extracting a specific expression pattern of the attached words. For example, the expressions 「〜ない」 (auxiliary verb) and 「〜ず」 (auxiliary verb) can be taken to add the meaning "negation" to the phrase, the expression 「〜できる」 (subsidiary verb) the meaning "possible", and the expression 「〜たい」 (auxiliary verb) the meaning "desire". For example, the intention expressions "possible" and "negation" are extracted from the phrase 「実行できない」 ("cannot execute") in (Example 1) above.
[0066]
An intention expression can be written, for example, as "(+negation)" or "(+possible -negation)", where "+XX" indicates that the intention expression is added and "-XX" indicates that it is not added.
[0067]
A basic unit of concept expression is expressed by a token alone, by intention expressions alone, or by a combination of a token and intention expressions, for example as follows.
[0068]
Examples of concept expression basic units: "purchase", "(+possible)", "execute (+possible +negation)"
[0069]
Note that the combination of a token and an intention expression means that the specified token is included in a certain phrase and the specified intention expression is added to that same phrase.
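The "+XX" and "-XX" notation can be read as a matching rule between a basic unit and a phrase: the token, if given, must be the phrase's token, every "+" intention must be attached, and no "-" intention may be attached. The sketch below encodes that reading; the class and field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class Phrase:
    token: str                                   # the single independent word
    intentions: FrozenSet[str] = frozenset()     # intention expressions attached

@dataclass(frozen=True)
class BasicUnit:
    token: Optional[str] = None                  # None: an intention-only unit
    required: FrozenSet[str] = frozenset()       # "+XX": must be attached
    forbidden: FrozenSet[str] = frozenset()      # "-XX": must not be attached

    def matches(self, phrase: Phrase) -> bool:
        """True when the phrase satisfies the token and intention conditions."""
        if self.token is not None and self.token != phrase.token:
            return False
        if not self.required <= phrase.intentions:
            return False
        return not (self.forbidden & phrase.intentions)

# The phrase "cannot execute" from Example 1, glossed in English.
cannot_execute = Phrase("execute", frozenset({"possible", "negation"}))

unit_a = BasicUnit("execute", required=frozenset({"possible"}))
unit_b = BasicUnit(required=frozenset({"possible"}), forbidden=frozenset({"negation"}))
unit_c = BasicUnit("purchase")
```

Here `unit_a` corresponds to "execute (+possible)", `unit_b` to "(+possible -negation)", and `unit_c` to the token-only unit "purchase".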
[0070]
The relationship between the basic units indicates that there is a strong semantic relationship between the basic units. A semantically strong relationship basically means being included in a clause having a dependency relationship. For example, if the relationship between basic units is expressed as "⇒", the conceptual expression "information => search" is "information" in the dependency clause and "search" in the dependency clause in the two clauses in the dependency relationship. "Is included (" search for information ").
[0071]
By using the clause dependency relationship as the relationship between basic units in this way, it is possible to specify not merely that the words co-occur in the text, as with a logical expression of words such as “software & installation” used in document search and the like, but that the basic units appear in the text with a strong semantic relationship.
[0072]
In a clause dependency relationship, a clause that acts as a dependency clause has exactly one receiving clause. A receiving clause, however, can have a plurality of dependency clauses (in (Example 1), clause 4 is the receiving clause of clause 2 and clause 3). Therefore, there are two possible ways of expressing the relationship between basic units in a concept expression: (1) not expressing the inter-phrase relationship of receiving clauses having a plurality of dependency clauses, and (2) expressing that inter-phrase relationship.
[0073]
(1) In the case where the inter-phrase relationship of receiving clauses having a plurality of dependency clauses is not expressed, the concept representation is a simple one-dimensional list representation of the basic unit.
[0074]
[Table 2]

[0075]
(2) When expressing the inter-phrase relationship of receiving clauses having a plurality of dependency clauses, the concept representation is a basic unit tree representation.
[0076]
[Table 3]

[0077]
In the case of (1), the concept expression is easy and understandable for the user, and it is easy to perform operations such as expansion of the expression, but there is a problem that the complicated phrase dependency relation structure cannot be expressed. In the case of (2), a complicated phrase dependency relation structure can be expressed. Although both (1) and (2) can be used, this embodiment will be described as using the expression method (1) that is easy for the user to understand and operate.
[0078]
A conceptual expression example that can be generated from the text of (Example 1) is shown in (Example 2) below.
[0079]
(Example 2)
Conceptual expression 1: Software
Conceptual expression 2: Installation
Conceptual expression 3: Normal
Conceptual expression 4: Execution
Conceptual expression 5: Execution (+ possible)
Conceptual expression 6: Execution (+ cancellation)
Conceptual expression 7: Execution (+ possible + cancellation)
Conceptual expression 8: Software ⇒ Installation
Conceptual expression 9: Installation ⇒ Execution
Conceptual expression 10: Installation ⇒ Execution (+ possible)
Conceptual expression 11: Installation ⇒ Execution (+ cancellation)
Conceptual expression 12: Installation ⇒ Execution (+ possible + cancellation)
Conceptual expression 13: Normal ⇒ Execution
Conceptual expression 14: Normal ⇒ Execution (+ possible)
Conceptual expression 15: Normal ⇒ Execution (+ cancellation)
Conceptual expression 16: Normal ⇒ Execution (+ possible + cancellation)
Conceptual expression 17: Software ⇒ Installation ⇒ Execution
Conceptual expression 18: Software ⇒ Installation ⇒ Execution (+ possible)
Conceptual expression 19: Software ⇒ Installation ⇒ Execution (+ cancellation)
Conceptual expression 20: Software ⇒ Installation ⇒ Execution (+ possible + cancellation)
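As an illustration only, the twenty concept expressions of (Example 2) can be enumerated from the clause structure of (Example 1) roughly as follows. The dictionary layout of `CLAUSES`, the function names, and the choice to vary intention expressions on every unit of a chain are assumptions made for this sketch, not part of the specification.

```python
from itertools import combinations, product

# Assumed encoding of (Example 1):
# clause id -> (token, intention expressions, id of receiving clause or None)
CLAUSES = {
    1: ("Software", (), 2),
    2: ("Installation", (), 4),
    3: ("Normal", (), 4),
    4: ("Execution", ("possible", "cancellation"), None),
}


def unit_variants(token, intentions):
    """The token alone, then the token with each non-empty subset of intentions."""
    variants = [token]
    for size in range(1, len(intentions) + 1):
        for subset in combinations(intentions, size):
            marks = " ".join(f"+ {name}" for name in subset)
            variants.append(f"{token} ({marks})")
    return variants


def dependency_chains(clauses):
    """Yield every chain obtained by following receiving-clause links."""
    for start in sorted(clauses):
        chain, current = [], start
        while current is not None:
            chain.append(current)
            yield list(chain)
            current = clauses[current][2]


def concept_expressions(clauses):
    results = []
    for chain in dependency_chains(clauses):
        choices = [unit_variants(*clauses[cid][:2]) for cid in chain]
        results.extend(" ⇒ ".join(combo) for combo in product(*choices))
    return results


expressions = concept_expressions(CLAUSES)
```

Under these assumptions the sketch yields the same twenty expressions as the list above, although not in the same order.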
[0080]
(Extended search of concept expression)
The document search apparatus according to the present invention uses a concept search method that, given a designated concept expression of the kind described above, extracts from the text the concept expressions that extend it. For example, when the concept expression “installation ⇒ execution” is designated for (Example 1), the concept expressions in (Example 2) that include “installation ⇒ execution” are extracted as extended concepts, as shown in (Example 3) below.
[0081]
(Example 3)
Extended concept expression 1: Installation ⇒ Execution (+ possible)
Extended concept expression 2: Installation ⇒ Execution (+ cancellation)
Extended concept expression 3: Installation ⇒ Execution (+ possible + cancellation)
Extended concept expression 4: Software ⇒ Installation ⇒ Execution
Extended concept expression 5: Software ⇒ Installation ⇒ Execution (+ possible)
Extended concept expression 6: Software ⇒ Installation ⇒ Execution (+ cancellation)
Extended concept expression 7: Software ⇒ Installation ⇒ Execution (+ possible + cancellation)
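The selection shown in (Example 3) can be sketched as a filter over the concept expressions of (Example 2). The tuple encoding, the helper names, and the matching rule (same token, designated intention expressions contained in the candidate's) are assumptions made for illustration.

```python
def unit(token, *intentions):
    """Hypothetical encoding of a basic unit as (token, set of intentions)."""
    return (token, frozenset(intentions))


EXEC = [unit("Execution"), unit("Execution", "possible"),
        unit("Execution", "cancellation"),
        unit("Execution", "possible", "cancellation")]

# The twenty concept expressions of (Example 2), as tuples of basic units.
CANDIDATES = (
    [(unit(t),) for t in ("Software", "Installation", "Normal")]
    + [(e,) for e in EXEC]
    + [(unit("Software"), unit("Installation"))]
    + [(unit("Installation"), e) for e in EXEC]
    + [(unit("Normal"), e) for e in EXEC]
    + [(unit("Software"), unit("Installation"), e) for e in EXEC]
)


def unit_matches(designated, candidate):
    # Same token, and every designated intention is present in the candidate.
    return designated[0] == candidate[0] and designated[1] <= candidate[1]


def is_extension(designated, candidate):
    """True if candidate contains designated as a contiguous run and differs from it."""
    if candidate == designated:
        return False
    span = len(designated)
    for offset in range(len(candidate) - span + 1):
        window = candidate[offset:offset + span]
        if all(unit_matches(d, c) for d, c in zip(designated, window)):
            return True
    return False


designated = (unit("Installation"), unit("Execution"))
extended = [c for c in CANDIDATES if is_extension(designated, c)]
```

With “installation ⇒ execution” designated, the filter keeps the seven expressions listed in (Example 3).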
[0082]
Next, operation processing in the document search apparatus according to the present invention will be described.
As shown in FIG. 1, the document search apparatus according to the present invention includes a text input unit 101, a language analysis unit 102, a token extraction unit 103, an intention expression extraction unit 104, a text data structure storage unit 105, a concept expression designation unit 106, a condition designation unit 107, an extended concept expression extraction unit 108, and an extended concept expression storage unit 109.
[0083]
(Text input)
The text input unit 101 inputs text. If text is already recorded, the text can also be used as input. In addition, although one or a plurality of texts may be input, the following description will be made assuming that a plurality of texts are input.
[0084]
(Language analysis)
The language analysis unit 102 performs morphological analysis and dependency analysis on each input text. Morphological analysis identifies the words included in the text. Dependency analysis identifies the sentences and clauses included in the text, and identifies pairs of clauses in a dependency/receiving relationship as inter-clause relationships. For example, the result of analyzing the sentence “software installation cannot be executed normally” is shown below. Word breaks are indicated by “/”. The label “self” above a word indicates an independent word, and “attached” indicates an attached word.
[0085]
[Table 4]

[0086]
A clause usually contains one independent word. Although some processing methods analyze a single clause so that it may include a plurality of independent words, this embodiment assumes a method in which each clause always contains exactly one independent word.
[0087]
(Token extraction)
The token extraction unit 103 extracts a token from each clause analyzed by the language analysis unit 102. From the word information in the clause, the word whose part of speech is that of an independent word is extracted and used as the token.
[0088]
For the above (text example), the tokens are as follows.
Clause 1 token: Software
Clause 2 token: Installation
Clause 3 token: Normal
Clause 4 token: Execution
[0089]
(Intention expression extraction)
The intention expression extraction unit 104 extracts intention expressions from each clause analyzed by the language analysis unit 102. Specific expression patterns are extracted from the word information in the clause, and intention expression information is generated. For example, the intention expressions “cancellation”, “request”, “question”, and “possible” can be extracted when the following words or expression patterns are included.
[0090]
Intention expression “cancellation”: auxiliary verb “no”, auxiliary verb “z”, auxiliary verb “mai”, auxiliary auxiliary verb “difficult”
Intention expression “request”: auxiliary verb “tai”
Intention expression “question”: final particle “ka”, final particle “ka” + final particle “na”, symbol “?”
Intention expression “possible”: auxiliary verb “can”, auxiliary verb “re”, auxiliary verb “re”
[0091]
In the case of the above (text example), the following are extracted as intention expressions.
[0092]
Clause 4 intention expression: + possible + cancellation
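The token extraction and intention expression extraction steps can be sketched together as follows. Because the analysis table for the example sentence is not reproduced in this text, the word entries in `CLAUSES` are placeholders, and the surface strings in `INTENTION_RULES` are English stand-ins taken from the translated examples; both are assumptions, as are all function names.

```python
# Each word is (surface, category, part of speech); values are placeholders.
CLAUSES = [
    [("software", "independent", "noun"), ("<attached>", "attached", "particle")],
    [("installation", "independent", "noun"), ("<attached>", "attached", "particle")],
    [("normal", "independent", "noun"), ("<attached>", "attached", "particle")],
    [("execute", "independent", "verb"),
     ("can", "attached", "auxiliary verb"),
     ("not", "attached", "auxiliary verb")],
]

# Assumed pattern table: (part of speech, surface) -> intention expression.
INTENTION_RULES = {
    ("auxiliary verb", "not"): "cancellation",
    ("auxiliary verb", "can"): "possible",
    ("auxiliary verb", "tai"): "request",
    ("final particle", "ka"): "question",
    ("symbol", "?"): "question",
}


def extract_token(clause):
    """Return the surface of the first independent word in the clause, if any."""
    for surface, category, _pos in clause:
        if category == "independent":
            return surface
    return None


def extract_intentions(clause):
    """Return intention expressions triggered by attached words, in order."""
    found = []
    for surface, category, pos in clause:
        label = INTENTION_RULES.get((pos, surface))
        if category == "attached" and label and label not in found:
            found.append(label)
    return found


tokens = [extract_token(clause) for clause in CLAUSES]
intentions = [extract_intentions(clause) for clause in CLAUSES]
```

A real implementation would match the Japanese surface forms and multi-word patterns (such as final particle “ka” followed by “na”); the single-word lookup here is a simplification.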
[0093]
(Text data structure storage)
The text data structure storage unit 105 stores the text structure analyzed by the language analysis unit 102, together with the token and intention expression information extracted by the token extraction unit 103 and the intention expression extraction unit 104.
[0094]
The text is stored in a data structure as shown in FIG. 2. Each component shown in FIG. 2 holds the information shown in FIG. 3. In the present embodiment, as shown in FIG. 4, a word list in which a unique identifier is assigned to each word included in the texts is generated, and the words are managed through this list. The word list can also record each word's part-of-speech information and its appearance frequency over the whole text set.
[0095]
FIG. 3 shows the information held by each component shown in FIG. 2. Each text is managed with a unique ID and manages a list of the IDs of the sentences it contains. Each sentence manages its own sentence ID and a list of the clauses it contains. Each clause manages its own clause ID, a list of the IDs of the words it contains, a dependency clause ID list, and a receiving clause ID. A word ID is an ID in the word list shown in FIG. 4. The dependency clause ID list holds the IDs of the clauses that depend on the clause in question.
[0096]
As described for the above (text example), a plurality of clauses can be dependency clauses of one receiving clause, so they are managed in a dependency clause ID list. The receiving clause ID is the ID of the clause on which the clause in question depends; a dependency clause can have only one receiving clause. The clause also manages the token and intention expressions extracted from it.
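A minimal sketch of such a text data structure is shown below. The class and field names are assumptions chosen to mirror the description (text, sentence, clause, word list); the word IDs in the sample data are arbitrary placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WordEntry:
    word_id: int
    surface: str
    part_of_speech: str
    frequency: int = 0
    representative: Optional[str] = None   # synonym representative notation


@dataclass
class Clause:
    clause_id: int
    word_ids: List[int]
    receiving_clause_id: Optional[int] = None    # at most one receiving clause
    dependency_clause_ids: List[int] = field(default_factory=list)
    token: Optional[str] = None
    intentions: List[str] = field(default_factory=list)


@dataclass
class Sentence:
    sentence_id: int
    clauses: List[Clause]


@dataclass
class Text:
    text_id: int
    sentences: List[Sentence]


def link_dependency_clauses(sentence: Sentence) -> None:
    """Derive each clause's dependency clause ID list from receiving clause IDs."""
    by_id = {clause.clause_id: clause for clause in sentence.clauses}
    for clause in sentence.clauses:
        clause.dependency_clause_ids = []
    for clause in sentence.clauses:
        if clause.receiving_clause_id is not None:
            by_id[clause.receiving_clause_id].dependency_clause_ids.append(clause.clause_id)


sentence = Sentence(1, [
    Clause(1, [1, 2], receiving_clause_id=2, token="Software"),
    Clause(2, [3, 4], receiving_clause_id=4, token="Installation"),
    Clause(3, [5, 6], receiving_clause_id=4, token="Normal"),
    Clause(4, [7, 8, 9], token="Execution", intentions=["possible", "cancellation"]),
])
link_dependency_clauses(sentence)
text = Text(1, [sentence])
```

Storing only the receiving clause ID and deriving the dependency clause lists from it keeps the two views consistent; storing both explicitly, as the description suggests, trades that guarantee for faster lookup in either direction.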
[0097]
It is also possible to hold the type of dependency relationship as information managed by the clause, for example, whether the modification is an adnominal modification or an adverbial modification.
[0098]
FIG. 5 shows an example of data held by a phrase.
It is also possible to provide a synonym dictionary and to hold representative notation information for words that have synonyms. This can be realized by having a synonym representative notation as an item in the word list, as shown in FIG. 4.
[0099]
(Concept expression designation)
The concept expression designation unit 106 designates the concept expression for which an extended search is performed. The user may write the concept expression directly using an editor or the like. Alternatively, when a display means that displays a list of the concept expressions included in the text is available, the user can select a concept expression on that display means.
[0100]
(Condition specification)
The condition designation unit 107 designates conditions for performing the extended search.
In the present embodiment, when expansion by intention expression is performed, the user designates which concept expression basic unit in the designated concept expression is to be expanded.
[0101]
In the present embodiment, the user specifies the type of intention expression to be expanded when performing expansion by intention expression. For example, the user can select from “possible”, “cancellation”, “request”, and “question”.
[0102]
Further, in the present embodiment, when the expansion is performed by adding the concept expression basic unit, the user specifies the number of concept expression basic units to be added.
[0103]
In the present embodiment, when the expansion is performed by adding the concept representation basic unit, the user designates the direction in which the concept representation basic unit is added (forward or backward).
[0104]
In the present embodiment, when expansion is performed by adding a concept expression basic unit, the user specifies a token part-of-speech for the concept expression basic unit to be added.
[0105]
Further, in the present embodiment, when the expansion is performed by adding the concept expression basic unit, the user specifies the inter-phrase relationship of the concept expression basic unit to be added. For example, it is selected from “adnominal modification”, “adverbial modification”, “case modification”, “parallel relationship”, “composite relationship”, and the like.
[0106]
(Extended concept expression extraction)
Based on the information stored in the text data structure storage unit 105, the extended concept expression extraction unit 108 extracts concept expressions that extend the concept expression designated by the concept expression designation unit 106.
[0107]
There are two types of concept expression expansion: (1) expansion by addition of intention expression and (2) expansion by addition of basic unit of concept expression. These two types will be described below.
[0108]
(1) Expansion by adding intention expression
First, expansion by adding intention expressions will be described.
Expansion by intention expression extends the designated concept expression by adding intention expressions to the concept expression basic units it contains. In the case of the above (text example), when “installation ⇒ execution” is designated, the following concept expressions are extracted as the concept expressions expanded by intention expression.
[0109]
[Table 5]

[0110]
In addition, expansion by intention expression is performed in the following procedure.
[0111]
(1) (Search for text data structure that matches specified conceptual expression)
From the data structure stored in the text data structure storage means 105, a structure that matches the specified conceptual expression is retrieved. FIG. 6 shows a flowchart of the designated concept expression search process. The flowchart in FIG. 6 is processing for one text, but when processing a plurality of texts, this processing is performed for each text. Hereinafter, the processing operation shown in FIG. 6 will be described.
[0112]
First, for the sentence ID Si, the clause ID Kj, and the designated conceptual expression basic units CEn (n = 1 to N), 1 is substituted into i, j, and n (step S1), and the following processing is performed.
[0113]
First, it is determined whether or not the phrase Kj includes a conceptual expression basic unit CEn (step S2). If it is determined by this determination that the phrase Kj does not include the conceptual expression basic unit CEn (step S2 / NO), it is determined whether or not the phrase Kj is the last phrase of the sentence Si (step S8).
[0114]
If it is determined in step S2 that the clause Kj includes the conceptual expression basic unit CEn (step S2 / YES), the clause Kj is assigned to the variable clause Kx (Kx = Kj) (step S3). Next, it is determined whether n equals N, that is, whether CEn is the last basic unit of the designated conceptual expression (step S4). If n equals N (step S4 / YES), the conceptual expression extraction process is performed (step S5).
[0115]
If it is determined that n is not N (step S4 / NO), n is incremented by one (n = n + 1) (step S6). Then, it is determined whether or not the receiving clause Ky of the clause Kx includes the conceptual expression basic unit CEn (step S7). If the receiving clause Ky includes CEn (step S7 / YES), the process returns to step S4 and the same processing is repeated. If the receiving clause Ky does not include CEn (step S7 / NO), it is determined whether or not the clause Kj is the last clause of the sentence Si (step S8).
[0116]
Next, if it is determined in step S8 that the clause Kj is not the last clause of the sentence Si (step S8 / NO), j is incremented by one (j = j + 1), n is reset to 1, and the process returns to step S2 to repeat the same processing. If it is determined in step S8 that the clause Kj is the last clause of the sentence Si (step S8 / YES), it is determined whether the sentence Si is the last sentence of the text (step S9). If the sentence Si is the last sentence of the text (step S9 / YES), the process ends. If the sentence Si is not the last sentence of the text (step S9 / NO), i is incremented by one (i = i + 1), j and n are reset to 1, and the process returns to step S2 to repeat the same processing.
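The search procedure described for FIG. 6 can be sketched for one text as follows. The dictionary layout, the `contains` rule, and the function names are assumptions. The sketch also assumes that, when the receiving clause Ky matches at step S7, Kx advances to Ky before the next comparison; the description does not state this explicitly.

```python
def contains(clause, unit):
    """Does the clause include the basic unit (token and/or intentions)?"""
    token, intentions = unit
    if token is not None and clause["token"] != token:
        return False
    return set(intentions) <= set(clause["intentions"])


def search(sentences, designated):
    """Return (sentence number, [clause ids]) for each match of the designated expression."""
    matches = []
    for si, sentence in enumerate(sentences, start=1):        # sentences Si (S9)
        for kj in sorted(sentence):                           # clauses Kj (S8)
            chain, kx = [], kj
            for n, unit in enumerate(designated, start=1):    # units CE1..CEN
                if kx is None or not contains(sentence[kx], unit):
                    break                                     # S2 / NO or S7 / NO
                chain.append(kx)
                if n == len(designated):                      # S4 / YES
                    matches.append((si, chain))               # S5
                kx = sentence[kx]["head"]                     # receiving clause Ky
    return matches


# Assumed encoding of the text example: one sentence, four clauses.
TEXT = [{
    1: {"token": "Software", "intentions": (), "head": 2},
    2: {"token": "Installation", "intentions": (), "head": 4},
    3: {"token": "Normal", "intentions": (), "head": 4},
    4: {"token": "Execution", "intentions": ("possible", "cancellation"), "head": None},
}]

found = search(TEXT, [("Installation", ()), ("Execution", ())])
```

For a plurality of texts, the same function would simply be called once per text, as the description notes.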
[0117]
When the processing shown in FIG. 6 is performed on the above (text example) with “installation ⇒ execution” designated as the concept expression, the text data structure “clause 2 ⇒ clause 4” is found to match.
[0118]
(2) (Extraction of extended concept expression based on searched text data structure)
For every retrieved text data structure, extended concept expressions are extracted from the clause information by expanding the intention expressions. By referring to the intention expression information of the retrieved “clause 2” and “clause 4”, extended concept expressions are generated in which intention expressions not included in the designated concept expression are added. Clause 2 has no intention expression information, and clause 4 has the information “+ possible + cancellation”. Therefore, the following three extended concept expressions are extracted, one for each combination of intention expressions.
[0119]
Extended concept expression 1: “Installation ⇒ Execution (+ possible)”
Extended concept expression 2: “Installation ⇒ Execution (+ cancellation)”
Extended concept expression 3: “Installation ⇒ Execution (+ possible + cancellation)”
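The combination step above can be sketched as follows. The function names and the tuple encoding are assumptions; the optional `allowed` argument is an illustrative way to restrict the expansion to intention expression types chosen by the user, as in the condition designation described earlier.

```python
from itertools import combinations, product


def render(units):
    parts = []
    for token, intentions in units:
        marks = " ".join(f"+ {name}" for name in intentions)
        parts.append(f"{token} ({marks})" if marks else token)
    return " ⇒ ".join(parts)


def expand_by_intention(matched, designated, allowed=None):
    """matched / designated: lists of (token, intentions) in the same order."""
    per_unit = []
    for (token, in_clause), (_, given) in zip(matched, designated):
        # Intention expressions present in the clause but not yet designated.
        extra = [name for name in in_clause
                 if name not in given and (allowed is None or name in allowed)]
        options = []
        for size in range(len(extra) + 1):
            for added in combinations(extra, size):
                options.append((token, tuple(given) + added))
        per_unit.append(options)
    original = [(token, tuple(given)) for token, given in designated]
    return [render(combo) for combo in product(*per_unit)
            if list(combo) != original]


matched = [("Installation", ()), ("Execution", ("possible", "cancellation"))]
designated = [("Installation", ()), ("Execution", ())]
extended = expand_by_intention(matched, designated)
only_possible = expand_by_intention(matched, designated, allowed={"possible"})
```

The unchanged combination is dropped so that the designated concept expression itself is not reported as its own extension.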
[0120]
(3) (Recording of extracted extended concept expression)
The extracted extended concept expressions are stored in the extended concept expression storage unit 109. At that time, their appearance frequency and the number of texts in which they appear are counted and managed.
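The two counts mentioned above can be kept, for example, with two counters. The input layout (a mapping from a text ID to the list of expressions extracted from that text) and the sample data are hypothetical.

```python
from collections import Counter


def count_expressions(extracted_per_text):
    """extracted_per_text: {text id: [extended concept expression, ...]}"""
    frequency = Counter()    # total number of occurrences
    text_count = Counter()   # number of distinct texts containing the expression
    for expressions in extracted_per_text.values():
        frequency.update(expressions)
        text_count.update(set(expressions))
    return frequency, text_count


# Hypothetical extraction results for two texts.
sample = {
    "T1": ["Installation ⇒ Execution (+ possible)",
           "Installation ⇒ Execution (+ possible)",
           "Installation ⇒ Execution (+ cancellation)"],
    "T2": ["Installation ⇒ Execution (+ possible)"],
}
frequency, text_count = count_expressions(sample)
```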
[0121]
In the present embodiment, when extracting an extended concept, the user can specify which concept expression basic unit in the designated concept expression is to be extended. In this case, in the process of (2) (extraction of extended concept expression based on searched text data structure), the extended concept expression is extracted based only on the clause information corresponding to the specified concept expression basic unit.
[0122]
In the present embodiment, when extracting an extended concept, the user can specify the type of intention expression to be used for the extension. In this case, in the process of (2) (extraction of extended concept expression based on searched text data structure), the extended concept expression is extracted only for the specified intention expression. For example, in the above (text example), when “+ possible” is specified as the intention expression, only “installation ⇒ execution (+ possible)” is extracted as the extended concept expression.
[0123]
(2) Expansion by adding basic units of concept expression
Next, the expansion by adding the conceptual expression basic unit will be described.
The extension by adding a basic unit of concept expression is to extract an extended conceptual expression in which a new basic unit of concept expression is added to a specified conceptual expression. In the case of the above (example text), when “installation” is designated as the designated concept representation, the following concept representation is extracted as the concept representation expanded by adding the basic unit of the concept representation.
[0124]
[Table 6]

[0125]
Note that any number of concept expression basic units may be added, but the expansion process can be performed repeatedly. Therefore, in general, it is sufficient to extract extended concept expressions in which one concept expression basic unit is added.
[0126]
For example, when the concept expression “installation ⇒ execution” is specified and one concept expression basic unit is added, the following three patterns are conceivable.
Pattern 1: XXX ⇒ Installation ⇒ Execution
Pattern 2: Installation ⇒ Execution ⇒ XXX
Pattern 3: Installation ⇒ XXX ⇒ Execution
In pattern 1 and pattern 2, a concept expression basic unit is simply added before or after the specified concept expression. In pattern 3, however, a new concept expression basic unit is added between the basic units of the specified concept expression, so the specified concept expression itself is altered and its meaning may change. An implementation may therefore choose not to perform the pattern 3 extension. If it does perform it, note that the specified concept expression itself is changed. Alternatively, the pattern 3 extension can be implemented so that it is performed only when the relationship with the added concept expression basic unit is a specific dependency relationship (for example, a composite relationship).
[0127]
The expansion by adding the basic unit of concept expression is performed according to the following procedure.
[0128]
(1) (Generation of extended pattern of specified concept expression)
An extended concept expression pattern is generated by adding a new concept expression basic unit before, after, or between the concept expression basic units included in the specified concept expression. For example, when the concept expression “installation” is specified and one concept expression basic unit is added, the following extended concept expression patterns are generated.
Extended concept expression pattern 1: “XXX ⇒ Installation”
Extended concept expression pattern 2: “Installation ⇒ XXX”
[0129]
When two concept expression basic units are added, the following extended concept expression patterns are generated.
[0130]
Extended concept expression pattern 1: “XXX ⇒ YYY ⇒ Installation”
Extended concept expression pattern 2: “Installation ⇒ XXX ⇒ YYY”
[0131]
Also, for example, when the concept expression “installation ⇒ execution” is specified and one concept expression basic unit is added, the following extended concept expression patterns are generated.
[0132]
Extended concept expression pattern 1: “XXX ⇒ Installation ⇒ Execution”
Extended concept expression pattern 2: “Installation ⇒ Execution ⇒ XXX”
Extended concept expression pattern 3: “Installation ⇒ XXX ⇒ Execution”
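Pattern generation for the case of one added basic unit can be sketched as follows. The function name, the `placeholder` and `allow_inner` parameters, and the list encoding are assumptions; `allow_inner=False` corresponds to an implementation that chooses not to insert a unit between existing units.

```python
def extended_patterns(units, placeholder="XXX", allow_inner=True):
    """Insert one placeholder unit at each permitted position of the expression."""
    patterns = []
    for position in range(len(units) + 1):
        inner = 0 < position < len(units)
        if inner and not allow_inner:
            continue   # skip insertion between existing basic units
        patterns.append(units[:position] + [placeholder] + units[position:])
    return patterns


single = extended_patterns(["Installation"])
pair = extended_patterns(["Installation", "Execution"])
outer_only = extended_patterns(["Installation", "Execution"], allow_inner=False)
```

The patterns come out in insertion-position order (front, between, back), which differs from the numbering used in the lists above. Adding more than one unit could be approached by applying the function repeatedly with a different placeholder, although that is only one possible reading of the repeated expansion mentioned in the description.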
[0133]
When two concept expression basic units are added, the following extended concept expression patterns are generated.
[0134]
Extended concept expression pattern 1: “XXX ⇒ YYY ⇒ Installation ⇒ Execution”
Extended concept expression pattern 2: “XXX ⇒ Installation ⇒ YYY ⇒ Execution”
Extended concept expression pattern 3: “Installation ⇒ XXX ⇒ YYY ⇒ Execution”
Extended concept expression pattern 4: “Installation ⇒ XXX ⇒ Execution ⇒ YYY”
Extended concept expression pattern 5: “Installation ⇒ Execution ⇒ XXX ⇒ YYY”
[0135]
(2) (Search for text data structure that matches the extended concept expression pattern)
From the data structure stored in the text data structure storage means 105, structures that match the extended concept expression patterns generated in (1) (generation of an extended pattern of the specified concept expression) are searched for. FIG. 6 shows a flowchart of the extended concept expression pattern search process. In this search, however, the extended portions (“XXX” and “YYY”) of the patterns generated in (1) are treated as matching any clause. The flowchart in FIG. 6 is processing for one text, but when processing a plurality of texts, this processing is performed for each text.
[0136]
In the case of the above (text example), if “installation” is specified as the designated concept expression and the number of concept expression basic units to be added is 1, the following patterns are generated as the extended concept expression patterns.
[0137]
Extended concept expression pattern 1: “XXX ⇒ Installation”
Extended concept expression pattern 2: “Installation ⇒ XXX”
[0138]
The following text data structures match these patterns.
[0139]
Extended concept expression pattern 1: “Clause 1 ⇒ Clause 2”
Extended concept expression pattern 2: “Clause 2 ⇒ Clause 4”
[0140]
(3) (Extraction of extended concept expression based on searched text data structure)
For all of the retrieved text data structures, an extended concept expression is extracted based on the token information and intention expression information.
[0141]
Text data structure “Clause 1 ⇒ Clause 2”
Extended concept expression 1: “Software ⇒ Installation”
[0142]
Text data structure “Clause 2 ⇒ Clause 4”
Extended concept expression 2: “Installation ⇒ Execution”
Extended concept expression 3: “Installation ⇒ Execution (+ possible)”
Extended concept expression 4: “Installation ⇒ Execution (+ cancellation)”
Extended concept expression 5: “Installation ⇒ Execution (+ possible + cancellation)”
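The pattern search and the subsequent extraction can be sketched together as follows. The encoding of the sentence, the use of `None` as the wildcard for “XXX”, and the `with_intentions` switch are assumptions made for illustration.

```python
from itertools import combinations, product

ANY = None   # stands for an extended portion such as "XXX"

SENTENCE = {
    1: {"token": "Software", "intentions": (), "head": 2},
    2: {"token": "Installation", "intentions": (), "head": 4},
    3: {"token": "Normal", "intentions": (), "head": 4},
    4: {"token": "Execution", "intentions": ("possible", "cancellation"), "head": None},
}


def match_pattern(sentence, pattern):
    """Return clause-ID chains matching the pattern along receiving-clause links."""
    chains = []
    for start in sorted(sentence):
        chain, current = [], start
        for wanted in pattern:
            if current is None:
                break
            if wanted is not ANY and sentence[current]["token"] != wanted:
                break
            chain.append(current)
            current = sentence[current]["head"]
        if len(chain) == len(pattern):
            chains.append(chain)
    return chains


def variants(clause, with_intentions):
    names = clause["intentions"] if with_intentions else ()
    out = [clause["token"]]
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            marks = " ".join(f"+ {name}" for name in subset)
            out.append(f"{clause['token']} ({marks})")
    return out


def extract(sentence, chain, with_intentions=True):
    choices = [variants(sentence[cid], with_intentions) for cid in chain]
    return [" ⇒ ".join(combo) for combo in product(*choices)]


front = match_pattern(SENTENCE, [ANY, "Installation"])
back = match_pattern(SENTENCE, ["Installation", ANY])
extracted = [e for chain in front + back for e in extract(SENTENCE, chain)]
tokens_only = [e for chain in front + back
               for e in extract(SENTENCE, chain, with_intentions=False)]
```

With `with_intentions=False` only the token-level expressions are produced, which keeps the number of extracted expressions small.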
[0143]
When extracting extended concept expressions, all variations of combinations of intention expressions may be generated, but this increases the number of types of extended concept expressions extracted. Therefore, an implementation may perform the extension by adding concept expression basic units using token information only (extracting only extended concept expressions 1 and 2), and then perform “expansion by intention expression” as a separate step.
[0144]
(4) (Recording of extracted extended concept expression)
The extracted extended concept expressions are stored in the extended concept expression storage unit 109. At that time, their appearance frequency and the number of texts in which they appear are counted and managed.
[0145]
In the present embodiment, when extracting an extended concept, the user can specify the number of concept expression basic units to be added. In this case, in the process of (1) (generation of extended pattern of specified concept expression), extended concept expression patterns with the specified number of concept expression basic units added may be generated.
[0146]
Further, in the present embodiment, when the extended concept is extracted, the user can specify the direction (forward or backward) in which the concept expression basic unit is added. In this case, an extended concept expression pattern in the case where a basic unit of concept expression is added in the specified direction may be generated in the process of (1) (generation of extended pattern of specified concept expression).
[0147]
Further, in the present embodiment, when extracting an extended concept, the user can specify the part of speech of the token of the concept expression basic unit to be added. In this case, in the process of (2) (search for text data structure that matches the extended concept expression pattern), the specified part-of-speech information may be used as the matching condition for the extended portions (“XXX”, “YYY”) when the extended concept expression pattern is searched.
[0148]
Further, in the present embodiment, when extracting an extended concept, the user can select the concept expression basic unit to be added by designating the inter-phrase relationship, that is, the relationship between the concept expression basic units. In this case, in the process of (2) (search for text data structure that matches the extended concept expression pattern), the specified inter-phrase relationship information may be used as the matching condition for the extended portions (“XXX”, “YYY”) when the extended concept expression pattern is searched.
[0149]
(Extended concept representation memory)
The extended concept expression storage unit 109 stores the extended concept expressions extracted by the extended concept expression extraction unit 108. At that time, the appearance frequency and the number of texts in which they appear can also be stored.
[0150]
The above-described embodiment is a preferred embodiment of the present invention, and various modifications can be made without departing from the gist of the present invention. For example, it is possible to construct a device that performs the same processing operation as the document search device by causing the information processing device to execute a series of processing operations in the document search device in the present embodiment as a program.
[0151]
[Effects of the invention]
As is apparent from the above description, according to the present invention, the user can search for concepts that semantically narrow a specified concept, and can efficiently understand and search for the concepts included in the text.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a document search apparatus according to the present invention.
FIG. 2 is a diagram showing a structural example of a text data structure in the present embodiment.
FIG. 3 is a diagram illustrating an example of information managed by each component of a text data structure in the present embodiment.
FIG. 4 is a diagram showing an example of a word list in the present embodiment.
FIG. 5 is a diagram showing an example of phrase management information in the present embodiment.
FIG. 6 is a flowchart showing a concept expression search method according to the present embodiment.
[Explanation of symbols]
101 Text input means
102 Language analysis means
103 Token extraction means
104 Intention expression extraction means
105 Text data structure storage means
106 Concept expression designation means
107 Condition specifying means
108 Extended concept expression extraction means
109 Extended concept expression storage means