JPH11282842A


Info

Publication number: JPH11282842A
Application number: JP10103927A
Authority: JP
Inventors: Ikuaki Kobayashi; 生明小林
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 1998-03-30
Filing date: 1998-03-30
Publication date: 1999-10-15

Abstract

PROBLEM TO BE SOLVED: To optimize the search range of a Japanese dictionary so that lookup is more efficient when the dictionary is searched during Japanese morphological analysis, and thereby to increase the processing speed of a Japanese analysis device. SOLUTION: The character types of the kanji (Chinese character) and hiragana (cursive Japanese syllabary) parts of an input Japanese sentence are determined, and the kanji and hiragana are replaced with, for example, the numeric characters 1 and 0, respectively, and stored. Using these numeric characters as keys, the input sentence is divided into runs of the same character type. The input sentence is then divided into character strings consisting of prescribed combinations of character types, and these strings become the search targets. Only dictionary entries with a similar combination of character types are searched, which improves the lookup efficiency of the dictionary.

Description

Translated from Japanese

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a Japanese language analysis apparatus and to a computer-readable recording medium storing a Japanese language analysis program, and more particularly to making Japanese dictionary lookup more efficient in Japanese morphological analysis.

[0002]

[Prior Art] In conventional Japanese language analysis apparatuses, the longest-match method has generally been used as the morphological analysis method for extracting words from a given Japanese character string. In the longest-match method, a kana-kanji character string whose length equals the number of characters in the longest word in the dictionary is first cut out of the Japanese text under analysis and looked up against the words recorded in the dictionary. If a matching string exists, it is recognized as a word. If no matching string exists, the lookup has failed; in that case the last character is removed and the shortened string is looked up again. If that lookup also fails, one more character is removed, and the lookup is repeated until a matching string is found.

[0003] For example, for the sentence 「一の宮は良い天気です」 ("Ichinomiya has good weather"), the 10-character section of the Japanese dictionary is consulted first. If the word 「一の宮は良い天気です」 is not registered there, the string 「一の宮は良い天気で」 is looked up in the 9-character section. If that is not registered either, the same lookup is repeated for 「一の宮は良い天気」, and so on. The word lookup must be repeated until the three-character string 「一の宮」 matches in the dictionary, so a very large number of steps is needed before the word is found.
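The cost of the longest-match method described above can be illustrated with a minimal Python sketch. The function name `longest_match`, the toy dictionary, and the lookup counter are assumptions introduced here for illustration; the patent itself gives no code.

```python
def longest_match(text, dictionary, max_len):
    """Return (word, lookups) for the longest dictionary word at the start of text.

    Starts from the longest allowed length and removes one trailing
    character after every failed lookup, as in the conventional method.
    """
    lookups = 0
    for length in range(min(max_len, len(text)), 0, -1):
        candidate = text[:length]
        lookups += 1
        if candidate in dictionary:
            return candidate, lookups
    return None, lookups


# Hypothetical toy dictionary; max_len=10 stands for the longest entry length.
toy_dictionary = {"一の宮", "良い", "天気"}
word, lookups = longest_match("一の宮は良い天気です", toy_dictionary, max_len=10)
print(word, lookups)  # 一の宮 8
```

With this toy dictionary, eight lookups (lengths 10 down to 3) are needed before 「一の宮」 is found.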

[0004] However, because such a search checks the text against every word listed in the dictionary, it yields an accurate dictionary lookup, and for that reason the longest-match method has been adopted.

[0005]

[Problem to Be Solved by the Invention] However, most of the words actually used in the Japanese text under analysis are far shorter than the longest word registered in the dictionary. For a string that contains many such short words, checking for a match at every length, starting from the longest string length in the dictionary, wastes many lookups and slows the analysis.

[0006] The present invention was made to solve the above problem. By setting a search range that contains no waste, it provides a Japanese language analysis apparatus for Japanese morphological analysis, and a computer-readable recording medium storing a program for such an apparatus, that shorten the search time without causing missed lookups and without lowering the accuracy of the analysis.

[0007]

[Means for Solving the Problem] To achieve this object, the Japanese language analysis apparatus according to claim 1 comprises: input means for inputting a kana-kanji character string; storage means for storing the kana-kanji character string input by the input means; character type determination means for determining the character type (kanji, kana, and so on) of the characters of the stored kana-kanji character string; character type division means for dividing the input kana-kanji character string, based on the character types determined by the character type determination means, at the boundaries where the character type changes, into kanji portions, kana portions, and so on, each consisting of one character or a run of consecutive characters of the same type; a Japanese dictionary storing Japanese words and information on those words; and word search means for looking up in the Japanese dictionary, as a word, the kana-kanji character string delimited at the positions produced by the character type division means. The apparatus thereby performs morphological analysis of Japanese text.

[0008] According to the Japanese language analysis apparatus of claim 1, in the morphological analysis of Japanese text, the input kana-kanji character string is divided, based on the character types determined by the character type determination means, into kanji portions, kana portions, and so on, each consisting of one character or a run of consecutive characters of the same type, and the kana-kanji character string delimited at the division positions is looked up as a word in the Japanese dictionary. This gives an efficient word search with no waste, one that does not consult dictionary words that are longer than necessary and does not miss any lookups.

[0009] The Japanese language analysis apparatus according to claim 2 has, in addition to the configuration of the apparatus of claim 1, the feature that the character type determination means classifies kana into hiragana and katakana, or classifies the character types other than kanji and kana into alphanumeric characters and other symbols, or classifies character types into still more kinds, and that the character type division means divides the input character string based on that classification.

[0010] According to the Japanese language analysis apparatus of claim 2, character types are determined by classifying kana into hiragana and katakana, or by classifying the character types other than kanji and kana into alphanumeric characters and other symbols, or by classifying character types into still more kinds, and the input character string is divided based on that classification. A more accurate and efficient word search is therefore possible.

[0011] The Japanese language analysis apparatus according to claim 3 has, in addition to the configuration of the apparatus of claim 1 or claim 2, second word search means that, when the search by the word search means fails, looks up in the Japanese dictionary, as a word, the kana-kanji character string that remains after one character is removed from the end of the kana-kanji character string divided by the character type division means.

[0012] According to the configuration of the Japanese language analysis apparatus of claim 3, when the search by the word search means fails, the kana-kanji character string that remains after one character is removed from the end of the string divided by the character type division means is looked up as a word in the Japanese dictionary, so no dictionary lookup is missed.
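The fallback search of claim 3 might look like the following Python sketch. The name `search_with_fallback` and the toy dictionary are hypothetical, and repeating the one-character removal until a match is found (rather than doing it once) is an assumption carried over from the longest-match method described in the prior art.

```python
def search_with_fallback(segment, dictionary):
    """Look up a divided segment; on failure drop the last character and retry."""
    candidate = segment
    while candidate:
        if candidate in dictionary:
            return candidate
        candidate = candidate[:-1]  # remove one character from the end
    return None


# Hypothetical toy dictionary that lacks 「一の宮」 but contains 「一」.
toy_dictionary = {"一", "天気"}
found = search_with_fallback("一の宮", toy_dictionary)
print(found)  # 一
```

Because the starting point is the short segment produced by character type division, the fallback only ever walks over a few characters.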

[0013] The Japanese language analysis apparatus according to claim 4 has, in addition to the configuration of the apparatus of any one of claims 1 to 3, the feature that the character type division means includes kana determination means for identifying the specific kana characters before or after which no division is to be made.

[0014] According to the configuration of the Japanese language analysis apparatus of claim 4, the character type division means identifies the specific kana characters before or after which no division is to be made. A character string joined by a specific kana that often links kanji to form a single word is therefore treated as one continuous unit, which makes morphological analysis still more efficient.

[0015] The recording medium according to claim 5 is a computer-readable recording medium storing a Japanese language analysis program for a Japanese language analysis apparatus that performs morphological analysis of Japanese text. The program causes a computer to execute: a procedure for inputting a kana-kanji character string; a procedure for storing the input kana-kanji character string; a character type determination procedure for determining the character type of the characters of the stored string as kanji, kana, and so on; a character type division procedure for dividing the input kana-kanji character string, based on the character types determined by the character type determination procedure, at the boundaries where the character type changes, into character strings of kanji portions, kana portions, and other portions, each consisting of one character or a run of consecutive characters of the same type; and a word search procedure for looking up the kana-kanji character string delimited at the positions produced by the character type division procedure, as a word, in a Japanese dictionary storing Japanese words and information on those words.

[0016] According to the configuration of the recording medium of claim 5, in the morphological analysis of Japanese text by a computer, the computer divides the input kana-kanji character string, based on the character types determined by the character type determination procedure, into kanji portions, kana portions, and so on, each consisting of one character or a run of consecutive characters of the same type, and looks up the string delimited at the division positions as a word in the Japanese dictionary. The computer can thus be made to execute an efficient word search procedure with no waste, one that does not consult dictionary words that are longer than necessary and does not miss any lookups.

[0017] The recording medium according to claim 6 has, in addition to the configuration of the computer-readable recording medium storing the Japanese language analysis program of claim 5, the feature that the character type determination procedure classifies kana into hiragana and katakana, or further classifies the character types other than kanji and kana into alphanumeric characters and other symbols, or classifies character types into still more kinds, and that the character type division procedure divides the input character string based on that classification.

[0018] According to the recording medium of claim 6, the computer determines character types by classifying kana into hiragana and katakana, or by further classifying the character types other than kanji and kana into alphanumeric characters and other symbols, or by classifying character types into still more kinds, and divides the input character string based on that classification. The computer can therefore be made to execute a more accurate and efficient word search procedure.

[0019] The recording medium according to claim 7 has, in addition to the configuration of the computer-readable recording medium storing the Japanese language analysis program of claim 5 or claim 6, a program that causes the computer to execute a second word search procedure. When the search by the word search procedure fails, this procedure looks up in the Japanese dictionary, as a word, the kana-kanji character string that remains after one character is removed from the end of the kana-kanji character string divided by the character type division means.

[0020] According to the configuration of the recording medium of claim 7, when the search by the word search procedure fails, the computer looks up in the Japanese dictionary, as a word, the kana-kanji character string that remains after one character is removed from the end of the string divided by the character type division procedure. The computer can therefore be made to execute a search in which no dictionary entry is missed.

[0021] The recording medium according to claim 8 has, in addition to the configuration of the computer-readable recording medium storing the Japanese language analysis program of any one of claims 5 to 7, a program that causes the computer to execute a kanji division procedure that includes a kana determination procedure for identifying the specific kana characters before or after which no division is to be made.

[0022] According to the recording medium of claim 8, the computer executes a procedure in which the character type division procedure identifies the specific kana characters before or after which no division is to be made. A character string joined by a specific kana that often links kanji to form a single word, such as 「の」 or 「ヶ」, is therefore treated as one continuous unit, which makes morphological analysis still more efficient.

[0023]

[Embodiments of the Invention] The present invention is described below through one embodiment, with reference to the drawings. The Japanese language analysis apparatus of this embodiment includes a computer and, by means of a language analysis program stored in a ROM (a computer-readable recording medium), divides a Japanese character string into portions of consecutive kanji or kana characters and performs Japanese morphological analysis.

[0024] In this application, unless otherwise noted, "kana" refers to both hiragana and katakana. A "kana-kanji character string" is a string that contains at least one of kana, kanji, or other character types; it also covers, for example, strings consisting only of katakana and strings that contain alphanumeric characters. For ease of understanding, the description of this embodiment uses Japanese sentences containing only kanji and hiragana as examples.

[0025] First, a block diagram outlining the Japanese language analysis apparatus of this embodiment is described with reference to FIG. 1. As shown in FIG. 1, the apparatus has a data bus 60, through which the following are connected: an input device 20 corresponding to the input means; a ROM 40, which is a read-only storage device; a RAM 50, which is a readable and writable storage device; an external storage device 70; an I/O port 80; a display device 30 that displays analysis results and the like; an output device 90; and a CPU 10 that controls them.

[0026] The data bus 60 allows the devices that make up this embodiment to exchange information. For example, the CPU 10 accesses the RAM 50 and the ROM 40 via the data bus 60.

[0027] The input device 20 includes a keyboard and a mouse. The Japanese character string to be analyzed is typed on the keyboard and accumulated in the kana-kanji text storage area 51 of the RAM 50, and instruction commands are input to the CPU 10 with the mouse.

[0028] The ROM 40 holds a character type determination program 41 corresponding to the character type determination means, a character type division program 42 corresponding to the character type division means, a Japanese dictionary 43, a non-dividing kana dictionary 44 corresponding to the kana determination means, and a word search program 45 corresponding to the word search means and the second word search means.

[0029] The character type determination program 41 causes the computer to execute a procedure that, for each character of the Japanese character string stored in the kana-kanji text storage area 51, consults a character code table (kuten code numbers, hexadecimal code numbers, or the like) and identifies the character from its code number as kanji, hiragana, or something else. In this embodiment, kanji are recorded as "1" and hiragana as "0", and the result is stored in the character type storage area 54.

[0030] The character type division program 42 divides the character string stored in the character type storage area 54 at the boundaries where the character type changes, separating it into strings in which one or more characters of the same type are consecutive. Each such string is treated as one block. The program then combines these blocks so that a segment contains a prescribed number of blocks, and divides the string accordingly. One block is therefore a string containing one or more of either "1" or "0", for example "1", "11", or "111".

[0031] Specifically, suppose that the character string "110100…" is stored in the character type storage area 54 and that division is to use the combination "kanji, kana, kanji". The string is first separated into blocks of the same character type, "11/0/1/00…". Then, if it has been decided that division from the beginning of the string follows the pattern "kanji, kana, kanji", the string "11/0/1" containing the first three blocks is divided off.
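The block division in this example can be sketched in Python with `itertools.groupby`. The function names `split_blocks` and `first_segment` are hypothetical, and the input is limited to the six digits "110100" that the example spells out (the remainder is elided in the source).

```python
from itertools import groupby


def split_blocks(types):
    """Split a type string such as "110100" into runs of identical digits."""
    return ["".join(run) for _, run in groupby(types)]


def first_segment(types, block_count):
    """Join the first block_count blocks into one search segment."""
    return "".join(split_blocks(types)[:block_count])


blocks = split_blocks("110100")
segment = first_segment("110100", 3)
print("/".join(blocks))  # 11/0/1/00
print(segment)           # 1101
```

The length of the returned segment (here 4) is the number of characters to cut from the start of the original text.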

[0032] As shown in FIG. 3, the Japanese dictionary 43 stores headwords, the parts of speech of the words, and special information in the ROM 40. It is the search area in which the Japanese character strings divided by the character type division program 42 are looked up.

[0033] The non-dividing kana dictionary 44 handles kana such as 「が」, which is often used in the combination "kanji + が + kanji", as in 「希望が丘」 (Kibōgaoka), 「霧が峰」 (Kirigamine), and 「君が代」 (Kimigayo). In such cases 「が」 is stored in the non-dividing kana dictionary 44 as a specific non-dividing kana, and it is cut out together with the kanji before and after it as the Japanese character string to be looked up. Small katakana such as 「ヶ」 and 「ヵ」, and even English symbols such as "&", may also be included.
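One way to picture the non-dividing kana dictionary is the Python sketch below. The set `NON_DIVIDING_KANA`, the helper names, the Unicode range used to detect kanji, and the rule that a listed kana is joined only when it has kanji on both sides are all assumptions made for illustration; the text above says only that such kana are cut out together with the surrounding kanji.

```python
# Hypothetical contents of the non-dividing kana dictionary.
NON_DIVIDING_KANA = {"が", "ヶ", "ヵ"}


def is_kanji(ch):
    """Assumption: treat the CJK Unified Ideographs block as kanji."""
    return "\u4e00" <= ch <= "\u9fff"


def type_string_with_joins(text):
    """Return "1" for kanji and for listed kana flanked by kanji, else "0"."""
    types = []
    for i, ch in enumerate(text):
        flanked = (
            0 < i < len(text) - 1
            and is_kanji(text[i - 1])
            and is_kanji(text[i + 1])
        )
        if is_kanji(ch) or (ch in NON_DIVIDING_KANA and flanked):
            types.append("1")
        else:
            types.append("0")
    return "".join(types)


print(type_string_with_joins("希望が丘"))  # 1111
print(type_string_with_joins("君が代"))    # 111
```

Under this sketch 「希望が丘」 becomes a single block and is looked up as one string. A particle 「が」 that happens to sit between two kanji would be joined as well, so a fallback lookup on shorter strings is still needed.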

[0034] The word search program 45 causes the computer to execute a procedure that looks up the Japanese character string divided by the character type division program 42 in the Japanese dictionary 43 to check whether a word with the same headword exists.

[0035] The control program 46 controls the Japanese language analysis program as a whole. For example, it starts each of the above programs according to stored procedures and controls input and output.

[0036] The RAM 50 provides a kana-kanji text storage area 51 corresponding to the storage means, a search character string storage area 52, a search position storage area 53, a character type storage area 54, and a work area 55.

[0037] The kana-kanji text storage area 51 is a storage buffer that holds the Japanese character string input from the input device 20 as text information.

[0038] The search character string storage area 52 stores the Japanese character string, divided off by the character type division program 42, that is to be looked up in the Japanese dictionary 43.

[0039] The search position storage area 53 records the last position of the character string already searched, so that the next search can start from where the previous one ended.

[0040] The character type storage area 54 stores the string of digits obtained by replacing each character of the input Japanese character string according to its character type, that is, kanji with 1 and hiragana with 0.

[0041] The work area 55 temporarily stores information other than that held in the storage areas above, and is used as needed in each step.

[0042] The external storage device 70 is a hard disk drive in this embodiment. It can hold text that is to undergo Japanese analysis processing as well as text for which the processing has finished.

[0043] The I/O port 80 allows information to be exchanged as needed with other computers, over telephone lines, or over other wired or wireless links. It can be used to input the text to be analyzed and to output the results.

[0044] The display device 30 includes a CRT and displays the input Japanese sentences and the analysis results.

[0045] The output device 90 is a printer in this embodiment and is used, for example, to make hard copies of the language analysis results.

[0046] Next, the flow of the Japanese language analysis apparatus of this embodiment and of the program that performs the Japanese analysis processing is described with reference to FIG. 2.

[0047] When the system is started, the control program 46 in the ROM 40 starts up, the storage areas of the RAM 50 (the work area 55, the kana-kanji text storage area 51, the search character string storage area 52, the search position storage area 53, and the character type storage area 54) are allocated, input from the input device 20 becomes possible, and processing can begin (start). The kana-kanji character string input from the input device 20 is stored in the kana-kanji text storage area 51 (step 21; "step" is abbreviated as S below). As noted above, the sentence input here is, for ease of understanding, a character string containing only kanji and hiragana.

[0048] Next, the character type determination program 41 represents the attribute of each character of the string, hiragana or kanji, with the symbols 0 and 1, respectively (S22). For example, for the input sentence 「一の宮は良い天気です」, the string "1010101100" is stored in the character type storage area 54, as shown in FIG. 4. Each "0" indicates that a hiragana character is at that position, and each "1" indicates a kanji.
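Step S22 can be sketched in Python as follows. The embodiment identifies character types from kuten or hexadecimal code numbers; this sketch instead uses Unicode code point ranges (the Hiragana block and the CJK Unified Ideographs block), which is an assumption, as are the function names and the "?" marker for any other character.

```python
def char_type(ch):
    """Return "1" for kanji, "0" for hiragana, "?" for anything else."""
    if "\u4e00" <= ch <= "\u9fff":   # CJK Unified Ideographs (assumed kanji range)
        return "1"
    if "\u3041" <= ch <= "\u309f":   # Hiragana block
        return "0"
    return "?"


def type_string(text):
    """Map every character of text to its type digit."""
    return "".join(char_type(ch) for ch in text)


types = type_string("一の宮は良い天気です")
print(types)  # 1010101100
```

The result matches the string the text says is stored in the character type storage area 54 for this sentence.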

[0049] Next, character type division processing is performed (S23). This processing refers to the character attributes of the string stored in the character type storage area 54 and determines the position at which a word is cut out. Here, the character type storage area 54 is consulted, and when a "1", that is, a kanji, comes first, the processing looks for the position where hiragana appears after it, kanji appears again, and then hiragana appears. This is the position of the "0" immediately after the initial "101" in the character type storage area 54, that is, the position of the kana that follows "kanji + hiragana + kanji". In the example sentence stored in the kana-kanji text storage area 51, it is the 「は」 that follows 「一＋の＋宮」.
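The search for the cut position in S23 can be sketched in Python. The function name `cut_position` and its `block_count` parameter are hypothetical; the type string and sentence are the ones given in the text.

```python
from itertools import groupby


def cut_position(types, block_count=3):
    """Return the index just past the first block_count same-type blocks."""
    position = 0
    for index, (_, run) in enumerate(groupby(types)):
        if index == block_count:
            break
        position += len(list(run))
    return position


text = "一の宮は良い天気です"
types = "1010101100"          # type string from S22
pos = cut_position(types)     # end of "kanji + hiragana + kanji"
candidate = text[:pos]
print(pos, candidate, text[pos])  # 3 一の宮 は
```

The candidate 「一の宮」 is what would be handed to the dictionary lookup, and the character at the cut position is the 「は」 named in the text.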

[0050] Consider the cases in Japanese where kanji and hiragana are mixed within a single word. When a kanji comes first, the conceivable combinations include "kanji + hiragana", "kanji + hiragana + kanji", "kanji + hiragana + kanji + hiragana", "kanji + hiragana + kanji + hiragana + kanji", and so on without limit. The leading kanji portion is not restricted to one character and may contain several, and the hiragana portion in second place may likewise contain several characters.

[0051] Because of how Japanese words are formed, they often have a kanji stem to which an attached word written in hiragana is added. On the other hand, examining the character types of words with many characters shows that they often consist only of kanji, only of hiragana, or only of katakana. Put the other way around, kanji and hiragana rarely alternate many times within a long word. In most cases there is a kanji portion that forms the base of the word, and the most frequent pattern is for hiragana to be attached to it. It is rare for another kanji portion to follow, and extremely rare for yet another hiragana portion to follow that, so there is little value in looking up such strings as words. Strings in which kanji and hiragana alternate more than this can in most cases still be analyzed when divided as compound words.

[0052] That is, in the present invention, the dictionary search target is cut out of a continuous character string by focusing on character type rather than on character count alone, which enables a dictionary search of unprecedented efficiency. Even if there are strings in which kanji and hiragana alternate many times and which cannot be divided as compound words, they are extremely rare, and it is far more efficient to handle them with a fixed-phrase or idiom dictionary.

[0053] Normally, to select this combination, all words in the Japanese dictionary 43 are examined. Among the words that begin with kanji, the one with the largest number of kana and kanji alternations is found, and its combination is selected. Here it is assumed that "kanji + hiragana + kanji", as in 「一の宮」, is the pattern of the word with the most kana and kanji alternations stored in the Japanese dictionary 43. In other words, no word with the combination "kanji + hiragana + kanji + hiragana", or with more alternations than that, exists in the dictionary. Therefore, starting the search from the "kanji + hiragana + kanji" combination misses no word.
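The pattern selection described above can be sketched as a single pass over the dictionary headwords. This is an illustrative sketch only: the names `run_count` and `max_runs_by_first_type` are hypothetical, the word list is a toy dictionary (「お茶」 is added here purely as an assumed "hiragana + kanji" entry), and kanji are approximated by the CJK Unified Ideographs range.

```python
def is_kanji(ch):
    # Assumption: the CJK Unified Ideographs block stands in for "kanji".
    return "\u4e00" <= ch <= "\u9fff"


def run_count(word):
    """Count runs of consecutive same-type characters (kanji vs. non-kanji)."""
    runs, prev = 0, None
    for ch in word:
        kind = is_kanji(ch)
        if kind != prev:
            runs += 1
            prev = kind
    return runs


def max_runs_by_first_type(words):
    """Largest run count among kanji-first words and among kana-first words."""
    best = {"kanji": 0, "kana": 0}
    for word in words:
        key = "kanji" if is_kanji(word[0]) else "kana"
        best[key] = max(best[key], run_count(word))
    return best


toy_dictionary = ["一の宮", "は", "良い", "天気", "です", "お茶"]
print(max_runs_by_first_type(toy_dictionary))
```

For this toy dictionary the kanji-first maximum is three runs ("kanji + hiragana + kanji") and the kana-first maximum is two ("hiragana + kanji"), which are the patterns the description assumes.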

[0054] The kana-kanji character string 「一の宮」 that has been cut out is then stored in the search character string storage area 52.

[0055] Next, the character string 「一の宮」 stored in the search character string storage area 52 is looked up in the Japanese dictionary 43 (S24). The Japanese dictionary 43 is shown conceptually in FIG. 3: for each Japanese word it stores a headword, a part of speech, and other information. Because the word 「一の宮」 exists in the Japanese dictionary 43, it matches the search target and the search succeeds (S25: YES).

[0056] To decide whether a further search is needed, it is determined whether the retrieved word 「一の宮」 is at the end of the input character string, that is, the string stored in the kana-kanji text storage area 51 (S28). If it is at the end, the Japanese analysis is deemed successful and the analysis processing ends (S28: YES, end). If it is not at the end, a flag is first set at the position of the character following the end of the retrieved word, and that position is stored in the search position storage area 53 (S28). Here, the position following the end of 「一の宮」, namely the fourth character, which is the start position of the next search, is stored in the search position storage area 53.

[0057] Next, character type division processing is performed again to analyze the string that follows (S28: NO, S23). Here the search position storage area 53 is consulted, and the character type division program 42 cuts out a string starting at the position of 「は」 following 「一の宮」. The cut-out described above is applied to the string 「は良い天気です」. Whereas 「一の宮」 began with kanji, this string begins with hiragana, so a hiragana run joined to a kanji run is cut out as a single unit. As in the kanji-first case, the pattern normally chosen is the longest hiragana and kanji combination found among the words in the Japanese dictionary 43 that begin with hiragana. Assuming here that "hiragana + kanji" is the longest such combination, the string 「は良」 is cut out and stored in the search character string storage area 52 (S23). The string 「は良」 is then searched for (S24). As described earlier, this is done by looking for a matching string in the Japanese dictionary 43. Since no such word exists, the search is judged to have failed (S25: NO). One character is therefore removed from the end of the string in the search character string storage area 52 (S26), so that its content becomes 「は」. Next, to decide whether the search can continue, it is checked whether the string length has become 0. Since 「は」 remains, the length is judged not to be 0 and the dictionary search is performed again (S27: NO, S24). This time 「は」 is searched for and is found in the dictionary (S25: YES), and it is then determined whether the retrieved word is at the end of the input string (S28).

[0058] If deleting one character in S26 leaves a single character that is meaningless on its own, such as 「ん」 (for example because of a typing error), the string is not zero-length, so the dictionary search is performed again (S27: NO, S24). Since 「ん」 is not in the Japanese dictionary, the search fails (S25: NO), one more character is deleted (S26), and the string becomes zero characters long. When no characters remain after a deletion, the search can no longer continue, so the search is treated as failed (S27: YES), processing ends (end), and 「ん」 is handled as an unknown character.
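The loop over S24 to S27 described above (search, drop the last character on failure, stop when nothing is left) can be sketched as below. The function name `lookup_with_shrinking` and the toy dictionary are assumptions made for illustration; the step labels in the comments map each line to the flowchart steps named in the text.

```python
def lookup_with_shrinking(candidate, dictionary):
    """Return the longest prefix of `candidate` found in `dictionary`, or None."""
    while candidate:                    # S27: continue only while characters remain
        if candidate in dictionary:     # S24/S25: dictionary search and success check
            return candidate
        candidate = candidate[:-1]      # S26: delete one character from the end
    return None                         # zero length reached: search failed


toy_dictionary = {"一の宮", "は", "良い", "天気", "です"}
print(lookup_with_shrinking("は良", toy_dictionary))
print(lookup_with_shrinking("ん", toy_dictionary))
```

The first call falls back from 「は良」 to 「は」; the second runs out of characters and returns `None`, corresponding to the unknown-character case.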

[0059] In the same way, the words 「良い」, 「天気」 and 「です」 are retrieved. When the final string 「です」 is retrieved, it is at the end of the kana-kanji text storage area 51, so the whole sentence is judged to have been searched successfully (S28: YES), and processing ends with the Japanese analysis deemed successful (end).

[0060] The character string 「一の宮は良い天気です」 is then output as the analysis result 「一の宮／は／良い／天気／です」, as shown in FIG. 4.
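Putting the steps together, the walk-through of the example sentence can be reproduced with a short sketch. It is written under stated assumptions rather than taken from the patent: the function names are hypothetical, the dictionary is a toy set (with 「お茶」 added as an assumed "hiragana + kanji" entry so that the kana-first pattern has two runs), kanji are approximated by the CJK Unified Ideographs range, and every non-kanji character is treated as kana.

```python
def is_kanji(ch):
    # Assumption: U+4E00..U+9FFF approximates the kanji set.
    return "\u4e00" <= ch <= "\u9fff"


def run_count(word):
    """Count runs of consecutive same-type characters."""
    runs, prev = 0, None
    for ch in word:
        kind = is_kanji(ch)
        if kind != prev:
            runs, prev = runs + 1, kind
    return runs


def segment(text, dictionary):
    """Split `text` into dictionary words; return None if analysis fails."""
    # Longest run pattern among kanji-first (True) and kana-first (False) words.
    max_runs = {True: 1, False: 1}
    for word in dictionary:
        first = is_kanji(word[0])
        max_runs[first] = max(max_runs[first], run_count(word))

    result, pos = [], 0
    while pos < len(text):
        # S23: cut out up to max_runs same-type runs starting at pos.
        limit, end, runs = max_runs[is_kanji(text[pos])], pos, 0
        while end < len(text) and runs < limit:
            kind = is_kanji(text[end])
            while end < len(text) and is_kanji(text[end]) == kind:
                end += 1
            runs += 1
        candidate = text[pos:end]
        # S24 to S27: search, shortening from the end on failure.
        while candidate and candidate not in dictionary:
            candidate = candidate[:-1]
        if not candidate:
            return None          # S27: YES, nothing left to search
        result.append(candidate)
        pos += len(candidate)    # S28: next search starts after the match
    return result


toy_dictionary = {"一の宮", "は", "良い", "天気", "です", "お茶"}
print("/".join(segment("一の宮は良い天気です", toy_dictionary)))
```

On this input the candidates cut out are 「一の宮」, 「は良」, 「良い天気」, 「天気です」 and 「です」, and shortening from the end yields the five words of the analysis result.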

[0061] As is clear from the above description, by determining the search target from the combination of kanji and hiragana in Japanese words, as in this embodiment, and starting the dictionary search from that position, searches from wasteful positions can be omitted and the analysis processing can be sped up.

[0062] This embodiment does not incorporate the conventional longest-match processing, in which the number of characters n of the longest word in the Japanese dictionary 43 is taken and the dictionary search begins from that length. However, the device may be configured to compare n with the length of the kana-kanji string cut out by the kana and kanji combination in the character type division processing of S23 in FIG. 2, and to cut out the string at the shorter of the two positions.

[0063] For hiragana that readily connect what precedes and follows them, such as the 「の」 in 「一の宮」, the 「が」 in 「千鳥が淵」, and the 「ヶ」 in 「希望ヶ丘」, these kana characters may be stored in the non-dividing kana dictionary 44 and treated in the same way as kanji, so that the character type determination processing of FIG. 2 also stores "1", the value for kanji, in the character type storage area 54. In that case, for the example sentence 「一の宮は良い天気です」 of this embodiment, the 「の」 in 「一の宮」 is regarded as kanji, so the part first cut out is 「一の宮は良」.
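The effect of the non-dividing kana dictionary on the character type flags can be sketched as follows. The set `NON_DIVIDING_KANA` and the function names are hypothetical, and the kanji test again assumes the CJK Unified Ideographs range; the sketch only illustrates how flagging 「の」 as "1" moves the first cut.

```python
# Assumed contents of the non-dividing kana dictionary, taken from the examples in the text.
NON_DIVIDING_KANA = frozenset({"の", "が", "ヶ"})


def type_flags(text, non_dividing=frozenset()):
    """Flag kanji and listed non-dividing kana as '1', everything else as '0'."""
    return "".join(
        "1" if ("\u4e00" <= ch <= "\u9fff" or ch in non_dividing) else "0"
        for ch in text
    )


def first_cut(flags, max_runs=3):
    """Index just past the first `max_runs` runs of identical flags."""
    pos, runs = 0, 0
    while pos < len(flags) and runs < max_runs:
        run_type = flags[pos]
        while pos < len(flags) and flags[pos] == run_type:
            pos += 1
        runs += 1
    return pos


text = "一の宮は良い天気です"
plain = type_flags(text)
merged = type_flags(text, NON_DIVIDING_KANA)
print(plain, text[:first_cut(plain)])
print(merged, text[:first_cut(merged)])
```

Without the non-dividing set the first candidate is 「一の宮」; with it, 「一の宮」 becomes a single "kanji" run and the "kanji + hiragana + kanji" pattern extends to 「一の宮は良」, as the paragraph states.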

[0064] In S27 of FIG. 2, when the length of the cut-out string becomes 0 the determination is YES and the analysis fails. However, if there is a string for which the search in S25 has already succeeded, a so-called backtracking process that re-searches that string may also be provided. In the example of the above embodiment, the already-matched string 「一の宮」 could be cut again and 「一の」 searched for instead, so that the subsequent connection succeeds. Backtracking itself is already known, and combining such well-known techniques can improve efficiency.
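One way to picture the backtracking variation is a recursive search that, when the rest of the sentence cannot be analyzed, returns to the previous match and tries a shorter prefix. The sketch below is an assumption-laden illustration, not the patent's procedure: the function names, the fixed `max_runs` parameter, and the toy dictionary and input (chosen so that the greedy match fails) are all hypothetical.

```python
def is_kanji(ch):
    # Assumption: U+4E00..U+9FFF approximates the kanji set.
    return "\u4e00" <= ch <= "\u9fff"


def cut_end(text, pos, max_runs):
    """Index just past `max_runs` same-type runs starting at `pos`."""
    end, runs = pos, 0
    while end < len(text) and runs < max_runs:
        kind = is_kanji(text[end])
        while end < len(text) and is_kanji(text[end]) == kind:
            end += 1
        runs += 1
    return end


def segment_backtracking(text, dictionary, max_runs, pos=0):
    """Segment `text`, retrying shorter matches when a later step fails."""
    if pos == len(text):
        return []
    candidate = text[pos:cut_end(text, pos, max_runs)]
    while candidate:
        if candidate in dictionary:
            rest = segment_backtracking(text, dictionary, max_runs, pos + len(candidate))
            if rest is not None:
                return [candidate] + rest
            # The remainder failed: fall through and try a shorter match here.
        candidate = candidate[:-1]
    return None


toy_dictionary = {"東京", "東", "京都", "に"}
print(segment_backtracking("東京都に", toy_dictionary, max_runs=1))
```

Here the first match 「東京」 leaves 「都」, which is not in the toy dictionary, so the search backs up, matches 「東」 instead, and then succeeds with 「京都」 and 「に」.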

[0065] The present invention is not limited to the embodiment described in detail above, and various modifications can be made without departing from its gist.

[0066] For example, this embodiment has been described using only sentences in which kanji and hiragana are mixed, but the idea of the invention also applies when characters are classified into further types, for example hiragana and katakana, and, for the remaining characters, alphanumeric characters, descriptive symbols, and other symbols.
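A finer classification of this kind can be sketched with Unicode ranges. The type names, the ranges chosen, and the sample string are assumptions for illustration: hiragana is taken as U+3041 to U+309F, katakana as U+30A0 to U+30FF, kanji as U+4E00 to U+9FFF, and "alnum" as ASCII letters and digits only.

```python
from itertools import groupby


def char_type(ch):
    """Classify one character into an assumed set of character types."""
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if "\u3041" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if ch.isascii() and ch.isalnum():
        return "alnum"
    return "other"


def split_runs(text):
    """Split text at every boundary where the character type changes."""
    return [(kind, "".join(chars)) for kind, chars in groupby(text, key=char_type)]


print(split_runs("ABC社のテスト2件"))
```

Each resulting run is one same-type segment, so the run-counting and cut-out logic of the embodiment carries over unchanged to more than two character types.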

[0067] For the hiragana 「の」 and 「が」, the case where they join two kanji to form a single noun has been described. Similarly, the katakana 「ヶ」 and 「ヵ」 often form a single word together with the kanji or hiragana before and after them. Furthermore, 「・」, 「−」, 「＆」 and the like often form a single word together with various character types before and after them. These may therefore also be left unclassified as character types of their own and processed as one unit with the preceding and following character types.

[0068] In this embodiment the input device 20 consists of a keyboard and a mouse, but all that matters is that the text to be analyzed can be read in. Other input means may be used, for example text read in via the I/O port 80 over a wireless or wired connection, text input from a recording medium via the built-in or external storage device 70 (such as a floppy disk drive or hard disk drive), or text recognized from speech.

[0069] Further, in the language analysis device of this embodiment, the character type determination program 41, the character type division program 42, the Japanese dictionary 43, the non-dividing kana dictionary 44, the word search program 45, and the control program 46 are stored in advance in the ROM 40, but the present invention is not necessarily limited to this. For example, these programs need not be stored as clearly separate units. It suffices that a part having each function exists, and the programs may be intermingled.

[0070] The recording medium that stores the programs and dictionaries need not be the ROM 40. It suffices that the programs are stored so that a computer can read them. They can be stored on a computer-readable recording medium such as a floppy disk or CD-ROM and run by reading that medium with a reading device. A program can also be read in from an external information processing device over a wired or wireless line and run. In this case, the floppy disk or CD-ROM, a hard disk built into or attached to the computer, or the memory of the external information processing device that stores the program constitutes the recording medium of the present invention.

[0071] In other words, the invention can be practiced not only with a dedicated language analysis device whose programs are stored in the ROM 40, as in this embodiment, but also by storing the contents of the ROM 40 of this embodiment on some recording medium and having a general-purpose computer read that medium.

[0072] The display means is likewise not limited to a CRT. A liquid crystal display or any other type may be used as long as the content can be displayed. The output means is not limited to a printer as long as output is possible. Output may, for example, be made over a wired or wireless line via the I/O port 80, as audio, or via a recording medium.

[0073] As is clear from the above description, according to the Japanese analysis device and the recording medium for Japanese analysis, the number of dictionary searches is reduced by starting the search from a shorter position in the character string, which speeds up the analysis.

【００７４】[0074]

[Effects of the Invention] According to the Japanese analysis device of claim 1, in the morphological analysis of Japanese text, the input kana-kanji character string is divided at the boundaries where the character type changes, based on the character types determined by the character type determination means, into kanji parts, kana parts and the like, each consisting of one character or several consecutive characters of the same type, and the kana-kanji character string delimited at the divided positions is searched for as a word in the Japanese dictionary. As a result, unlike a longest-match search, which searches sequentially starting from the longest string in the Japanese dictionary, there is no need to refer to unnecessarily long dictionary words, and an efficient word search can be performed without waste over a range that is sufficient to miss nothing from the standpoint of kana and kanji combinations.

[0075] In addition to the effects of the Japanese analysis device of claim 1, the Japanese analysis device of claim 2 enables a more accurate and efficient word search, because the character type determination means classifies kana into hiragana and katakana, or classifies character types other than kanji and kana into alphanumeric characters and other symbols, or classifies character types into still more kinds, and the string is divided according to that determination.

[0076] According to the Japanese analysis device of claim 3, in addition to the effects of the Japanese analysis device of claim 1 or claim 2, when the search by the word search means fails, the kana-kanji character string that remains after one character is removed from the end of the string divided by the character type division means is searched for as a word in the Japanese dictionary, so that no word in the Japanese dictionary is missed.

[0077] In addition to the effects of the Japanese analysis device of any one of claims 1 to 3, the Japanese analysis device of claim 4 enables still more efficient morphological analysis, because the character type division means identifies specific kana characters before or after which no division is to be made, so that a string joined by a specific kana that often links kanji into a single word is treated as one continuous unit.

[0078] According to the recording medium of claim 5, a computer can be made to execute, in the morphological analysis of Japanese text, a procedure that divides the input kana-kanji character string at the boundaries where the character type changes, based on the character types determined by the character type determination procedure, into kanji parts, kana parts and the like, each consisting of one character or several consecutive characters of the same type, and searches the Japanese dictionary, as a word, for the kana-kanji character string delimited at the divided positions. As a result, an efficient word search can be performed without waste and without omissions, and without referring to unnecessarily long words in the Japanese dictionary.

[0079] According to the recording medium of claim 6, in addition to the effects of the computer-readable recording medium of claim 5 on which the Japanese analysis program is recorded, a computer can be made to execute a procedure that classifies kana into hiragana and katakana, or further classifies character types other than kanji and kana into alphanumeric characters and other symbols, or classifies character types into still more kinds, determines the character types accordingly, and divides the input character string based on that classification, so that a more accurate and efficient word search can be performed.

[0080] With the recording medium of claim 7, in addition to the effects of the computer-readable recording medium of claim 5 or claim 6 on which the Japanese analysis program is recorded, when the search by the word search procedure fails, a computer can be made to search the Japanese dictionary, as a word, for the kana-kanji character string that remains after one character is removed from the end of the string divided by the character type division procedure, so that an accurate search that misses nothing in the Japanese dictionary can be performed.

[0081] According to the recording medium of claim 8, in addition to the effects of the computer-readable recording medium of any one of claims 5 to 7 on which the Japanese analysis program is recorded, the character type division procedure identifies specific kana characters before or after which no division is to be made, and a computer is made to execute a procedure that treats a string joined by a specific kana that often links kanji into a single word as one continuous unit, so that morphological analysis can be performed still more efficiently.

【図面の簡単な説明】[Brief description of the drawings]

FIG. 1 is a block diagram showing an outline of the Japanese analysis device of this embodiment.

FIG. 2 is a flowchart showing the operation of the Japanese analysis device of this embodiment.

FIG. 3 is a diagram showing an example of the Japanese dictionary of the Japanese analysis device of this embodiment.

FIG. 4 is an explanatory diagram showing the kana-kanji text storage area and the character type storage area of the recording medium of the Japanese analysis device of this embodiment, together with the analysis result.

【符号の説明】[Explanation of symbols]

10 CPU
20 Input device
30 Display device
40 ROM
41 Character type determination program
42 Character type division program
43 Japanese dictionary
44 Non-dividing kana dictionary
45 Word search program
46 Control program
50 RAM
51 Kana-kanji text storage area
52 Search character string storage area
53 Search position storage area
54 Character type storage area
55 Work area
60 Data bus
70 External storage device
80 I/O port
90 Output device

【手続補正書】[Procedure amendment]

【提出日】平成１０年７月２４日[Submission date] July 24, 1998

【手続補正１】[Procedure amendment 1]

【補正対象書類名】明細書 [Name of document to be amended] Specification

【補正対象項目名】全文[Correction target item name] Full text

【補正方法】変更[Correction method] Change

【補正内容】[Correction contents]

【書類名】明細書 [Document name] Specification

【発明の名称】[Title of the Invention] Japanese analysis device and computer-readable recording medium on which a Japanese analysis program is recorded


Claims

Translated from Japanese

【特許請求の範囲】[Claims]

[Claim 1] A Japanese analysis device that performs morphological analysis of Japanese text, comprising:
input means for inputting a kana-kanji character string;
storage means for storing the kana-kanji character string input by the input means;
character type determination means for determining the character types, such as kanji and kana, of the kana-kanji character string stored in the storage means;
character type division means for dividing the input kana-kanji character string, based on the character types determined by the character type determination means, at the boundaries where the character type changes, into kanji parts, kana parts and the like, each consisting of one character or a plurality of consecutive characters of the same type;
a Japanese dictionary storing Japanese words and information on those words; and
word search means for searching the Japanese dictionary, as a word, for the kana-kanji character string delimited at the positions divided by the character type division means.

[Claim 2] The Japanese analysis device according to claim 1, wherein the character type determination means classifies the kana into hiragana and katakana, or classifies character types other than the kanji and kana into alphanumeric characters and other symbols, or classifies character types into still more kinds, and the character type division means divides the input character string based on that classification.

[Claim 3] The Japanese analysis device according to claim 1 or claim 2, further comprising second word search means for, when the search by the word search means fails, searching the Japanese dictionary, as a word, for the kana-kanji character string that remains after one character is removed from the end of the kana-kanji character string divided by the character type division means.

[Claim 4] The Japanese analysis device according to any one of claims 1 to 3, wherein the character type division means comprises kana determination means for identifying specific kana characters before or after which no division is to be made.

[Claim 5] A computer-readable recording medium on which is recorded a Japanese analysis program for a Japanese analysis device that performs morphological analysis of Japanese text, the program causing a computer to execute:
a procedure for inputting a kana-kanji character string;
a procedure for storing the input kana-kanji character string;
a character type determination procedure for determining the character types, such as kanji and kana, of the stored kana-kanji character string;
a character type division procedure for dividing the input kana-kanji character string, based on the character types determined by the character type determination procedure, at the boundaries where the character type changes, into character strings of kanji parts, kana parts and other parts, each consisting of one character or a plurality of consecutive characters of the same type; and
a word search procedure for searching a Japanese dictionary storing Japanese words and information on those words, as a word, for the kana-kanji character string delimited at the positions divided by the character type division procedure.

[Claim 6] The computer-readable recording medium on which the Japanese analysis program is recorded according to claim 5, wherein the character type determination procedure classifies the kana into hiragana and katakana, or further classifies character types other than the kanji and kana into alphanumeric characters and other symbols, or classifies character types into still more kinds, and the character type division procedure divides the input character string based on that classification.

[Claim 7] The computer-readable recording medium according to claim 5 or claim 6, on which is recorded a Japanese analysis program further comprising a program that causes the computer to execute a second word search procedure for, when the search by the word search procedure fails, searching the Japanese dictionary, as a word, for the kana-kanji character string that remains after one character is removed from the end of the kana-kanji character string divided by the character type division procedure.

[Claim 8] The computer-readable recording medium according to any one of claims 5 to 7, on which is recorded a Japanese analysis program further comprising a program that causes the computer to execute a kanji division procedure including a kana determination procedure for identifying specific kana characters before or after which no division is to be made.