JP6232774B2

Info

Publication number: JP6232774B2
Application number: JP2013133481A
Authority: JP
Inventors: 元博赤石沢
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2013-06-26
Filing date: 2013-06-26
Publication date: 2017-11-22
Anticipated expiration: 2033-06-26
Also published as: JP2015007943A

Description

The present invention relates to a morphological analyzer, a morphological analysis method, and a morphological analysis program that use a word dictionary.

A "morphological analyzer" is a device that performs "morphological analysis", the process of dividing an input character string into the sequence of morphemes that make it up. Here, a "morpheme" is the smallest meaningful unit of a language: a character string (a sequence of character codes or the like) obtained by dividing a sentence written, entered, or spoken in that language down to the point where any further division would leave it meaningless. It is a word whose part-of-speech type has been identified. A character string that can take several part-of-speech types is treated as a different morpheme for each type, even though it is a single character string (each pair of character string and part-of-speech type is represented by, for example, a code assigned in the dictionary). For an entered or spoken sentence, a morpheme may be handled not as a character string but as a sequence of codes or the like representing inputs, phonemes, and so on.

There are various methods of morphological analysis. One, for example, selects as the optimal division the morpheme sequence whose total cost over the whole sequence is smallest. In this method, the input character string is divided into a "morpheme sequence" containing a plurality of morphemes. Varying the division positions within the input character string yields a plurality of morpheme sequences.

A cost (also called a "connection cost") is defined for the connection between two morphemes contained in each morpheme sequence. The "cost" is an index used to select the most appropriate of the morpheme sequences. The morpheme sequence with the smallest total cost is then selected from among them.
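The selection rule just described can be written compactly. The notation below is ours and is not used in the patent: s is the input character string, M(s) is the set of morpheme sequences obtained by varying the division positions, and c(m_i, m_{i+1}) is the connection cost between two neighbouring morphemes.

```latex
C(m_1,\dots,m_n) \;=\; \sum_{i=1}^{n-1} c(m_i, m_{i+1}),
\qquad
\hat{M} \;=\; \operatorname*{arg\,min}_{(m_1,\dots,m_n)\,\in\,\mathcal{M}(s)} C(m_1,\dots,m_n)
```

The prior-art analyzer of Patent Document 1 described next also adds a per-word cost to this sum; that term is omitted here because this paragraph defines only connection costs.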

An example of a morphological analyzer is disclosed in Patent Document 1. The morphological analyzer of Patent Document 1 includes a morphological analysis unit, a word dictionary, a connection table, a learning dictionary, and an analysis error correction unit. The morphological analysis unit generates phrase (morpheme sequence) candidates for a character string designated by the user (the input character string), based on the word dictionary, which holds word costs, and the connection table, which holds connection costs. That is, it uses a minimum cost method in which the sum of the word costs and connection costs of an analysis candidate measures the likelihood of that candidate. The analysis error correction unit creates the learning dictionary from the correct phrase candidate that the user selects from the candidate results. At the next morphological analysis, the analyzer refers to the learning dictionary and reflects what it has learned by raising the priority of any analysis result that contains a registered word or a registered concatenation of two words (a sequence of adjacent words). Specifically, during cost calculation the analyzer forcibly sets to "0" the cost of each word, or each two-word concatenation, registered in the learning dictionary, thereby raising the priority of analysis results that contain the learned unit.

As a result of this operation, the morphological analyzer of Patent Document 1 becomes more likely to select, as its result, an analysis candidate that contains a word or a two-word concatenation registered in the learning dictionary.

Patent Document 2 discloses a "high-speed morphological analyzer" that uses, as training data, the results produced by a "low-speed morphological analyzer" such as that of Patent Document 1. "Low-speed morphological analysis" is morphological analysis that uses a morpheme (word) dictionary. "High-speed morphological analysis" is morphological analysis that uses a character-based probability model (a statistical database), and it usually runs faster than low-speed morphological analysis. The morphological analyzer of Patent Document 2 automatically converts the results of low-speed morphological analysis into training data for high-speed morphological analysis. After learning from the training data, it analyzes input sentences by high-speed morphological analysis. Once it has learned from a large amount of accurate training data, it performs morphological analysis with almost the same accuracy as low-speed morphological analysis.

Patent Document 1: Japanese Unexamined Patent Application Publication No. H09-114825 (pages 3-4, FIGS. 1-4)
Patent Document 2: Japanese Patent No. 3939264 (pages 3-7, FIGS. 1, 2, and 11)

The morphological analyzer of Patent Document 1 forcibly gives priority to any word, or any concatenation of two words, registered in the learning dictionary. Consequently, for a sentence that contains such a registered word or two-word concatenation but for which an analysis based on that entry is inappropriate, the analyzer is likely to output an inappropriate result. In other words, the morphological analyzer of Patent Document 1 has the problem that high-accuracy morphological analysis is difficult to achieve.

For the morphological analyzer of Patent Document 2 to perform morphological analysis with almost the same accuracy as low-speed morphological analysis, it needs a large amount of accurate low-speed morphological analysis results as training data. In other words, the morphological analyzer of Patent Document 2 has the problem that high-accuracy results of low-speed morphological analysis must be obtained efficiently.
(Object of the invention)
An object of the present invention is to provide a morphological analyzer, a morphological analysis method, and a morphological analysis program that, in morphological analysis using a word dictionary, can efficiently correct the information registered in the word dictionary and achieve high-accuracy analysis.

The morphological analyzer of the present invention comprises: a morphological analysis unit that performs morphological analysis of text, treating each compound word as a single word, based on a word dictionary in which words, compound words that are sequences of predetermined words, and word-cut information on the positions at which each compound word is divided into its predetermined words are registered; and a morpheme subdivision unit that, based on the word dictionary, divides each compound word contained in the result of the morphological analysis into its predetermined words.

The morphological analysis method of the present invention comprises: performing morphological analysis of text, treating each compound word as a single word, based on a word dictionary in which words, compound words that are sequences of predetermined words, and word-cut information on the positions at which each compound word is divided into its predetermined words are registered; and dividing, based on the word dictionary, each compound word contained in the result of the morphological analysis into its predetermined words.

The morphological analysis program of the present invention causes a computer of a morphological analyzer, the analyzer having a word dictionary in which words, compound words that are sequences of predetermined words, and word-cut information on the positions at which each compound word is divided into its predetermined words are registered, to function as: a morphological analysis unit that performs morphological analysis of text, treating each compound word as a single word, based on the word dictionary; and a morpheme subdivision unit that, based on the word dictionary, divides each compound word contained in the result of the morphological analysis into its predetermined words.

According to the present invention, in morphological analysis using a word dictionary, the information registered in the word dictionary can be corrected efficiently and high-accuracy analysis can be achieved.

FIG. 1 is a block diagram showing an example of the configuration of the morphological analyzer in the first embodiment of the present invention.
FIG. 2 is a block diagram showing an example of the hardware configuration of the morphological analyzer in the first embodiment of the present invention.
FIG. 3 is a flowchart showing the operation of the morphological analyzer in the first embodiment of the present invention.
FIG. 4 is a diagram for explaining a specific example of the processing procedure of the morphological analyzer in the first embodiment of the present invention.
FIG. 5 is a table for explaining a specific example of the cost calculation of the morphological analyzer in the first embodiment of the present invention.
FIG. 6 is a diagram for explaining a specific example of the processing procedure of the morphological analyzer in the second embodiment of the present invention.
FIG. 7 is a table for explaining a specific example of the cost calculation of the morphological analyzer in the second embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings. In all the drawings, equivalent components are given the same reference numerals, and their description is omitted where appropriate.
(First embodiment)
FIG. 1 is a block diagram showing an example of the configuration of the morphological analyzer in the present embodiment.

The morphological analyzer 110 divides an input character string into a sequence of morphemes (hereinafter also called "words"). The input character string is a sentence, or part of a sentence, written, entered, or spoken in some language. For an entered or spoken sentence or part of a sentence, a morpheme may be not a character string but a sequence of codes or the like representing inputs, phonemes, and so on. Hereinafter, a sequence of morphemes is also referred to simply as an "input character string", a "sentence", or "text".

The morphological analyzer 110 includes a morphological analysis unit 120, a morpheme subdivision unit 130, and a word dictionary 140.

The word dictionary 140 holds information on registered words. The words registered in the word dictionary 140 include not only words that are morphemes belonging to any part of speech (hereinafter simply "words") but also sequences of words (hereinafter "compound words"). When a compound word is registered in the word dictionary 140, the word dictionary 140 also contains information on the positions at which the compound word is divided into the words it contains (hereinafter "word-cut information"). The word-cut information is registered in association with the compound word. The words contained in a compound word need not themselves be registered in the word dictionary.

The morphological analysis unit 120 performs morphological analysis of text based on the word dictionary, treating each compound word as a single word. That is, when the morphological analysis unit 120 selects a compound word as a word, it does not divide the compound word further into the words it contains.

The morphological analysis unit 120 may use any method of morphological analysis that is carried out based on the word dictionary 140. For example, based on the connection costs between words defined in the word dictionary, it may take as its result the sequence of words that minimizes the total connection cost between the words in the sentence. When the morphological analysis unit 120 performs morphological analysis based on such connection costs between words, it holds the definitions of those connection costs.

The morpheme subdivision unit 130 divides each compound word contained in the intermediate result of the morphological analysis by the morphological analysis unit 120 into words, based on the word-cut information and the like contained in the word dictionary 140. Hereinafter, dividing a compound word into words is called "subdivision".

The functions of the morphological analyzer 110 may be separated into the function of the morphological analysis unit 120 and the function of the morpheme subdivision unit 130 and placed on two separate devices.

FIG. 2 is a block diagram showing an example of the hardware configuration of the morphological analyzer in the present embodiment.

The morphological analyzer 101 includes a storage device 102, a CPU (Central Processing Unit) 103, a keyboard 104, a monitor 105, and an I/O (Input/Output) 108, which are connected by an internal bus 106. The storage device 102 stores the operation programs of the CPU 103, such as that of the morpheme subdivision unit 130. The CPU 103 controls the morphological analyzer 101 as a whole, executes the operation programs stored in the storage device 102, and, via the I/O 108, executes programs such as that of the morpheme subdivision unit 130 and sends and receives data. This internal configuration of the morphological analyzer 101 is only an example. The morphological analyzer 101 may include only the CPU 103 and operate using an externally provided storage device 102, keyboard 104, monitor 105, and I/O 108.

Next, the operation of the present embodiment is described.

FIG. 3 is a flowchart showing the operation of the morphological analyzer in the present embodiment.

The morphological analysis unit 120 of the morphological analyzer 110 performs morphological analysis of text based on the word dictionary, treating each compound word as a single word (step S11).

The morpheme subdivision unit 130 of the morphological analyzer 110 divides each compound word contained in the intermediate result of the morphological analysis by the morphological analysis unit 120 into words, based on the word-cut information and the like contained in the word dictionary 140 (step S12).

FIG. 4 is a diagram for explaining a specific example of the processing procedure of the morphological analyzer in the present embodiment.

Seven words (numbers 1-7) are registered in the word dictionary 140 in advance: 「そう」 (adverb), 「する」 (verb), 「つもり」 (noun), 「つもり」 (auxiliary verb), 「で」 (particle), 「でいる」 (auxiliary verb), and 「いる」 (verb). One compound word (number 10) is also registered: 「つもりでいる」 (auxiliary verb). The part-of-speech types shown in parentheses are given for convenience of explanation and need not be included in the contents of the word dictionary 140. The word dictionary 140 may contain several words consisting of the same character string as long as their parts of speech differ. For example, 「つもり」 (noun, number 3) and 「つもり」 (auxiliary verb, number 4) consist of the same character string but differ in part of speech.

As word-cut information, the word dictionary 140 contains information on the position at which the compound word (number 10) 「つもりでいる」 (auxiliary verb) is divided into the words it contains. The word-cut information in FIG. 4 shows that the compound word 「つもりでいる」 (auxiliary verb, number 10) is divided between words, at the position of the "/", into the two words 「つもり」 (auxiliary verb, number 4) and 「でいる」 (auxiliary verb, number 6). The "/" delimiter between words is only an example, and a different delimiter may be used. Alternatively, instead of a delimiter, a plain sequence of word identifiers (for example, the sequence number 4, number 6) may be used. Alternatively, the word-cut information may be held as a number indicating the start position of a word delimited within the compound word. For 「つもりでいる」, for example, the word-cut information may be "3", meaning that the compound word is divided three characters from its first character.
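The three interchangeable representations of word-cut information described above can be sketched as follows. This is a minimal illustration, not the patent's data format: the `Entry` class, its field names, and the helper functions are all hypothetical, and only the entries needed for the example are shown.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Entry:
    """One dictionary entry (hypothetical layout for illustration)."""
    number: int
    surface: str
    pos: str
    cut_marked: Optional[str] = None            # (a) delimiter form, e.g. "つもり/でいる"
    cut_ids: Optional[Tuple[int, ...]] = None   # (b) sequence of word numbers
    cut_offset: Optional[int] = None            # (c) character offset of the cut


DICTIONARY = {e.number: e for e in [
    Entry(4, "つもり", "auxiliary verb"),
    Entry(6, "でいる", "auxiliary verb"),
    Entry(10, "つもりでいる", "auxiliary verb",
          cut_marked="つもり/でいる", cut_ids=(4, 6), cut_offset=3),
]}


def parts_from_delimiter(entry, delimiter="/"):
    # (a) split the marked string at the delimiter
    return entry.cut_marked.split(delimiter)


def parts_from_ids(entry, dictionary):
    # (b) look up each component word by its number
    return [dictionary[n].surface for n in entry.cut_ids]


def parts_from_offset(entry):
    # (c) cut the surface string after `cut_offset` characters
    return [entry.surface[:entry.cut_offset], entry.surface[entry.cut_offset:]]


compound = DICTIONARY[10]
print(parts_from_delimiter(compound))
print(parts_from_ids(compound, DICTIONARY))
print(parts_from_offset(compound))
```

All three calls yield the same two pieces, 「つもり」 and 「でいる」. A real dictionary would store only one of the three forms; they are kept side by side here purely for comparison.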

Based on the word dictionary 140, the morphological analysis unit 120 performs morphological analysis of the given original sentence 121, 「そうするつもりでいる」 ("I intend to do so") (step S11).

Based on the word dictionary 140, the morphological analysis unit 120 selects from the five possible morphological analysis candidates 122 (candidates 1-5), as the result of the morphological analysis, for example the intermediate result 123, "/そう(1)/する(2)/つもりでいる(10)".
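How five candidates arise from this dictionary can be checked with a small enumeration. The sketch below is an assumption about how candidates might be generated (a plain recursive prefix match); the patent does not specify the procedure, and the function name is hypothetical. The order of the output need not match the candidate numbering in FIG. 4.

```python
# Entry number -> surface string, as listed for FIG. 4.
WORDS = {
    1: "そう", 2: "する", 3: "つもり", 4: "つもり",
    5: "で", 6: "でいる", 7: "いる", 10: "つもりでいる",
}


def enumerate_candidates(text, words=WORDS):
    """Return every way to cover `text` with dictionary entries,
    each as a list of entry numbers. A compound word counts as one entry."""
    if not text:
        return [[]]
    results = []
    for number, surface in words.items():
        if text.startswith(surface):
            for rest in enumerate_candidates(text[len(surface):], words):
                results.append([number] + rest)
    return results


candidates = enumerate_candidates("そうするつもりでいる")
for candidate in candidates:
    print(candidate)
```

The enumeration produces five sequences, one of which is 1, 2, 10, the sequence chosen as the intermediate result. The other four differ in whether 「つもり」 is taken as the noun (3) or the auxiliary verb (4) and in whether the tail is 「でいる」 (6) or 「で」 plus 「いる」 (5, 7).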

Based on the word-cut information contained in the word dictionary 140, the morpheme subdivision unit 130 divides the compound word 「つもりでいる(10)」 contained in the intermediate result 123 from the morphological analysis unit 120, "/そう(1)/する(2)/つもりでいる(10)", into the words 「つもり(4)」 and 「でいる(6)」 (step S12).
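Step S12 amounts to replacing each compound word in the intermediate result with its component words. A minimal sketch, assuming the word-cut information is held as a sequence of word numbers (one of the options the description allows); the names `WORD_CUT` and `subdivide` are hypothetical.

```python
# Compound word number -> component word numbers (word-cut information).
WORD_CUT = {10: [4, 6]}


def subdivide(intermediate, word_cut=WORD_CUT):
    """Replace every compound word in `intermediate` by its components;
    ordinary words pass through unchanged."""
    final = []
    for number in intermediate:
        final.extend(word_cut.get(number, [number]))
    return final


print(subdivide([1, 2, 10]))  # [1, 2, 4, 6]
```

The final result corresponds to "/そう(1)/する(2)/つもり(4)/でいる(6)".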

FIG. 5 is a table for explaining a specific example of the cost calculation in the morphological analyzer in the present embodiment.

The following describes in detail the case where the morphological analysis unit 120 performs morphological analysis based on defined connection costs between words.

The upper table in FIG. 5 is an example of a connection cost table, held by the morphological analysis unit 120, that defines the connection costs between words. Nine connection costs (numbers 1-9) are defined in the connection cost table. For example, the first rule (number 1) states that the connection cost when 「する(2)」 follows 「そう(1)」 is "10".

The lower table in FIG. 5 is an example of the total connection cost for each of the five possible morphological analysis candidates 122 (candidates 1-5) shown in FIG. 4. For the first candidate (candidate 1), for example, the first rule (number 1) applies to the connection between 「そう(1)」 and 「する(2)」, giving a cost of "10", and the fourth rule (number 4) applies to the connection between 「する(2)」 and 「つもりでいる(10)」, adding a cost of "10", so the total connection cost is "20". When the total connection cost has been calculated for each of the five candidates 122 (candidates 1-5), the first candidate (candidate 1) has the smallest total and is therefore selected as the result of the morphological analysis (the intermediate result).
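The calculation for candidate 1 can be reproduced in a few lines. Only two of the nine rules are stated in the text: rule 1 (「そう(1)」 followed by 「する(2)」, cost 10) and rule 4 (「する(2)」 followed by 「つもりでいる(10)」, cost 10). The other seven values below are invented placeholders, chosen only so that the example runs; they are not the values in FIG. 5.

```python
# (previous word number, next word number) -> connection cost
CONNECTION_COST = {
    (1, 2): 10,    # rule 1, stated in the text
    (2, 10): 10,   # rule 4, stated in the text
    # The remaining costs are ASSUMED for illustration only.
    (2, 3): 20, (2, 4): 10,
    (3, 5): 10, (3, 6): 20,
    (4, 5): 20, (4, 6): 10,
    (5, 7): 10,
}

# The five segmentations of 「そうするつもりでいる」 (order is arbitrary).
CANDIDATES = [
    [1, 2, 10],
    [1, 2, 3, 6],
    [1, 2, 4, 6],
    [1, 2, 3, 5, 7],
    [1, 2, 4, 5, 7],
]


def total_cost(sequence, cost=CONNECTION_COST):
    """Sum the connection costs over every adjacent pair of words."""
    return sum(cost[pair] for pair in zip(sequence, sequence[1:]))


best = min(CANDIDATES, key=total_cost)
print(best, total_cost(best))
```

With these placeholder values the sequence 1, 2, 10 has total cost 20, matching the figure given for candidate 1, and is the minimum. Whether it remains the minimum with the real table depends on the seven values not reproduced in the text.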

As described above, in the morphological analyzer 110 of the present embodiment, compound words are registered in the word dictionary 140. The connection cost inside a compound word is therefore "0", which makes the compound word more likely to be included in the result of the morphological analysis. At the same time, the compound word itself, as a single word, has connection costs defined with other words, so its likelihood of being selected as the result of the morphological analysis does not rise excessively.

The morphological analyzer 110 of the present embodiment can therefore efficiently correct the information registered in the word dictionary and achieve high-accuracy morphological analysis.

The present embodiment has been described for the case where the morphological analysis unit performs morphological analysis based on connection costs between words. However, any morphological analysis method that uses a word dictionary may be used, and its specific details are not particularly limited.
(Second embodiment)
In the description of the present embodiment, matters common to the first embodiment are omitted, and only the differences from the first embodiment are described.

The configuration of the morphological analyzer 115 in the present embodiment is the same as that of the morphological analyzer 110 in the first embodiment shown in FIG. 1. However, the operation of the morphological analysis unit 125 of the morphological analyzer 115 differs from that of the morphological analysis unit 120 of the morphological analyzer 110 in the first embodiment shown in FIG. 4. The operation of the morpheme subdivision unit 135 of the morphological analyzer 115 likewise differs from that of the morpheme subdivision unit 130 of the morphological analyzer 110 in the first embodiment shown in FIG. 4. Specifically, the morphological analysis unit 125 of the present embodiment performs morphological analysis based on connection costs defined between parts of speech instead of connection costs defined between words. The morpheme subdivision unit 135 of the present embodiment performs the further division based on a division position specified by a number of characters from the start of the compound word instead of a division position specified by a delimiter within the compound word.

FIG. 6 is a diagram for explaining a specific example of the processing procedure of the morphological analyzer in the present embodiment.

Seven words (numbers 1-7) are registered in the word dictionary 145 in advance: 「そう」 (adverb), 「する」 (verb), 「つもり」 (noun), 「つもりだ」 (auxiliary verb), 「で」 (particle), 「でいる」 (auxiliary verb), and 「いる」 (verb). One compound word (number 10) is also registered: 「つもりでいる」 (auxiliary verb). Here, however, the words are registered as "headwords", and the word dictionary 145 also registers the conjugated forms of the headwords: 「そう」, 「する」, 「つもり」, 「つもり」, 「で」, 「でいる」, 「いる」, and 「つもりでいる」. One conjugated form is shown for each headword, but a headword may have several conjugated forms. Alternatively, information on the "conjugation type" of each headword may be registered. The word dictionary 145 also registers the part-of-speech type of each headword.

In the word dictionary 145, the word-cut information is held as a number (for example, "3") indicating the start position of a word delimited within the compound word. As word-cut information, the word dictionary 145 further contains information on the two words contained in the compound word (number 10) 「つもりでいる」 (auxiliary verb): 「つもりだ」 (auxiliary verb, number 4) and 「でいる」 (auxiliary verb, number 6).

Based on the word dictionary 145, the morphological analysis unit 125 performs morphological analysis of the given original sentence 126, 「そうするつもりでいる」 ("I intend to do so") (step S11).

Based on the word dictionary 145, the morphological analysis unit 125 selects from the five possible morphological analysis candidates 127 (candidates 1-5), as the result of the morphological analysis, for example the intermediate result 128, "/そう(1)/する(2)/つもりでいる(10)".

Based on the word-cut information contained in the word dictionary 145, the morpheme subdivision unit 135 divides the compound word 「つもりでいる(10)」 contained in the intermediate result 128 from the morphological analysis unit 125, "/そう(1)/する(2)/つもりでいる(10)", at the third character into the words 「つもり(4)」 and 「でいる(6)」 (step S12).
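In this embodiment the cut position is a character count, and the pieces are conjugated forms of headwords rather than the headwords themselves (the piece 「つもり」 belongs to headword 4, 「つもりだ」). The sketch below shows one way this could fit together. The `Headword` class, its fields, and the consistency check are assumptions made for illustration; the patent does not describe how the pieces are matched to headwords.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Headword:
    """Hypothetical layout of a word dictionary 145 entry."""
    number: int
    headword: str
    forms: Tuple[str, ...]               # conjugated forms
    pos: str
    cut_offset: Optional[int] = None     # characters from the start of the compound
    components: Tuple[int, ...] = ()     # headword numbers of the component words


DICTIONARY_145 = {h.number: h for h in [
    Headword(4, "つもりだ", ("つもり",), "auxiliary verb"),
    Headword(6, "でいる", ("でいる",), "auxiliary verb"),
    Headword(10, "つもりでいる", ("つもりでいる",), "auxiliary verb",
             cut_offset=3, components=(4, 6)),
]}


def subdivide_by_offset(surface, entry, dictionary):
    """Cut `surface` after `entry.cut_offset` characters and pair each
    piece with the headword number it belongs to."""
    pieces = [surface[:entry.cut_offset], surface[entry.cut_offset:]]
    result = []
    for number, piece in zip(entry.components, pieces):
        # Sanity check (an added assumption): the piece must be a
        # registered conjugated form of the component headword.
        if piece not in dictionary[number].forms:
            raise ValueError(f"{piece!r} is not a form of headword {number}")
        result.append((number, piece))
    return result


print(subdivide_by_offset("つもりでいる", DICTIONARY_145[10], DICTIONARY_145))
```

The call returns the pairs (4, 「つもり」) and (6, 「でいる」), that is, the cut falls after the third character.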

FIG. 7 is a table for explaining a specific example of the cost calculation of the morphological analyzer in the present embodiment.

The following describes in detail the case where the morphological analysis unit 125 performs morphological analysis based on defined connection costs between parts of speech.

The upper table in FIG. 7 is an example of a connection cost table, held by the morphological analysis unit 125, that defines the connection costs between parts of speech. Eight connection costs (numbers 1-8) are defined in the connection cost table. For example, the first rule (number 1) states that the connection cost when a "verb" follows an "adverb" is "10".

The lower table in FIG. 7 is an example of the total connection cost for each of the five possible morphological analysis candidates 127 (candidates 1-5) shown in FIG. 6. For the first candidate (candidate 1), for example, the first rule (number 1) applies to the connection between 「そう(1)」 and 「する(2)」, giving a cost of "10", and the third rule (number 3) applies to the connection between 「する(2)」 and 「つもりでいる(10)」, adding a cost of "10", so the total connection cost is "20". When the total connection cost has been calculated for each of the five candidates 127 (candidates 1-5), the first candidate (candidate 1) has the smallest total and is therefore selected as the result of the morphological analysis (the intermediate result).
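With part-of-speech costs, the lookup key is the pair of part-of-speech types rather than the pair of words. In the sketch below, only rule 1 (adverb followed by verb, cost 10) and the cost of 10 for 「する(2)」 followed by 「つもりでいる(10)」 (rule 3, a verb followed by an auxiliary verb according to the dictionary) come from the text. The other six values are invented placeholders, not the values in FIG. 7.

```python
# Word number -> part of speech, from the word dictionary 145 listing.
POS = {
    1: "adverb", 2: "verb", 3: "noun", 4: "auxiliary verb",
    5: "particle", 6: "auxiliary verb", 7: "verb", 10: "auxiliary verb",
}

# (previous part of speech, next part of speech) -> connection cost
POS_COST = {
    ("adverb", "verb"): 10,            # rule 1, stated in the text
    ("verb", "auxiliary verb"): 10,    # rule 3, cost stated in the text
    # The remaining costs are ASSUMED for illustration only.
    ("verb", "noun"): 20,
    ("noun", "particle"): 10,
    ("noun", "auxiliary verb"): 20,
    ("auxiliary verb", "particle"): 20,
    ("auxiliary verb", "auxiliary verb"): 10,
    ("particle", "verb"): 10,
}

CANDIDATES = [
    [1, 2, 10],
    [1, 2, 3, 6],
    [1, 2, 4, 6],
    [1, 2, 3, 5, 7],
    [1, 2, 4, 5, 7],
]


def total_pos_cost(sequence, pos=POS, cost=POS_COST):
    """Sum connection costs, looking each adjacent pair up by part of speech."""
    return sum(cost[(pos[a], pos[b])] for a, b in zip(sequence, sequence[1:]))


best = min(CANDIDATES, key=total_pos_cost)
print(best, total_pos_cost(best))
```

Eight part-of-speech pairs cover every adjacency that occurs in the five candidates, which agrees with the eight rules mentioned for FIG. 7. Keying the table by part of speech means that a new word with an existing part of speech needs no new cost entries.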

As described above, in the morpheme analyzer 115 according to the present embodiment, each word is registered in the word dictionary 145 as a headword, together with the part of speech to which the headword belongs and information on the inflected forms of the headword. Therefore, a word does not need to be registered in the word dictionary 145 separately for each of its inflected forms.
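One way to picture such a dictionary entry is sketched below in Python. The field names, the inflection class label `godan-ka`, and the idea of expanding surface forms by swapping the final kana are all assumptions made for illustration; the text only states that the headword, its part of speech, and information on its inflected forms are registered.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical inflection table: class name -> endings that replace the
# final kana of the headword. Only one class is shown for illustration.
INFLECTION_ENDINGS: Dict[str, List[str]] = {
    "godan-ka": ["か", "き", "く", "け", "こ"],
}


@dataclass(frozen=True)
class DictionaryEntry:
    headword: str              # dictionary form of the word
    pos: str                   # part of speech of the headword
    inflection: Optional[str]  # inflection class, or None if uninflected


def surface_forms(entry: DictionaryEntry) -> List[str]:
    """Derive the inflected surface forms from a single headword entry."""
    if entry.inflection is None:
        return [entry.headword]
    stem = entry.headword[:-1]
    return [stem + ending for ending in INFLECTION_ENDINGS[entry.inflection]]


# A single entry covers every inflected form of the verb.
kaku = DictionaryEntry(headword="書く", pos="verb", inflection="godan-ka")
print(surface_forms(kaku))
```

With this layout, one entry per headword is enough, which matches the stated benefit that separate registrations per inflected form become unnecessary.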

Accordingly, in addition to the effects of the morpheme analyzer 110 according to the first embodiment, the morpheme analyzer 115 according to the present embodiment has the effect that registration in the word dictionary and setting of the connection costs can be further simplified.

Although the present embodiment has described the case where the morpheme analysis unit executes morpheme analysis based on the connection costs between words, the morpheme analysis method may be any morpheme analysis method that uses a word dictionary, and its specific details are not particularly limited.

Each process of the morpheme analyzer of FIG. 3 may be executed by software. That is, a computer program for performing each process may be read and executed by the CPU (FIG. 2: 903) provided in the morpheme analyzer. Performing each process by means of a program achieves the same processing as that of the above-described embodiments. The program may be stored in a non-transitory medium such as a semiconductor memory device, for example a ROM (Read Only Memory), a RAM (Random Access Memory), or a flash memory, or an optical disk, a magnetic disk, or a magneto-optical disk.

Alternatively, each process may be executed by a component such as a dedicated circuit.

The present invention is not limited to the above-described embodiments, and can be implemented with various changes and modifications without departing from the gist of the present invention.

The morpheme analyzer of the present invention can be used, for example, as part of a machine translation system, a text mining system, or the like.

Furthermore, the morpheme analysis of the present invention can be applied not only to the morpheme analyzer described in the embodiments but also, for example, as part of a kana-kanji system, a speech recognition system, or the like.

Claims

A morpheme analyzer comprising: a morpheme analysis unit that executes morpheme analysis of a sentence, treating a compound word as one word, based on a word dictionary in which are registered words, compound words each being a sequence of predetermined words, and word cut information relating to positions at which the compound word can be divided into the predetermined words; and a morpheme subdivision unit that, after the morpheme analysis unit has executed the morpheme analysis, divides the compound word included in the result of the morpheme analysis into the predetermined words based on the word dictionary.

The morpheme analyzer according to claim 1, wherein the word cut information is a predetermined symbol inserted at the position, a sequence of the predetermined words, or a sequence of identifiers of the predetermined words.

The morpheme analyzer according to claim 1, wherein the word cut information is the number of characters, counted from the beginning or the end of the compound word, that represents the position.

The morpheme analyzer according to any one of claims 1 to 3, wherein the words registered in the word dictionary are inflected forms.

The morpheme analyzer according to any one of claims 1 to 3, wherein the word, being a headword, and information on the inflected forms of the headword associated with the headword are registered in the word dictionary.

The morpheme analyzer according to any one of claims 1 to 5, wherein the morpheme analysis unit holds information on a first connection cost incurred when the words are connected, and selects, as the result of the morpheme analysis, the sequence of the words that minimizes the sum of the first connection costs in the sentence.

The morpheme analyzer according to any one of claims 1 to 6, wherein information on the part of speech to which the word belongs is registered in the word dictionary.

The morpheme analyzer according to any one of claims 1 to 5 or claim 7, wherein the morpheme analysis unit holds information on a second connection cost incurred when the words having different parts of speech are connected, and selects, as the result of the morpheme analysis, the sequence of the words that minimizes the sum of the second connection costs in the sentence.

A morpheme analysis method comprising: executing morpheme analysis of a sentence, treating a compound word as one word, based on a word dictionary in which are registered words, compound words each being a sequence of predetermined words, and word cut information relating to positions at which the compound word can be divided into the predetermined words; and, after executing the morpheme analysis, dividing the compound word included in the result of the morpheme analysis into the predetermined words based on the word dictionary.

A morpheme analysis program for causing a computer, provided in a morpheme analyzer that includes a word dictionary in which are registered words, compound words each being a sequence of predetermined words, and word cut information relating to positions at which the compound word can be divided into the predetermined words, to function as: a morpheme analysis unit that executes morpheme analysis of a sentence, treating the compound word as one word, based on the word dictionary; and a morpheme subdivision unit that, after the morpheme analysis unit has executed the morpheme analysis, divides the compound word included in the result of the morpheme analysis into the predetermined words based on the word dictionary.