natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.
buruzaemon/natto-py
A package leveraging FFI (foreign function interface), natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language. No compiler is necessary, as it is not a C extension. natto-py will run on Mac OS, Windows and *nix.

You can learn more about natto-py at GitHub.

If you are still using Python 2 after sunset, please stick with version natto-py==0.9.2.
natto-py requires the following:
- An existing installation of MeCab 0.996
- A system dictionary, like IPA, Juman or Unidic
- cffi 0.8.6 or greater
The following Python 3 versions are supported:
For Python 2, please use version 0.9.2.
Install natto-py as you would any other Python package:

    $ pip install natto-py

This will automatically install the cffi package, which natto-py uses to bind to the mecab library.
As long as the mecab (and mecab-config for *nix and Mac OS) executables are on your PATH, natto-py does not require any explicit configuration.
- On *nix and Mac OS, it queries mecab-config to discover the path to the libmecab.so or libmecab.dylib, respectively.
- On Windows, it queries the Windows Registry to locate the MeCab installation folder.
- In order to convert character encodings to/from Unicode, natto-py will examine the charset of the mecab system dictionary.
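The lookup order described above can be sketched roughly as follows. This is not natto-py's actual implementation: guess_mecab_library is a hypothetical helper, the use of the mecab-config option --libs-only-L is an assumption about how the library directory might be obtained, and the Windows Registry lookup is left out entirely.

```python
import os
import shutil
import subprocess
import sys

# Library file names by platform, as named in the text above.
LIB_NAMES = {'darwin': 'libmecab.dylib', 'win32': 'libmecab.dll'}


def guess_mecab_library(environ=os.environ, platform=sys.platform):
    """Hypothetical helper: best-effort guess at the mecab library path."""
    # An explicitly configured MECAB_PATH always wins.
    explicit = environ.get('MECAB_PATH')
    if explicit:
        return explicit

    # The Windows Registry lookup is not sketched here; give up on Windows
    # and whenever mecab-config is not on the PATH.
    if platform == 'win32' or shutil.which('mecab-config') is None:
        return None

    # Assumption: `mecab-config --libs-only-L` prints the library directory.
    result = subprocess.run(['mecab-config', '--libs-only-L'],
                            capture_output=True, text=True, check=True)
    libdir = result.stdout.strip()
    return os.path.join(libdir, LIB_NAMES.get(platform, 'libmecab.so'))


# Demonstration with an explicit setting, so no MeCab installation is needed.
print(guess_mecab_library({'MECAB_PATH': '/usr/local/lib/libmecab.so'}))
```

A return value of None here corresponds to the situation where explicit configuration is needed.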
If natto-py for some reason cannot locate the mecab library, or if it cannot determine the correct charset used internally by mecab, then you will need to set the MECAB_PATH and MECAB_CHARSET environment variables.
- Set the MECAB_PATH environment variable to the exact name/path to your mecab library.
- Set the MECAB_CHARSET environment variable to the charset character encoding used by your system dictionary.
e.g., for Mac OS:

    export MECAB_PATH=/usr/local/Cellar/mecab/0.996/lib/libmecab.dylib
    export MECAB_CHARSET=utf8

e.g., for bash on UNIX/Linux:

    export MECAB_PATH=/usr/local/lib/libmecab.so
    export MECAB_CHARSET=euc-jp

e.g., on Windows:

    set MECAB_PATH=C:\Program Files\MeCab\bin\libmecab.dll
    set MECAB_CHARSET=shift-jis

e.g., from within a Python program:

    import os

    os.environ['MECAB_PATH'] = '/usr/local/lib/libmecab.so'
    os.environ['MECAB_CHARSET'] = 'utf-16'
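Before relying on these variables, it may help to sanity-check them. The following is a minimal sketch using only the standard library; check_mecab_env is a hypothetical helper and not part of natto-py. It only confirms that the path exists and that Python recognizes the charset name, not that the charset actually matches your dictionary.

```python
import codecs
import os


def check_mecab_env(environ=os.environ):
    """Hypothetical helper: list problems with MECAB_PATH / MECAB_CHARSET."""
    problems = []

    # MECAB_PATH should name the library file itself, not its directory.
    path = environ.get('MECAB_PATH')
    if path is not None and not os.path.isfile(path):
        problems.append('MECAB_PATH does not point to a file: ' + path)

    # MECAB_CHARSET should at least be a codec name that Python can resolve.
    charset = environ.get('MECAB_CHARSET')
    if charset is not None:
        try:
            codecs.lookup(charset)
        except LookupError:
            problems.append('MECAB_CHARSET is not a known codec: ' + charset)

    return problems


for problem in check_mecab_env():
    print(problem)
```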
Here's a very quick guide to using natto-py.

Instantiate a reference to the mecab library, and display some details:

    from natto import MeCab

    nm = MeCab()
    print(nm)

    # displays details about the MeCab instance
    <natto.mecab.MeCab
     model=<cdata 'mecab_model_t *' 0x801c16300>,
     tagger=<cdata 'mecab_t *' 0x801c17470>,
     lattice=<cdata 'mecab_lattice_t *' 0x801c196c0>,
     libpath="/usr/local/lib/libmecab.so",
     options={},
     dicts=[<natto.dictionary.DictionaryInfo
             dictionary='mecab_dictionary_info_t *' 0x801c19540>,
             filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic",
             charset=utf8,
             type=0],
     version=0.996>
Display details about the mecab system dictionary used:

    sysdic = nm.dicts[0]
    print(sysdic)

    # displays the MeCab system dictionary info
    <natto.dictionary.DictionaryInfo
     dictionary='mecab_dictionary_info_t *' 0x801c19540>,
     filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic",
     charset=utf8,
     type=0>
Parse Japanese text and send the MeCab result as a single string to stdout:

    print(nm.parse('ピンチの時には必ずヒーローが現れる。'))

    # MeCab result as a single string
    ピンチ    名詞,一般,*,*,*,*,ピンチ,ピンチ,ピンチ
    の    助詞,連体化,*,*,*,*,の,ノ,ノ
    時    名詞,非自立,副詞可能,*,*,*,時,トキ,トキ
    に    助詞,格助詞,一般,*,*,*,に,ニ,ニ
    は    助詞,係助詞,*,*,*,*,は,ハ,ワ
    必ず    副詞,助詞類接続,*,*,*,*,必ず,カナラズ,カナラズ
    ヒーロー    名詞,一般,*,*,*,*,ヒーロー,ヒーロー,ヒーロー
    が    助詞,格助詞,一般,*,*,*,が,ガ,ガ
    現れる    動詞,自立,*,*,一段,基本形,現れる,アラワレル,アラワレル
    。    記号,句点,*,*,*,*,。,。,。
    EOS
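If you do work with the single-string result, you end up splitting it yourself. The sketch below assumes MeCab's default output layout (surface, a tab, then comma-separated features, terminated by an EOS line); split_mecab_output is a hypothetical helper, and the sample string stands in for a real nm.parse() result so that the snippet runs without MeCab installed.

```python
def split_mecab_output(result):
    """Hypothetical helper: split default MeCab output into pairs."""
    pairs = []
    for line in result.splitlines():
        # Skip the end-of-sentence marker and any blank lines.
        if line == 'EOS' or not line:
            continue
        # The surface and the feature string are separated by a tab.
        surface, _, feature = line.partition('\t')
        pairs.append((surface, feature.split(',')))
    return pairs


# Stand-in for the string returned by nm.parse(...)
sample = ('ピンチ\t名詞,一般,*,*,*,*,ピンチ,ピンチ,ピンチ\n'
          'の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n'
          'EOS')

for surface, features in split_mecab_output(sample):
    print(surface, features[0])
```

Note that the naive comma split mishandles features that themselves contain a comma, which is one reason to prefer node parsing and output formatting.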
Next, try parsing the text with MeCab node parsing. A generator yielding the MeCabNode instances lets you efficiently iterate over the output without first materializing each and every resulting MeCabNode instance. The MeCabNode instances yielded allow access to more detailed information about each morpheme.

Here we use a Python with-statement to automatically clean up after we finish node parsing with the MeCab tagger. This is the recommended approach for using natto-py in a production environment:

    # Use a Python with-statement to ensure mecab_destroy is invoked
    #
    with MeCab() as nm:
        for n in nm.parse('ピンチの時には必ずヒーローが現れる。', as_nodes=True):
            # ignore any end-of-sentence nodes
            if not n.is_eos():
                print('{}\t{}'.format(n.surface, n.cost))

    ピンチ    3348
    の    3722
    時    5176
    に    5083
    は    5305
    必ず    7525
    ヒーロー    11363
    が    10508
    現れる    10841
    。    7127
MeCab output formatting is extremely flexible and is highly recommended for any serious natural language processing task. Rather than parsing the MeCab output as a single, large string, use MeCab's --node-format option (short form -F) to customize the node's feature attribute.
- morpheme surface
- part-of-speech
- part-of-speech ID
- pronunciation
It is good practice when using --node-format to also specify node formatting in the case where the morpheme cannot be found in the dictionary, by using --unk-format (short form -U).

This example formats the node feature to capture the items above as a comma-separated value:

    # MeCab options used:
    #
    #   -F    ... short-form of --node-format
    #   %m    ... morpheme surface
    #   %f[0] ... part-of-speech
    #   %h    ... part-of-speech id (ipadic)
    #   %f[8] ... pronunciation
    #
    #   -U    ... short-form of --unk-format
    #             output ?,?,?,? for morphemes not in dictionary
    #
    with MeCab(r'-F%m,%f[0],%h,%f[8]\n -U?,?,?,?\n') as nm:
        for n in nm.parse('ピンチの時には必ずヒーローが現れる。', as_nodes=True):
            # only normal nodes, ignore any end-of-sentence and unknown nodes
            if n.is_nor():
                print(n.feature)

    ピンチ,名詞,38,ピンチ
    の,助詞,24,ノ
    時,名詞,66,トキ
    に,助詞,13,ニ
    は,助詞,16,ワ
    必ず,副詞,35,カナラズ
    ヒーロー,名詞,38,ヒーロー
    が,助詞,13,ガ
    現れる,動詞,31,アラワレル
    。,記号,7,。
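Once the feature is a fixed comma-separated layout, turning it into a structured record is straightforward. This is a minimal sketch: Morpheme and to_morpheme are hypothetical names, and it assumes the four-field format used above with no commas inside the part-of-speech or pronunciation fields. It is meant for normal nodes only, since the ?,?,?,? placeholder for unknown morphemes has no numeric part-of-speech ID.

```python
from collections import namedtuple

# Field order follows the -F format above: %m, %f[0], %h, %f[8]
Morpheme = namedtuple('Morpheme', 'surface pos pos_id pronunciation')


def to_morpheme(feature):
    """Hypothetical helper: parse one 'surface,pos,pos_id,pronunciation'."""
    # Split from the right so a comma inside the surface is preserved.
    surface, pos, pos_id, pronunciation = feature.strip().rsplit(',', 3)
    return Morpheme(surface, pos, int(pos_id), pronunciation)


# Stand-ins for the n.feature values printed above
for feature in ['ピンチ,名詞,38,ピンチ', '現れる,動詞,31,アラワレル']:
    m = to_morpheme(feature)
    print(m.surface, m.pos, m.pos_id, m.pronunciation)
```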
Partial parsing (制約付き解析) allows you to pass hints to MeCab on how to tokenize morphemes when parsing. Most useful are boundary constraint parsing and feature constraint parsing.

With boundary constraint parsing, you can specify either a compiled re regular expression object or a string to tell MeCab where the boundaries of a morpheme should be. Use the boundary_constraints keyword. For hints on tokenization, please see Regular expression operations and re.finditer in particular.
This example uses the -F node-format option to customize the resulting MeCabNode feature attribute to extract:
- %m - morpheme surface
- %f[0] - node part-of-speech
- %s - node stat status value, 1 is unknown

Note that any such morphemes captured will have node stat status of 1 (unknown):

    import re

    with MeCab(r'-F%m,\s%f[0],\s%s\n') as nm:

        text = '俺は努力したよっ? お前の10倍、いや100倍1000倍したよっ!'

        # capture 10倍, 100倍 and 1000倍 as single parts-of-speech
        pattern = re.compile('10+倍')

        for n in nm.parse(text, boundary_constraints=pattern, as_nodes=True):
            print(n.feature)

    俺, 名詞, 0
    は, 助詞, 0
    努力, 名詞, 0
    し, 動詞, 0
    たよっ, 動詞, 0
    ?, 記号, 0
    お前, 名詞, 0
    の, 助詞, 0
    10倍, 名詞, 1
    、, 記号, 0
    いや, 接続詞, 0
    100倍, 名詞, 1
    1000倍, 名詞, 1
    し, 動詞, 0
    たよっ, 動詞, 0
    !, 記号, 0
    EOS
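To preview which substrings a boundary pattern will capture before handing it to MeCab, you can run the pattern through re.finditer on its own. This sketch uses only the standard library and says nothing about how natto-py applies the matches internally; it simply shows what the pattern selects.

```python
import re

text = '俺は努力したよっ? お前の10倍、いや100倍1000倍したよっ!'
pattern = re.compile('10+倍')

# Each match is one span that would be treated as a single morpheme.
spans = [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]

for start, end, chunk in spans:
    print(start, end, chunk)
```

If the list of matches is not what you expect, adjust the pattern here first; it is quicker than debugging the tokenization afterwards.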
With feature constraint parsing, you can provide instructions to MeCab on what feature to use for a matching morpheme. Use the feature_constraints keyword to pass in a tuple containing elements that themselves are tuple instances with a specific morpheme (str) and a corresponding feature (str), in order of constraint precedence:

    with MeCab(r'-F%m,\s%f[0],\s%s\n') as nm:

        text = '心の中で3回唱え、 ヒーロー見参!ヒーロー見参!ヒーロー見参!'
        features = (('ヒーロー見参', '感動詞'),)

        for n in nm.parse(text, feature_constraints=features, as_nodes=True):
            print(n.feature)

    心, 名詞, 0
    の, 助詞, 0
    中, 名詞, 0
    で, 助詞, 0
    3, 名詞, 1
    回, 名詞, 0
    唱え, 動詞, 0
    、, 記号, 0
    ヒーロー見参, 感動詞, 1
    !, 記号, 0
    ヒーロー見参, 感動詞, 1
    !, 記号, 0
    ヒーロー見参, 感動詞, 1
    !, 記号, 0
    EOS
- Examples and more detailed information about natto-py can be found on the project Wiki.
- Working code in Jupyter notebook form can be found under this project's notebooks directory.
- API documentation on Read the Docs.
- Use git and check out the latest code at GitHub to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
- Browse the issue tracker to make sure someone hasn't already requested it and/or contributed it.
- Fork the project.
- Start a feature/bugfix branch.
- Commit and push until you are happy with your contribution.
- Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
- Please try not to mess with the setup.py, CHANGELOG, or version files. If you must have your own version, that is fine, but please isolate to its own commit so I can cherry-pick around it.
- This project uses the following packages for development:
Please see the CHANGELOG for the release history.

Copyright © 2022, Brooke M. Fujita. All rights reserved. Please see the LICENSE file for further details.