
natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.

License

BSD-2-Clause license


natto-py

What is natto-py?

A package leveraging FFI (foreign function interface), natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language. No compiler is necessary, as it is not a C extension. natto-py will run on Mac OS, Windows and *nix.

You can learn more about natto-py at GitHub.

If you are still using Python 2 after sunset, please stick with version natto-py==0.9.2.

Requirements

natto-py requires the following:

An existing installation of MeCab 0.996
A system dictionary, like IPA, Juman or Unidic
cffi 0.8.6 or greater

The following Python 3 versions are supported:

For Python 2, please use version 0.9.2.

Installation

Install natto-py as you would any other Python package:

$ pip install natto-py

This will automatically install the cffi package, which natto-py uses to bind to the mecab library.
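To confirm that the installation pulled in both packages, a quick check along the following lines may help. This is only a sketch: the helper name `installed_version` is hypothetical, and it assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report what pip actually installed for natto-py and its cffi dependency.
for name in ("natto-py", "cffi"):
    print(name, installed_version(name) or "not installed")
```

The check only reports versions; it does not verify that MeCab itself or a system dictionary is present.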

Automatic Configuration

As long as the mecab (and mecab-config for *nix and Mac OS) executables are on your PATH, natto-py does not require any explicit configuration.

On *nix and Mac OS, it queries mecab-config to discover the path to the libmecab.so or libmecab.dylib, respectively.
On Windows, it queries the Windows Registry to locate the MeCab installation folder.
In order to convert character encodings to/from Unicode, natto-py will examine the charset of the mecab system dictionary.
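If automatic configuration does not seem to work on *nix or Mac OS, it can help to see what mecab-config reports. The sketch below is not natto-py's own lookup code; it is an illustration that assumes mecab-config is on your PATH and uses its `--libs-only-L` option to print the library directory. The function name `find_libmecab_dir` is hypothetical.

```python
import shutil
import subprocess


def find_libmecab_dir():
    """Return the library directory reported by mecab-config, or None."""
    exe = shutil.which("mecab-config")
    if exe is None:
        return None  # mecab-config is not on PATH (e.g. on Windows)
    result = subprocess.run(
        [exe, "--libs-only-L"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


print(find_libmecab_dir())
```

A result of None suggests that the explicit MECAB_PATH and MECAB_CHARSET configuration is the way to go.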

Explicit configuration via MECAB_PATH and MECAB_CHARSET

If natto-py for some reason cannot locate the mecab library, or if it cannot determine the correct charset used internally by mecab, then you will need to set the MECAB_PATH and MECAB_CHARSET environment variables.

Set the MECAB_PATH environment variable to the exact name/path to your mecab library.
Set the MECAB_CHARSET environment variable to the character encoding (charset) used by your system dictionary.

e.g., for Mac OS:

    export MECAB_PATH=/usr/local/Cellar/mecab/0.996/lib/libmecab.dylib
    export MECAB_CHARSET=utf8

e.g., for bash on UNIX/Linux:

    export MECAB_PATH=/usr/local/lib/libmecab.so
    export MECAB_CHARSET=euc-jp

e.g., on Windows:

    set MECAB_PATH=C:\Program Files\MeCab\bin\libmecab.dll
    set MECAB_CHARSET=shift-jis

e.g., from within a Python program:

    import os

    os.environ['MECAB_PATH'] = '/usr/local/lib/libmecab.so'
    os.environ['MECAB_CHARSET'] = 'utf-16'
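When setting these variables from Python, a small guard can catch a mistyped path before MeCab is loaded. The following is a minimal sketch, not part of natto-py: `configure_mecab` is a hypothetical helper, and it assumes MECAB_PATH is given as a full file path rather than a bare library name.

```python
import os


def configure_mecab(lib_path, charset):
    """Set MECAB_PATH and MECAB_CHARSET after checking the library file exists."""
    if not os.path.isfile(lib_path):
        raise FileNotFoundError("mecab library not found: {}".format(lib_path))
    os.environ["MECAB_PATH"] = lib_path
    os.environ["MECAB_CHARSET"] = charset


# Intended usage (requires MeCab and natto-py to be installed):
#
#   configure_mecab('/usr/local/lib/libmecab.so', 'utf8')
#   from natto import MeCab
#   nm = MeCab()
```

The variables should be set before the MeCab instance is created, since that is when the library is located.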

Usage

Here's a very quick guide to using natto-py.

Instantiate a reference to the mecab library, and display some details:

    from natto import MeCab

    nm = MeCab()
    print(nm)

    # displays details about the MeCab instance
    <natto.mecab.MeCab
     model=<cdata 'mecab_model_t *' 0x801c16300>,
     tagger=<cdata 'mecab_t *' 0x801c17470>,
     lattice=<cdata 'mecab_lattice_t *' 0x801c196c0>,
     libpath="/usr/local/lib/libmecab.so",
     options={},
     dicts=[<natto.dictionary.DictionaryInfo
             dictionary='mecab_dictionary_info_t *' 0x801c19540>,
             filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic",
             charset=utf8,
             type=0],
     version=0.996>

Display details about the mecab system dictionary used:

    sysdic = nm.dicts[0]
    print(sysdic)

    # displays the MeCab system dictionary info
    <natto.dictionary.DictionaryInfo
     dictionary='mecab_dictionary_info_t *' 0x801c19540>,
     filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic",
     charset=utf8,
     type=0>

Parse Japanese text and send the MeCab result as a single string to stdout:

    print(nm.parse('ピンチの時には必ずヒーローが現れる。'))

    # MeCab result as a single string
    ピンチ	名詞,一般,*,*,*,*,ピンチ,ピンチ,ピンチ
    の	助詞,連体化,*,*,*,*,の,ノ,ノ
    時	名詞,非自立,副詞可能,*,*,*,時,トキ,トキ
    に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
    は	助詞,係助詞,*,*,*,*,は,ハ,ワ
    必ず	副詞,助詞類接続,*,*,*,*,必ず,カナラズ,カナラズ
    ヒーロー	名詞,一般,*,*,*,*,ヒーロー,ヒーロー,ヒーロー
    が	助詞,格助詞,一般,*,*,*,が,ガ,ガ
    現れる	動詞,自立,*,*,一段,基本形,現れる,アラワレル,アラワレル
    。	記号,句点,*,*,*,*,。,。,。
    EOS
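To make the layout of that single string concrete: each line holds the surface, a tab, and a comma-separated feature list, and the result ends with EOS. The sketch below splits such a string in plain Python. `split_mecab_output` is a hypothetical helper, and it assumes MeCab's default output format with an ipadic-style dictionary; it does not call MeCab itself.

```python
def split_mecab_output(output):
    """Split default MeCab output into (surface, [features]) pairs."""
    rows = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        surface, _, feature = line.partition("\t")
        rows.append((surface, feature.split(",")))
    return rows


# Two lines taken from the output above, followed by the EOS marker.
sample = (
    "ピンチ\t名詞,一般,*,*,*,*,ピンチ,ピンチ,ピンチ\n"
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n"
    "EOS"
)

for surface, features in split_mecab_output(sample):
    print(surface, features[0])
```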

Next, try parsing the text with MeCab node parsing. A generator yielding the MeCabNode instances lets you efficiently iterate over the output without first materializing each and every resulting MeCabNode instance. The MeCabNode instances yielded allow access to more detailed information about each morpheme.

Here we use a Python with-statement to automatically clean up after we finish node parsing with the MeCab tagger. This is the recommended approach for using natto-py in a production environment:

    # Use a Python with-statement to ensure mecab_destroy is invoked
    #
    with MeCab() as nm:
        for n in nm.parse('ピンチの時には必ずヒーローが現れる。', as_nodes=True):
            # ignore any end-of-sentence nodes
            if not n.is_eos():
                print('{}\t{}'.format(n.surface, n.cost))

    ピンチ	3348
    の	3722
    時	5176
    に	5083
    は	5305
    必ず	7525
    ヒーロー	11363
    が	10508
    現れる	10841
    。	7127
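The same loop can be wrapped in a small function when the pairs are needed as data rather than printed. This is a sketch only: `surface_costs` is a hypothetical helper, and it relies on nothing beyond the `parse(..., as_nodes=True)`, `is_eos()`, `surface` and `cost` members shown above.

```python
def surface_costs(tagger, text):
    """Collect (surface, cost) pairs, skipping end-of-sentence nodes."""
    return [
        (n.surface, n.cost)
        for n in tagger.parse(text, as_nodes=True)
        if not n.is_eos()
    ]


# Intended usage (requires MeCab and natto-py to be installed):
#
#   from natto import MeCab
#   with MeCab() as nm:
#       print(surface_costs(nm, 'ピンチの時には必ずヒーローが現れる。'))
```

Note that building a list gives up the memory benefit of the generator, so this suits short inputs better than large corpora.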

MeCab output formatting is extremely flexible and is highly recommended for any serious natural language processing task. Rather than parsing the MeCab output as a single, large string, use MeCab's --node-format option (short form -F) to customize the node's feature attribute.

morpheme surface
part-of-speech
part-of-speech ID
pronunciation

It is good practice when using --node-format to also specify node formatting in the case where the morpheme cannot be found in the dictionary, by using --unk-format (short form -U).

This example formats the node feature to capture the items above as a comma-separated value:

    # MeCab options used:
    #
    # -F    ... short-form of --node-format
    # %m    ... morpheme surface
    # %f[0] ... part-of-speech
    # %h    ... part-of-speech id (ipadic)
    # %f[8] ... pronunciation
    #
    # -U    ... short-form of --unk-format
    #           output ?,?,?,? for morphemes not in dictionary
    #
    with MeCab(r'-F%m,%f[0],%h,%f[8]\n -U?,?,?,?\n') as nm:
        for n in nm.parse('ピンチの時には必ずヒーローが現れる。', as_nodes=True):
            # only normal nodes, ignore any end-of-sentence and unknown nodes
            if n.is_nor():
                print(n.feature)

    ピンチ,名詞,38,ピンチ
    の,助詞,24,ノ
    時,名詞,66,トキ
    に,助詞,13,ニ
    は,助詞,16,ワ
    必ず,副詞,35,カナラズ
    ヒーロー,名詞,38,ヒーロー
    が,助詞,13,ガ
    現れる,動詞,31,アラワレル
    。,記号,7,。
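Once the feature is a fixed four-field comma-separated value, it is easy to turn into a structured record. The sketch below assumes the exact `-F%m,%f[0],%h,%f[8]` layout used above; `Morpheme` and `parse_feature` are hypothetical names, not part of natto-py.

```python
from collections import namedtuple

Morpheme = namedtuple("Morpheme", "surface pos pos_id pronunciation")


def parse_feature(feature):
    """Parse 'surface,pos,pos_id,pronunciation' into a Morpheme record."""
    # Split from the right so that a comma inside the surface is preserved.
    surface, pos, pos_id, pronunciation = feature.strip().rsplit(",", 3)
    # The unknown-node format '?,?,?,?' has no numeric part-of-speech ID.
    pos_id = int(pos_id) if pos_id.isdigit() else None
    return Morpheme(surface, pos, pos_id, pronunciation)


print(parse_feature("現れる,動詞,31,アラワレル"))
```

Splitting from the right is a design choice that keeps the surface intact; it still assumes the pronunciation field itself contains no comma.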

Partial parsing (制約付き解析) allows you to pass hints to MeCab on how to tokenize morphemes when parsing. Most useful are boundary constraint parsing and feature constraint parsing.

With boundary constraint parsing, you can specify either a compiled re regular expression object or a string to tell MeCab where the boundaries of a morpheme should be. Use the boundary_constraints keyword. For hints on tokenization, please see Regular expression operations and re.finditer in particular.

This example uses the -F node-format option to customize the resulting MeCabNode feature attribute to extract:

%m - morpheme surface
%f[0] - node part-of-speech
%s - node stat status value, 1 is unknown

Note that any such morphemes captured will have node stat status of 1 (unknown):

    import re

    with MeCab(r'-F%m,\s%f[0],\s%s\n') as nm:

        text = '俺は努力したよっ？ お前の10倍、いや100倍1000倍したよっ！'

        # capture 10倍, 100倍 and 1000倍 as single parts-of-speech
        pattern = re.compile('10+倍')

        for n in nm.parse(text, boundary_constraints=pattern, as_nodes=True):
            print(n.feature)

    俺, 名詞, 0
    は, 助詞, 0
    努力, 名詞, 0
    し, 動詞, 0
    たよっ, 動詞, 0
    ？, 記号, 0
    お前, 名詞, 0
    の, 助詞, 0
    10倍, 名詞, 1
    、, 記号, 0
    いや, 接続詞, 0
    100倍, 名詞, 1
    1000倍, 名詞, 1
    し, 動詞, 0
    たよっ, 動詞, 0
    ！, 記号, 0
    EOS
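Before handing a pattern to boundary_constraints, it can be useful to check which substrings it actually matches. The sketch below uses only the standard re module on the same text and pattern; how natto-py applies the pattern internally is not shown here, so treat this purely as a way to preview the matches.

```python
import re

text = '俺は努力したよっ？ お前の10倍、いや100倍1000倍したよっ！'
pattern = re.compile('10+倍')

# Each match is a candidate morpheme boundary for MeCab.
matches = [m.group() for m in pattern.finditer(text)]
print(matches)  # ['10倍', '100倍', '1000倍']
```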

With feature constraint parsing, you can provide instructions to MeCab on what feature to use for a matching morpheme. Use the feature_constraints keyword to pass in a tuple containing elements that themselves are tuple instances with a specific morpheme (str) and a corresponding feature (str), in order of constraint precedence:

    with MeCab(r'-F%m,\s%f[0],\s%s\n') as nm:

        text = '心の中で3回唱え、 ヒーロー見参！ヒーロー見参！ヒーロー見参！'
        features = (('ヒーロー見参', '感動詞'),)

        for n in nm.parse(text, feature_constraints=features, as_nodes=True):
            print(n.feature)

    心, 名詞, 0
    の, 助詞, 0
    中, 名詞, 0
    で, 助詞, 0
    3, 名詞, 1
    回, 名詞, 0
    唱え, 動詞, 0
    、, 記号, 0
    ヒーロー見参, 感動詞, 1
    ！, 記号, 0
    ヒーロー見参, 感動詞, 1
    ！, 記号, 0
    ヒーロー見参, 感動詞, 1
    ！, 記号, 0
    EOS
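The tuple-of-tuples shape is easy to get wrong, for instance by forgetting the trailing comma in a single-element tuple. A small helper can build it from any ordered sequence of pairs. This is a sketch; `as_feature_constraints` is a hypothetical name and not part of natto-py.

```python
def as_feature_constraints(pairs):
    """Build a tuple of (morpheme, feature) tuples, preserving the given order."""
    constraints = tuple((morpheme, feature) for morpheme, feature in pairs)
    for morpheme, feature in constraints:
        if not isinstance(morpheme, str) or not isinstance(feature, str):
            raise TypeError("morpheme and feature must both be str")
    return constraints


features = as_feature_constraints([('ヒーロー見参', '感動詞')])
print(features)
```

The result can then be passed as the feature_constraints keyword argument, with earlier pairs taking precedence.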

Learn More

Examples and more detailed information about natto-py can be found on the project Wiki.
Working code in Jupyter notebook form can be found under this project's notebooks directory.
API documentation on Read the Docs.

Contributing to natto-py

Use git and check out the latest code at GitHub to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
Browse the issue tracker to make sure someone hasn't already requested it and/or contributed it.
Fork the project.
Start a feature/bugfix branch.
Commit and push until you are happy with your contribution.
Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
Please try not to mess with the setup.py, CHANGELOG, or version files. If you must have your own version, that is fine, but please isolate it in its own commit so I can cherry-pick around it.
This project uses the following packages for development:
- Sphinx for document generation
- twine for secure uploads during release
- unittest for unit tests, as it is very natural and easy-to-use
- PyYAML for data loading during tests

Changelog

Please see the CHANGELOG for the release history.

Copyright

About


Latest release: 1.0.1 (Sep 16, 2022); 13 releases in total.
