
natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.


buruzaemon/natto-py


What is natto-py?

A package leveraging FFI (foreign function interface), natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language. No compiler is necessary, as it is not a C extension. natto-py will run on Mac OS, Windows and *nix.

You can learn more about natto-py at GitHub.

If you are still using Python 2 after sunset, please stick with version natto-py==0.9.2.


Requirements

natto-py requires the following:

The following Python 3 versions are supported:

For Python 2, please use version 0.9.2.

Installation

Install natto-py as you would any other Python package:

$ pip install natto-py

This will automatically install the cffi package, which natto-py uses to bind to the mecab library.

Automatic Configuration

As long as the mecab (and mecab-config for *nix and Mac OS) executables are on your PATH, natto-py does not require any explicit configuration.

  • On *nix and Mac OS, it queries mecab-config to discover the path to the libmecab.so or libmecab.dylib, respectively.
  • On Windows, it queries the Windows Registry to locate the MeCab installation folder.
  • In order to convert character encodings to/from Unicode, natto-py will examine the charset of the mecab system dictionary.
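
The *nix lookup described above can be approximated with the standard library. This is a minimal sketch, not natto-py's actual implementation: the helper name find_libmecab_dir is hypothetical, and it assumes that mecab-config accepts the --libs-only-L flag and prints the directory that holds the mecab library.

```python
import shutil
import subprocess


def find_libmecab_dir(which=shutil.which, run=subprocess.run):
    """Return the directory reported by mecab-config, or None if unavailable.

    `which` and `run` are injectable so the lookup can be exercised
    without MeCab being installed.
    """
    exe = which("mecab-config")
    if exe is None:
        return None  # mecab-config is not on PATH
    # Assumption: --libs-only-L prints the directory holding libmecab
    result = run([exe, "--libs-only-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


print(find_libmecab_dir())
```

A result of None suggests that the explicit configuration described in the next section is needed.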

Explicit configuration via MECAB_PATH and MECAB_CHARSET

If natto-py for some reason cannot locate the mecab library, or if it cannot determine the correct charset used internally by mecab, then you will need to set the MECAB_PATH and MECAB_CHARSET environment variables.

  • Set the MECAB_PATH environment variable to the exact name/path to your mecab library.
  • Set the MECAB_CHARSET environment variable to the charset character encoding used by your system dictionary.

e.g., for Mac OS:

export MECAB_PATH=/usr/local/Cellar/mecab/0.996/lib/libmecab.dylib
export MECAB_CHARSET=utf8

e.g., for bash on UNIX/Linux:

export MECAB_PATH=/usr/local/lib/libmecab.so
export MECAB_CHARSET=euc-jp

e.g., on Windows:

set MECAB_PATH=C:\Program Files\MeCab\bin\libmecab.dll
set MECAB_CHARSET=shift-jis

e.g., from within a Python program:

import os

os.environ['MECAB_PATH'] = '/usr/local/lib/libmecab.so'
os.environ['MECAB_CHARSET'] = 'utf-16'
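
When setting these variables from Python, it can be convenient not to override values already exported in the shell. The sketch below assumes that natto-py reads the variables when MeCab() is instantiated, so they would need to be in place beforehand. The helper configure_mecab is hypothetical and not part of natto-py.

```python
import os


def configure_mecab(libpath, charset, env=os.environ):
    """Set MECAB_PATH and MECAB_CHARSET unless they are already defined."""
    # setdefault keeps any value the user exported before starting Python
    env.setdefault("MECAB_PATH", libpath)
    env.setdefault("MECAB_CHARSET", charset)
    return env["MECAB_PATH"], env["MECAB_CHARSET"]


# Example values taken from the UNIX/Linux settings above; a plain dict
# stands in for os.environ so the real environment is left untouched
settings = configure_mecab("/usr/local/lib/libmecab.so", "euc-jp", env={})
print(settings)
```

In a real program, omit the env argument so that os.environ itself is updated, and call the helper before creating a MeCab instance.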

Usage

Here's a very quick guide to using natto-py.

Instantiate a reference to the mecab library, and display some details:

from natto import MeCab

nm = MeCab()
print(nm)

# displays details about the MeCab instance
<natto.mecab.MeCab
 model=<cdata 'mecab_model_t *' 0x801c16300>,
 tagger=<cdata 'mecab_t *' 0x801c17470>,
 lattice=<cdata 'mecab_lattice_t *' 0x801c196c0>,
 libpath="/usr/local/lib/libmecab.so",
 options={},
 dicts=[<natto.dictionary.DictionaryInfo
         dictionary='mecab_dictionary_info_t *' 0x801c19540>,
         filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic",
         charset=utf8,
         type=0],
 version=0.996>

Display details about the mecab system dictionary used:

sysdic = nm.dicts[0]
print(sysdic)

# displays the MeCab system dictionary info
<natto.dictionary.DictionaryInfo
 dictionary='mecab_dictionary_info_t *' 0x801c19540>,
 filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic",
 charset=utf8,
 type=0>

Parse Japanese text and send the MeCab result as a single string to stdout:

print(nm.parse('ピンチの時には必ずヒーローが現れる。'))

# MeCab result as a single string
ピンチ	名詞,一般,*,*,*,*,ピンチ,ピンチ,ピンチ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
時	名詞,非自立,副詞可能,*,*,*,時,トキ,トキ
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
必ず	副詞,助詞類接続,*,*,*,*,必ず,カナラズ,カナラズ
ヒーロー	名詞,一般,*,*,*,*,ヒーロー,ヒーロー,ヒーロー
が	助詞,格助詞,一般,*,*,*,が,ガ,ガ
現れる	動詞,自立,*,*,一段,基本形,現れる,アラワレル,アラワレル
。	記号,句点,*,*,*,*,。,。,。
EOS
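
If you do work with the single-string result, note that MeCab's default output puts one morpheme on each line as a surface, a tab, and a comma-separated feature list, with EOS terminating the sentence. As a rough sketch using only the standard library, the string can be split into pairs as follows. The helper name split_mecab_output is hypothetical, and the sample string is abbreviated from the result above.

```python
def split_mecab_output(result):
    """Split MeCab's default output into (surface, features) pairs."""
    pairs = []
    for line in result.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        surface, _, feature = line.partition("\t")
        pairs.append((surface, feature.split(",")))
    return pairs


sample = (
    "ピンチ\t名詞,一般,*,*,*,*,ピンチ,ピンチ,ピンチ\n"
    "必ず\t副詞,助詞類接続,*,*,*,*,必ず,カナラズ,カナラズ\n"
    "EOS"
)
for surface, features in split_mecab_output(sample):
    # features[0] is the top-level part-of-speech
    print(surface, features[0])
```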

Next, try parsing the text with MeCab node parsing. A generator yielding the MeCabNode instances lets you efficiently iterate over the output without first materializing each and every resulting MeCabNode instance. The MeCabNode instances yielded allow access to more detailed information about each morpheme.

Here we use a Python with-statement to automatically clean up after we finish node parsing with the MeCab tagger. This is the recommended approach for using natto-py in a production environment:

from natto import MeCab

# Use a Python with-statement to ensure mecab_destroy is invoked
#
with MeCab() as nm:
    for n in nm.parse('ピンチの時には必ずヒーローが現れる。', as_nodes=True):
        # ignore any end-of-sentence nodes
        if not n.is_eos():
            print('{}\t{}'.format(n.surface, n.cost))

ピンチ	3348
の	3722
時	5176
に	5083
は	5305
必ず	7525
ヒーロー	11363
が	10508
現れる	10841
。	7127
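
Because node parsing yields its results one at a time, the output can be consumed by any ordinary iteration construct. A minimal sketch follows. The function name surfaces is hypothetical, and it relies only on the parse, is_eos and surface members shown above. The tagger is passed in as an argument so that the function can be tried with a stand-in object when MeCab is not installed.

```python
def surfaces(tagger, text):
    """Collect the surface of every node except end-of-sentence nodes."""
    return [n.surface
            for n in tagger.parse(text, as_nodes=True)
            if not n.is_eos()]


# Stand-in objects mimicking only the members used above (not natto-py classes)
class FakeNode:
    def __init__(self, surface, eos=False):
        self.surface = surface
        self._eos = eos

    def is_eos(self):
        return self._eos


class FakeTagger:
    def parse(self, text, as_nodes=False):
        # Pretend tokenizer: yields fixed nodes followed by an EOS node
        yield FakeNode("ヒーロー")
        yield FakeNode("が")
        yield FakeNode("現れる")
        yield FakeNode("", eos=True)


print(surfaces(FakeTagger(), "ヒーローが現れる"))
```

With MeCab installed, the same function would presumably be called with a real tagger, inside a with MeCab() as nm: block.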

MeCab output formatting is extremely flexible and is highly recommended for any serious natural language processing task. Rather than parsing the MeCab output as a single, large string, use MeCab's --node-format option (short form -F) to customize the node's feature attribute so that it carries only the fields you need, for example:

  • morpheme surface
  • part-of-speech
  • part-of-speech ID
  • pronunciation

It is good practice when using --node-format to also specify node formatting in the case where the morpheme cannot be found in the dictionary, by using --unk-format (short form -U).

This example formats the node feature to capture the items above as a comma-separated value:

from natto import MeCab

# MeCab options used:
#
# -F    ... short-form of --node-format
# %m    ... morpheme surface
# %f[0] ... part-of-speech
# %h    ... part-of-speech id (ipadic)
# %f[8] ... pronunciation
#
# -U    ... short-form of --unk-format
#           output ?,?,?,? for morphemes not in dictionary
#
with MeCab(r'-F%m,%f[0],%h,%f[8]\n -U?,?,?,?\n') as nm:
    for n in nm.parse('ピンチの時には必ずヒーローが現れる。', as_nodes=True):
        # only normal nodes, ignore any end-of-sentence and unknown nodes
        if n.is_nor():
            print(n.feature)

ピンチ,名詞,38,ピンチ
の,助詞,24,ノ
時,名詞,66,トキ
に,助詞,13,ニ
は,助詞,16,ワ
必ず,副詞,35,カナラズ
ヒーロー,名詞,38,ヒーロー
が,助詞,13,ガ
現れる,動詞,31,アラワレル
。,記号,7,。
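
Once the feature has been shaped with -F, each node's feature string is simple to unpack. The sketch below assumes the %m,%f[0],%h,%f[8] format used above. Morpheme and unpack_feature are hypothetical names, and the sample strings are copied from the output of that example.

```python
from collections import namedtuple

Morpheme = namedtuple("Morpheme", "surface pos pos_id pronunciation")


def unpack_feature(feature):
    """Split a 'surface,pos,pos-id,pronunciation' feature string."""
    # maxsplit=3 keeps any further commas inside the last field
    surface, pos, pos_id, pronunciation = feature.strip().split(",", 3)
    return Morpheme(surface, pos, int(pos_id), pronunciation)


for feature in ("ピンチ,名詞,38,ピンチ", "現れる,動詞,31,アラワレル"):
    print(unpack_feature(feature))
```

Unknown nodes formatted as ?,?,?,? would need separate handling, since ? cannot be converted to an integer. A surface that itself contains a comma would also call for a different separator in the format string.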

Partial parsing (制約付き解析) allows you to pass hints to MeCab on how to tokenize morphemes when parsing. Most useful are boundary constraint parsing and feature constraint parsing.

With boundary constraint parsing, you can specify either a compiled re regular expression object or a string to tell MeCab where the boundaries of a morpheme should be. Use the boundary_constraints keyword. For hints on tokenization, please see Regular expression operations and re.finditer in particular.

This example uses the -F node-format option to customize the resulting MeCabNode feature attribute to extract:

  • %m - morpheme surface
  • %f[0] - node part-of-speech
  • %s - node stat status value, 1 is unknown

Note that any such morphemes captured will have node stat status of 1 (unknown):

import re

from natto import MeCab

with MeCab(r'-F%m,\s%f[0],\s%s\n') as nm:
    text = '俺は努力したよっ? お前の10倍、いや100倍1000倍したよっ!'

    # capture 10倍, 100倍 and 1000倍 as single parts-of-speech
    pattern = re.compile('10+倍')

    for n in nm.parse(text, boundary_constraints=pattern, as_nodes=True):
        print(n.feature)

俺, 名詞, 0
は, 助詞, 0
努力, 名詞, 0
し, 動詞, 0
たよっ, 動詞, 0
?, 記号, 0
お前, 名詞, 0
の, 助詞, 0
10倍, 名詞, 1
、, 記号, 0
いや, 接続詞, 0
100倍, 名詞, 1
1000倍, 名詞, 1
し, 動詞, 0
たよっ, 動詞, 0
!, 記号, 0
EOS
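
To preview which spans a boundary pattern would capture before handing it to MeCab, the pattern can be run through re.finditer on its own. This sketch uses only the standard library, with the text and pattern from the example above. It illustrates the regular expression only, not how natto-py applies it internally.

```python
import re

text = '俺は努力したよっ? お前の10倍、いや100倍1000倍したよっ!'
pattern = re.compile('10+倍')

# Each match is a span that should be treated as a single morpheme
matches = [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]
for start, end, token in matches:
    print(start, end, token)
```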

With feature constraint parsing, you can provide instructions to MeCab on what feature to use for a matching morpheme. Use the feature_constraints keyword to pass in a tuple containing elements that themselves are tuple instances with a specific morpheme (str) and a corresponding feature (str), in order of constraint precedence:

from natto import MeCab

with MeCab(r'-F%m,\s%f[0],\s%s\n') as nm:
    text = '心の中で3回唱え、 ヒーロー見参!ヒーロー見参!ヒーロー見参!'
    features = (('ヒーロー見参', '感動詞'),)

    for n in nm.parse(text, feature_constraints=features, as_nodes=True):
        print(n.feature)

心, 名詞, 0
の, 助詞, 0
中, 名詞, 0
で, 助詞, 0
3, 名詞, 1
回, 名詞, 0
唱え, 動詞, 0
、, 記号, 0
ヒーロー見参, 感動詞, 1
!, 記号, 0
ヒーロー見参, 感動詞, 1
!, 記号, 0
ヒーロー見参, 感動詞, 1
!, 記号, 0
EOS
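
The nested-tuple shape expected by feature_constraints is easy to get wrong, since a single pair needs a trailing comma. A small helper can build it from an ordered sequence of pairs while checking the types described above. The name make_feature_constraints is hypothetical, and the helper is not part of natto-py.

```python
def make_feature_constraints(pairs):
    """Build a tuple of (morpheme, feature) tuples, preserving order."""
    constraints = []
    for morpheme, feature in pairs:
        if not (isinstance(morpheme, str) and isinstance(feature, str)):
            raise TypeError("morpheme and feature must both be str")
        constraints.append((morpheme, feature))
    return tuple(constraints)


features = make_feature_constraints([('ヒーロー見参', '感動詞')])
print(features)
```

The result can then be passed as the feature_constraints keyword argument, as in the example above.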

Learn More

Contributing to natto-py

  • Use git and check out the latest code at GitHub to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
  • Browse the issue tracker to make sure someone hasn't already requested it and/or contributed it.
  • Fork the project.
  • Start a feature/bugfix branch.
  • Commit and push until you are happy with your contribution.
  • Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
  • Please try not to mess with the setup.py, CHANGELOG, or version files. If you must have your own version, that is fine, but please isolate it in its own commit so I can cherry-pick around it.
  • This project uses the following packages for development:
    • Sphinx for document generation
    • twine for secure uploads during release
    • unittest for unit tests, as it is very natural and easy-to-use
    • PyYAML for data loading during tests

Changelog

Please see the CHANGELOG for the release history.

Copyright

Copyright © 2022, Brooke M. Fujita. All rights reserved. Please see the LICENSE file for further details.
