This repository was archived by the owner on Mar 9, 2023. It is now read-only.

Python version of Sudachi, a Japanese tokenizer.


WorksApplications/SudachiPy


SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.

Warning

This repository is for the 0.5.* versions of SudachiPy; versions 0.6.* and above are developed as Sudachi.rs.

TL;DR

$ pip install sudachipy sudachidict_core

$ echo "高輪ゲートウェイ駅" | sudachipy
高輪ゲートウェイ駅	名詞,固有名詞,一般,*,*,*	高輪ゲートウェイ駅
EOS

$ echo "高輪ゲートウェイ駅" | sudachipy -m A
高輪	名詞,固有名詞,地名,一般,*,*	高輪
ゲートウェイ	名詞,普通名詞,一般,*,*,*	ゲートウェー
駅	名詞,普通名詞,一般,*,*,*	駅
EOS

$ echo "空缶空罐空きカン" | sudachipy -a
空缶	名詞,普通名詞,一般,*,*,*	空き缶	空缶	アキカン	0
空罐	名詞,普通名詞,一般,*,*,*	空き缶	空罐	アキカン	0
空きカン	名詞,普通名詞,一般,*,*,*	空き缶	空きカン	アキカン	0
EOS

Setup

You need SudachiPy and a dictionary.

Step 1. Install SudachiPy

$ pip install sudachipy

Step 2. Get a Dictionary

You can get the dictionary as a Python package. It may take a while to download the dictionary file (around 70MB for the core edition).

$ pip install sudachidict_core

Alternatively, you can choose another dictionary edition. See the Dictionary Edition section for details.

Usage: As a command

There is a CLI command, sudachipy.

$ echo "外国人参政権" | sudachipy
外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権
EOS

$ echo "外国人参政権" | sudachipy -m A
外国	名詞,普通名詞,一般,*,*,*	外国
人	接尾辞,名詞的,一般,*,*,*	人
参政	名詞,普通名詞,一般,*,*,*	参政
権	接尾辞,名詞的,一般,*,*,*	権
EOS

$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-s string]
                          [-a] [-d] [-v]
                          [file [file ...]]

Tokenize Text

positional arguments:
  file           text written in utf-8

optional arguments:
  -h, --help     show this help message and exit
  -r file        the setting file in JSON format
  -m {A,B,C}     the mode of splitting
  -o file        the output file
  -s string      sudachidict type
  -a             print all of the fields
  -d             print the debug information
  -v, --version  print sudachipy version

Output

Columns are tab separated.

  • Surface
  • Part-of-Speech Tags (comma separated)
  • Normalized Form

When you add the -a option, it additionally outputs

  • Dictionary Form
  • Reading Form
  • Dictionary ID
    • 0 for the system dictionary
    • 1 and above for the user dictionaries
    • -1\t(OOV) if a word is Out-of-Vocabulary (not in the dictionary)
$ echo "外国人参政権" | sudachipy -a
外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権	外国人参政権	ガイコクジンサンセイケン	0
EOS

$ echo "阿quei" | sudachipy -a
阿	名詞,普通名詞,一般,*,*,*	阿	阿		-1	(OOV)
quei	名詞,普通名詞,一般,*,*,*	quei	quei		-1	(OOV)
EOS
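If you post-process the CLI output in Python, the tab-separated columns listed above can be read back into named fields. The following is a minimal sketch, not part of SudachiPy: `Row` and `parse_sudachipy_output` are hypothetical names, and the sketch assumes that every morpheme line has at least the three default columns and that the `(OOV)` marker, when present, is the last column.

```python
from typing import Iterator, List, NamedTuple, Optional


class Row(NamedTuple):
    """One morpheme line of sudachipy CLI output (hypothetical helper type)."""
    surface: str
    pos: List[str]                         # part-of-speech tags, split on commas
    normalized: str
    dictionary_form: Optional[str] = None  # only filled for -a output
    reading: Optional[str] = None          # only filled for -a output
    dictionary_id: Optional[int] = None    # only filled for -a output
    oov: bool = False


def parse_sudachipy_output(text: str) -> Iterator[Row]:
    """Yield one Row per morpheme line, skipping EOS markers and blank lines."""
    for line in text.split("\n"):
        line = line.rstrip("\r")
        if not line or line == "EOS":
            continue
        cols = line.split("\t")
        # Assumption: the OOV marker is a trailing "(OOV)" column.
        oov = cols[-1] == "(OOV)"
        if oov:
            cols = cols[:-1]
        surface, pos, normalized = cols[0], cols[1].split(","), cols[2]
        if len(cols) == 3:  # default output without -a
            yield Row(surface, pos, normalized)
            continue
        # -a output: the dictionary ID is the last remaining column;
        # the reading may be an empty field.
        reading = cols[4] if len(cols) > 5 else ""
        yield Row(surface, pos, normalized, cols[3], reading, int(cols[-1]), oov)
```

For example, `list(parse_sudachipy_output(captured_text))` gives one `Row` per morpheme, where `captured_text` is whatever you captured from the command's standard output.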

Usage: As a Python package

Here is an example:

from sudachipy import tokenizer
from sudachipy import dictionary

tokenizer_obj = dictionary.Dictionary().create()

# Multi-granular Tokenization

mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家公務員']

mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務員']

mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務', '員']

# Morpheme information

m = tokenizer_obj.tokenize("食べ", mode)[0]

m.surface()  # => '食べ'
m.dictionary_form()  # => '食べる'
m.reading_form()  # => 'タベ'
m.part_of_speech()  # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']

# Normalization

tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'

(With the 20200330 core dictionary. The results may change when you use other versions.)
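To compare the three split modes on the same text in one call, the calls above can be wrapped in a small helper. This is a sketch under stated assumptions: `surfaces_by_mode` and `demo` are hypothetical names, not part of SudachiPy, and the helper only relies on the `tokenize(text, mode)` and `surface()` calls shown in the example.

```python
def surfaces_by_mode(tokenizer_obj, text, modes):
    """Return {label: list of surface strings} for each split mode in `modes`.

    `modes` maps a label of your choice to a SplitMode value.
    """
    return {
        label: [m.surface() for m in tokenizer_obj.tokenize(text, mode)]
        for label, mode in modes.items()
    }


def demo():
    # Requires sudachipy and a dictionary package (e.g. sudachidict_core).
    from sudachipy import dictionary, tokenizer

    tokenizer_obj = dictionary.Dictionary().create()
    split_mode = tokenizer.Tokenizer.SplitMode
    modes = {"A": split_mode.A, "B": split_mode.B, "C": split_mode.C}
    for label, surfaces in surfaces_by_mode(tokenizer_obj, "国家公務員", modes).items():
        print(label, surfaces)
```

Calling `demo()` prints one line per mode; the exact splits depend on the installed dictionary version.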

Dictionary Edition

WARNING: sudachipy link is no longer available in SudachiPy v0.5.2 and later.

There are three editions of Sudachi Dictionary, namely, small, core, and full. See WorksApplications/SudachiDict for details.

SudachiPy uses sudachidict_core by default.

Dictionaries are installed as the Python packages sudachidict_small, sudachidict_core, and sudachidict_full.

The dictionary files are not included in the packages themselves; they are downloaded upon installation.

Dictionary option: command line

You can specify the dictionary with the tokenize option -s.

$ pip install sudachidict_small
$ echo "外国人参政権" | sudachipy -s small

$ pip install sudachidict_full
$ echo "外国人参政権" | sudachipy -s full

Dictionary option: Python package

You can specify the dictionary with the Dictionary() arguments config_path or dict_type.

class Dictionary(config_path=None, resource_dir=None, dict_type=None)
  1. config_path
    • You can specify the file path to the setting file with config_path (see the Dictionary in The Setting File section for details).
    • If the dictionary file is specified in the setting file as systemDict, SudachiPy will use that dictionary.
  2. dict_type
    • You can also specify the dictionary type with dict_type.
    • The available arguments are small, core, or full.
    • If different dictionaries are specified with config_path and dict_type, the dictionary specified by dict_type overrides the one defined in the setting file.
from sudachipy import tokenizer
from sudachipy import dictionary

# default: sudachidict_core
tokenizer_obj = dictionary.Dictionary().create()

# The dictionary given by the `systemDict` key in the config file (/path/to/sudachi.json) will be used
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json").create()

# The dictionary specified by `dict_type` will be set.
tokenizer_obj = dictionary.Dictionary(dict_type="core").create()  # sudachidict_core (same as default)
tokenizer_obj = dictionary.Dictionary(dict_type="small").create()  # sudachidict_small
tokenizer_obj = dictionary.Dictionary(dict_type="full").create()  # sudachidict_full

# The dictionary specified by `dict_type` overrides those defined in the config path.
# In the following code, `sudachidict_full` will be used regardless of a dictionary defined in the config file.
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json", dict_type="full").create()

Dictionary in The Setting File

Alternatively, if the dictionary file is specified in the setting file, sudachi.json, SudachiPy will use that file.

{
    "systemDict" : "relative/path/to/system.dic",
    ...
}

The default setting file is sudachipy/resources/sudachi.json. You can specify your sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json

User Dictionary

To use a user dictionary, user.dic, place sudachi.json anywhere you like, and add a userDict value with the relative path from sudachi.json to your user.dic.

{
    "userDict" : ["relative/path/to/user.dic"],
    ...
}

Then specify your sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json
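The setting file for a user dictionary can also be generated with a short script. The sketch below is not part of SudachiPy: `write_config_with_user_dict` is a hypothetical helper, and it assumes that you start from a copy of an existing setting file that parses as plain JSON. It computes the relative path from the new sudachi.json to user.dic and leaves every other key untouched, so check that any other relative paths in the copied file still resolve from the new location.

```python
import json
import os


def write_config_with_user_dict(base_config, out_config, user_dic):
    """Copy the settings in base_config to out_config, adding a userDict entry.

    The userDict path is written relative to the directory of out_config.
    Returns the settings that were written.
    """
    with open(base_config, encoding="utf-8") as f:
        settings = json.load(f)

    out_dir = os.path.dirname(os.path.abspath(out_config))
    relative_path = os.path.relpath(os.path.abspath(user_dic), start=out_dir)
    settings["userDict"] = [relative_path]  # a list, as in the example above

    with open(out_config, "w", encoding="utf-8") as f:
        json.dump(settings, f, ensure_ascii=False, indent=4)
    return settings
```

The resulting file can then be passed to the CLI with -r, or to Dictionary() through config_path.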

You can build a user dictionary with the subcommand ubuild.

WARNING: ubuild in v0.3.* contains a bug.

$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]

Build User Dictionary

positional arguments:
  file        source files with CSV format (one or more)

optional arguments:
  -h, --help  show this help message and exit
  -d string   description comment to be embedded on dictionary
  -o file     output file (default: user.dic)
  -s file     system dictionary path (default: system core dictionary path)

For the dictionary file format, please refer to this document (written in Japanese; an English version is not available yet).
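If you rebuild the user dictionary from a script, the ubuild invocation can be assembled from Python. This is a minimal sketch that only uses the options listed in the help text above; `ubuild_command` and `build_user_dictionary` are hypothetical helper names, and running the command assumes that the sudachipy executable is on your PATH.

```python
import subprocess


def ubuild_command(sources, output="user.dic", description=None, system_dict=None):
    """Return the argument list for `sudachipy ubuild` without running it."""
    sources = list(sources)
    if not sources:
        raise ValueError("at least one CSV source file is required")
    cmd = ["sudachipy", "ubuild", "-o", output]
    if description is not None:
        cmd += ["-d", description]
    if system_dict is not None:
        cmd += ["-s", system_dict]
    return cmd + sources


def build_user_dictionary(sources, **options):
    """Run ubuild; raises CalledProcessError if the command exits non-zero."""
    subprocess.run(ubuild_command(sources, **options), check=True)
```

Keeping the argument list separate from the call makes it easy to log or inspect the command before running it.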

Customized System Dictionary

$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]

Build Sudachi Dictionary

positional arguments:
  file        source files with CSV format (one of more)

optional arguments:
  -h, --help  show this help message and exit
  -o file     output file (default: system.dic)
  -d string   description comment to be embedded on dictionary

required named arguments:
  -m file     connection matrix file with MeCab's matrix.def format

To use your customized system.dic, place sudachi.json anywhere you like, and overwrite the systemDict value with the relative path from sudachi.json to your system.dic.

{
    "systemDict" : "relative/path/to/system.dic",
    ...
}

Then specify your sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json

For Developers

Cython Build

$ python setup.py build_ext --inplace

Code Format

Run scripts/format.sh to check if your code is formatted correctly.

You need the packages flake8, flake8-import-order, and flake8-builtins (see requirements.txt).

Test

Run scripts/test.sh to run the tests.

Contact

Sudachi and SudachiPy are developed by WAP Tokushima Laboratory of AI and NLP.

Open an issue, or come to our Slack workspace for questions and discussion.

https://sudachi-dev.slack.com/ (Get an invitation here)

Enjoy tokenization!
