This repository was archived by the owner on Mar 9, 2023. It is now read-only.

WorksApplications/SudachiPy (public archive)

Python version of Sudachi, a Japanese tokenizer.

License

Apache-2.0 license

SudachiPy

日本語

SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.

Warning

This repository covers the 0.5.* versions of SudachiPy. Versions 0.6.* and above are developed as Sudachi.rs.

TL;DR

$ pip install sudachipy sudachidict_core

$ echo "高輪ゲートウェイ駅" | sudachipy
高輪ゲートウェイ駅	名詞,固有名詞,一般,*,*,*	高輪ゲートウェイ駅
EOS

$ echo "高輪ゲートウェイ駅" | sudachipy -m A
高輪	名詞,固有名詞,地名,一般,*,*	高輪
ゲートウェイ	名詞,普通名詞,一般,*,*,*	ゲートウェー
駅	名詞,普通名詞,一般,*,*,*	駅
EOS

$ echo "空缶空罐空きカン" | sudachipy -a
空缶	名詞,普通名詞,一般,*,*,*	空き缶	空缶	アキカン	0
空罐	名詞,普通名詞,一般,*,*,*	空き缶	空罐	アキカン	0
空きカン	名詞,普通名詞,一般,*,*,*	空き缶	空きカン	アキカン	0
EOS

Setup

You need SudachiPy and a dictionary.

Step 1. Install SudachiPy

$ pip install sudachipy

Step 2. Get a Dictionary

You can get a dictionary as a Python package. It may take a while to download the dictionary file (around 70MB for the core edition).

$ pip install sudachidict_core

Alternatively, you can choose another dictionary edition. See the Dictionary Edition section for details.
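To confirm that both pieces are in place after installation, a small check along the following lines may help. This is only a sketch: the helper name `missing_packages` is hypothetical, and it assumes each package is importable under the same name it is installed with (for example `sudachidict_core`).

```python
from importlib.util import find_spec


def missing_packages(names=("sudachipy", "sudachidict_core")):
    """Return the names from `names` that cannot be found as importable modules."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("SudachiPy and the core dictionary are installed.")
```

Pass other names, such as `("sudachipy", "sudachidict_full")`, to check for a different dictionary edition.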

Usage: As a command

SudachiPy provides a CLI command, sudachipy.

$ echo "外国人参政権" | sudachipy
外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権
EOS
$ echo "外国人参政権" | sudachipy -m A
外国	名詞,普通名詞,一般,*,*,*	外国
人	接尾辞,名詞的,一般,*,*,*	人
参政	名詞,普通名詞,一般,*,*,*	参政
権	接尾辞,名詞的,一般,*,*,*	権
EOS

$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-s string]
                          [-a] [-d] [-v]
                          [file [file ...]]

Tokenize Text

positional arguments:
  file           text written in utf-8

optional arguments:
  -h, --help     show this help message and exit
  -r file        the setting file in JSON format
  -m {A,B,C}     the mode of splitting
  -o file        the output file
  -s string      sudachidict type
  -a             print all of the fields
  -d             print the debug information
  -v, --version  print sudachipy version

Output

Columns are tab separated.

- Surface
- Part-of-Speech Tags (comma separated)
- Normalized Form

When you add the -a option, it additionally outputs

- Dictionary Form
- Reading Form
- Dictionary ID
  - 0 for the system dictionary
  - 1 and above for the user dictionaries
  - -1\t(OOV) if a word is Out-of-Vocabulary (not in the dictionary)

$ echo "外国人参政権" | sudachipy -a
外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権	外国人参政権	ガイコクジンサンセイケン	0
EOS

$ echo "阿quei" | sudachipy -a
阿	名詞,普通名詞,一般,*,*,*	阿	阿		-1	(OOV)
quei	名詞,普通名詞,一般,*,*,*	quei	quei		-1	(OOV)
EOS
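If you post-process the command output in Python, the column layout described above can be parsed roughly as follows. This is a minimal sketch, not part of SudachiPy: `TokenLine` and `parse_line` are hypothetical names, and the sketch assumes that an OOV line carries an empty Reading Form column followed by -1 and (OOV).

```python
from typing import NamedTuple, Optional


class TokenLine(NamedTuple):
    surface: str
    part_of_speech: list
    normalized_form: str
    dictionary_form: Optional[str] = None
    reading_form: Optional[str] = None
    dictionary_id: Optional[int] = None
    is_oov: bool = False


def parse_line(line: str) -> Optional[TokenLine]:
    """Parse one tab-separated output line; return None for the EOS marker."""
    line = line.rstrip("\n")
    if line == "EOS":
        return None
    fields = line.split("\t")
    surface, pos, normalized = fields[0], fields[1].split(","), fields[2]
    if len(fields) < 6:
        # Default output: only the first three columns are present.
        return TokenLine(surface, pos, normalized)
    # Output produced with -a: three more columns, plus "(OOV)" when applicable.
    return TokenLine(
        surface,
        pos,
        normalized,
        dictionary_form=fields[3],
        reading_form=fields[4],
        dictionary_id=int(fields[5]),
        is_oov=len(fields) > 6 and fields[6] == "(OOV)",
    )
```

Returning None for EOS lets a caller split a stream of lines into sentences without special-casing the marker elsewhere.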

Usage: As a Python package

Here is an example:

from sudachipy import tokenizer
from sudachipy import dictionary

tokenizer_obj = dictionary.Dictionary().create()

# Multi-granular Tokenization

mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家公務員']

mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務員']

mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務', '員']

# Morpheme information

m = tokenizer_obj.tokenize("食べ", mode)[0]

m.surface()  # => '食べ'
m.dictionary_form()  # => '食べる'
m.reading_form()  # => 'タベ'
m.part_of_speech()  # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']

# Normalization

tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'

(With the 20200330 core dictionary. The results may change when you use other versions.)
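To compare the three split modes side by side, the per-mode calls can be wrapped in a small helper. The sketch below is illustrative only: `surfaces_by_mode` is a hypothetical name, and it relies solely on the `tokenize(text, mode)` and `surface()` calls shown in the example.

```python
def surfaces_by_mode(tokenizer_obj, text, modes):
    """Tokenize `text` once per split mode and collect the surface forms.

    `modes` maps a label (e.g. "A") to a split-mode object accepted by
    `tokenizer_obj.tokenize(text, mode)`.
    """
    return {
        label: [m.surface() for m in tokenizer_obj.tokenize(text, mode)]
        for label, mode in modes.items()
    }


# Usage with SudachiPy (requires the package and a dictionary to be installed):
#   from sudachipy import tokenizer, dictionary
#   tokenizer_obj = dictionary.Dictionary().create()
#   SplitMode = tokenizer.Tokenizer.SplitMode
#   surfaces_by_mode(tokenizer_obj, "国家公務員",
#                    {"A": SplitMode.A, "B": SplitMode.B, "C": SplitMode.C})
```

Because the helper only needs an object with a `tokenize` method, it can be exercised with a stub tokenizer in tests where no dictionary is installed.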

Dictionary Edition

**WARNING: The sudachipy link subcommand is no longer available in SudachiPy v0.5.2 and later.**

There are three editions of Sudachi Dictionary, namely small, core, and full. See WorksApplications/SudachiDict for details.

SudachiPy uses sudachidict_core by default.

Dictionaries are installed as the Python packages sudachidict_small, sudachidict_core, and sudachidict_full.

The dictionary files are not included in the packages themselves; they are downloaded upon installation.

Dictionary option: command line

You can specify the dictionary with the tokenize option -s.

$ pip install sudachidict_small
$ echo "外国人参政権" | sudachipy -s small

$ pip install sudachidict_full
$ echo "外国人参政権" | sudachipy -s full
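When the command is driven from a script rather than a shell, the same options can be assembled programmatically. The following is a rough sketch using only the -m, -s, and -a options documented above. `build_command` and `run_sudachipy` are hypothetical helper names, and the sketch assumes the sudachipy command is on PATH and reads UTF-8 text from standard input.

```python
import subprocess


def build_command(mode=None, dict_type=None, all_fields=False):
    """Build the argument list for the sudachipy CLI from the options shown above."""
    cmd = ["sudachipy"]
    if mode is not None:
        cmd += ["-m", mode]  # split mode: "A", "B", or "C"
    if dict_type is not None:
        cmd += ["-s", dict_type]  # dictionary edition: "small", "core", or "full"
    if all_fields:
        cmd.append("-a")  # print all of the fields
    return cmd


def run_sudachipy(text, **options):
    """Pipe `text` to the CLI on stdin and return its stdout as a string."""
    result = subprocess.run(
        build_command(**options),
        input=text + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

For example, `run_sudachipy("外国人参政権", dict_type="small")` corresponds to the first command above, provided sudachidict_small is installed.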

Dictionary option: Python package

You can specify the dictionary with the config_path or dict_type argument of Dictionary().

class Dictionary(config_path=None, resource_dir=None, dict_type=None)

- config_path
  - You can specify the file path to the setting file with config_path (see the section Dictionary in The Setting File for details).
  - If the dictionary file is specified in the setting file as systemDict, SudachiPy will use that dictionary.
- dict_type
  - You can also specify the dictionary type with dict_type.
  - The available arguments are small, core, or full.
  - If different dictionaries are specified with config_path and dict_type, the dictionary specified by dict_type overrides the one defined in the setting file.

from sudachipy import tokenizer
from sudachipy import dictionary

# default: sudachidict_core
tokenizer_obj = dictionary.Dictionary().create()

# The dictionary given by the `systemDict` key in the config file (/path/to/sudachi.json) will be used
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json").create()

# The dictionary specified by `dict_type` will be set.
tokenizer_obj = dictionary.Dictionary(dict_type="core").create()  # sudachidict_core (same as default)
tokenizer_obj = dictionary.Dictionary(dict_type="small").create()  # sudachidict_small
tokenizer_obj = dictionary.Dictionary(dict_type="full").create()  # sudachidict_full

# The dictionary specified by `dict_type` overrides those defined in the config path.
# In the following code, `sudachidict_full` will be used regardless of a dictionary defined in the config file.
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json", dict_type="full").create()
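The precedence described in this section can be summarized in a few lines. The function below is purely an illustration of the documented rules (dict_type first, then systemDict from the setting file, then the core edition). It is not SudachiPy's actual implementation, and `choose_dictionary` is a hypothetical name.

```python
def choose_dictionary(dict_type=None, config_system_dict=None):
    """Illustrate the precedence: dict_type, then systemDict from the config, then core."""
    if dict_type is not None:
        if dict_type not in ("small", "core", "full"):
            raise ValueError("dict_type must be 'small', 'core', or 'full'")
        return "sudachidict_" + dict_type
    if config_system_dict is not None:
        return config_system_dict
    return "sudachidict_core"
```

Here `config_system_dict` stands for the value of the `systemDict` key, if any, read from the setting file.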

Dictionary in The Setting File

Alternatively, if the dictionary file is specified in the setting file, sudachi.json, SudachiPy will use that file.

{
    "systemDict" : "relative/path/to/system.dic",
    ...
}

The default setting file is sudachipy/resources/sudachi.json. You can specify your own sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json

User Dictionary

To use a user dictionary, user.dic, place sudachi.json anywhere you like, and add a userDict value with the relative path from sudachi.json to your user.dic.

{
    "userDict" : ["relative/path/to/user.dic"],
    ...
}

Then specify your sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json
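Editing the setting file by hand works fine, but the userDict entry can also be added with a short script. The sketch below is one possible approach under two assumptions: the setting file already exists as valid JSON (for example, a copy of the default sudachi.json), and "relative path from sudachi.json" means relative to the directory that contains it. `add_user_dict` is a hypothetical helper name.

```python
import json
import os


def add_user_dict(settings_path, user_dic_path):
    """Add `user_dic_path` to the userDict list of the JSON file at `settings_path`.

    The path is stored relative to the directory containing the settings file.
    Existing keys in the file are preserved.
    """
    with open(settings_path, encoding="utf-8") as f:
        settings = json.load(f)
    base_dir = os.path.dirname(os.path.abspath(settings_path))
    relative = os.path.relpath(os.path.abspath(user_dic_path), base_dir)
    user_dicts = settings.setdefault("userDict", [])
    if relative not in user_dicts:
        user_dicts.append(relative)
    with open(settings_path, "w", encoding="utf-8") as f:
        json.dump(settings, f, ensure_ascii=False, indent=4)
    return settings
```

The resulting file can then be passed to the command with -r, as shown above.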

You can build a user dictionary with the subcommand ubuild.

WARNING: ubuild in v0.3.* contains a bug.

$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]

Build User Dictionary

positional arguments:
  file        source files with CSV format (one or more)

optional arguments:
  -h, --help  show this help message and exit
  -d string   description comment to be embedded on dictionary
  -o file     output file (default: user.dic)
  -s file     system dictionary path (default: system core dictionary path)

For the dictionary file format, please refer to this document (written in Japanese; an English version is not available yet).

Customized System Dictionary

$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]

Build Sudachi Dictionary

positional arguments:
  file        source files with CSV format (one of more)

optional arguments:
  -h, --help  show this help message and exit
  -o file     output file (default: system.dic)
  -d string   description comment to be embedded on dictionary

required named arguments:
  -m file     connection matrix file with MeCab's matrix.def format

To use your customized system.dic, place sudachi.json anywhere you like, and overwrite the systemDict value with the relative path from sudachi.json to your system.dic.

{
    "systemDict" : "relative/path/to/system.dic",
    ...
}

Then specify your sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json

For Developers

Cython Build

$ python setup.py build_ext --inplace

Code Format

Run scripts/format.sh to check if your code is formatted correctly.

You need the packages flake8, flake8-import-order, and flake8-builtins (see requirements.txt).

Test

Run scripts/test.sh to run the tests.

Contact

Sudachi and SudachiPy are developed by WAP Tokushima Laboratory of AI and NLP.

Open an issue, or come to our Slack workspace for questions and discussion.

https://sudachi-dev.slack.com/ (Get an invitation here)

Enjoy tokenization!

Releases: 21 (latest: v0.5.4, Sep 27, 2021)
