Python version of Sudachi, a Japanese tokenizer.
WorksApplications/SudachiPy
SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.
This repository is for SudachiPy 0.5.*; versions 0.6.* and above are developed as Sudachi.rs.
$ pip install sudachipy sudachidict_core

$ echo "高輪ゲートウェイ駅" | sudachipy
高輪ゲートウェイ駅	名詞,固有名詞,一般,*,*,*	高輪ゲートウェイ駅
EOS

$ echo "高輪ゲートウェイ駅" | sudachipy -m A
高輪	名詞,固有名詞,地名,一般,*,*	高輪
ゲートウェイ	名詞,普通名詞,一般,*,*,*	ゲートウェー
駅	名詞,普通名詞,一般,*,*,*	駅
EOS

$ echo "空缶空罐空きカン" | sudachipy -a
空缶	名詞,普通名詞,一般,*,*,*	空き缶	空缶	アキカン	0
空罐	名詞,普通名詞,一般,*,*,*	空き缶	空罐	アキカン	0
空きカン	名詞,普通名詞,一般,*,*,*	空き缶	空きカン	アキカン	0
EOS
You need SudachiPy and a dictionary.
$ pip install sudachipy
You can get a dictionary as a Python package. It may take a while to download the dictionary file (around 70MB for the `core` edition).
$ pip install sudachidict_core
Alternatively, you can choose other dictionary editions. See the description of dictionary editions below for details.
There is a CLI command, `sudachipy`.
$ echo "外国人参政権" | sudachipy
外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権
EOS

$ echo "外国人参政権" | sudachipy -m A
外国	名詞,普通名詞,一般,*,*,*	外国
人	接尾辞,名詞的,一般,*,*,*	人
参政	名詞,普通名詞,一般,*,*,*	参政
権	接尾辞,名詞的,一般,*,*,*	権
EOS
$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-s string]
                          [-a] [-d] [-v]
                          [file [file ...]]

Tokenize Text

positional arguments:
  file           text written in utf-8

optional arguments:
  -h, --help     show this help message and exit
  -r file        the setting file in JSON format
  -m {A,B,C}     the mode of splitting
  -o file        the output file
  -s string      sudachidict type
  -a             print all of the fields
  -d             print the debug information
  -v, --version  print sudachipy version
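If you want to drive the CLI from a Python script, the options in the help text above can be assembled into an argument list. The following is a minimal sketch, not part of SudachiPy: the function names `build_tokenize_command` and `run_tokenize` and their parameter names are hypothetical, and only the flags shown in the help text are used.

```python
import subprocess
from typing import List, Optional, Sequence


def build_tokenize_command(
    mode: Optional[str] = None,
    dict_type: Optional[str] = None,
    settings: Optional[str] = None,
    output: Optional[str] = None,
    all_fields: bool = False,
    debug: bool = False,
    files: Sequence[str] = (),
) -> List[str]:
    """Build the argument list for `sudachipy tokenize` (hypothetical helper)."""
    cmd = ["sudachipy", "tokenize"]
    if settings is not None:
        cmd += ["-r", settings]      # the setting file in JSON format
    if mode is not None:
        if mode not in ("A", "B", "C"):
            raise ValueError("mode must be one of A, B, C")
        cmd += ["-m", mode]          # the mode of splitting
    if output is not None:
        cmd += ["-o", output]        # the output file
    if dict_type is not None:
        cmd += ["-s", dict_type]     # sudachidict type
    if all_fields:
        cmd.append("-a")             # print all of the fields
    if debug:
        cmd.append("-d")             # print the debug information
    cmd.extend(files)                # input files; stdin is used when empty
    return cmd


def run_tokenize(text: str, **options) -> str:
    """Pipe `text` to the CLI and return its output.

    Requires the `sudachipy` command and a dictionary to be installed.
    """
    result = subprocess.run(
        build_tokenize_command(**options),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

For example, `run_tokenize("外国人参政権", mode="A")` would correspond to `echo "外国人参政権" | sudachipy tokenize -m A`.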
Columns are tab separated.
- Surface
- Part-of-Speech Tags (comma separated)
- Normalized Form
When you add the `-a` option, it additionally outputs
- Dictionary Form
- Reading Form
- Dictionary ID
  - `0` for the system dictionary
  - `1` and above for the user dictionaries
  - `-1\t(OOV)` if a word is Out-of-Vocabulary (not in the dictionary)
$ echo "外国人参政権" | sudachipy -a
外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権	外国人参政権	ガイコクジンサンセイケン	0
EOS

$ echo "阿quei" | sudachipy -a
阿	名詞,普通名詞,一般,*,*,*	阿	阿		-1	(OOV)
quei	名詞,普通名詞,一般,*,*,*	quei	quei		-1	(OOV)
EOS
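The column layout described above can be read back into Python with a small parser. This is a sketch only: the function name `parse_token_line` and the dictionary keys are hypothetical, and it assumes that the reading form always occupies its own (possibly empty) column and that the `(OOV)` marker arrives as one extra tab-separated field after the dictionary ID.

```python
from typing import Dict, Optional


def parse_token_line(line: str) -> Optional[Dict[str, object]]:
    """Parse one line of `sudachipy` output; return None for the EOS marker."""
    line = line.rstrip("\n")
    if line == "EOS":
        return None
    fields = line.split("\t")
    token: Dict[str, object] = {
        "surface": fields[0],
        "pos": fields[1].split(","),   # part-of-speech tags are comma separated
        "normalized_form": fields[2],
    }
    if len(fields) >= 6:               # extra columns printed with -a
        token["dictionary_form"] = fields[3]
        token["reading_form"] = fields[4]
        token["dictionary_id"] = int(fields[5])
        # Assumption: OOV words carry a trailing "(OOV)" field after the ID -1.
        token["is_oov"] = len(fields) > 6 and fields[6] == "(OOV)"
    return token
```

Lines produced without `-a` yield only the first three keys, so callers should check for the presence of `dictionary_id` before using it.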
Here is an example:
from sudachipy import tokenizer
from sudachipy import dictionary

tokenizer_obj = dictionary.Dictionary().create()

# Multi-granular Tokenization

mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家公務員']

mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務員']

mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務', '員']

# Morpheme information

m = tokenizer_obj.tokenize("食べ", mode)[0]

m.surface()  # => '食べ'
m.dictionary_form()  # => '食べる'
m.reading_form()  # => 'タベ'
m.part_of_speech()  # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']

# Normalization

tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'
(With the `20200330` `core` dictionary. The results may change when you use other versions.)
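To compare the three split modes side by side, the repeated list comprehensions from the example can be wrapped in one helper. This is a small sketch; the function name `surfaces_by_mode` is hypothetical, and it relies only on the `tokenize(text, mode)` and `surface()` calls shown in the example.

```python
def surfaces_by_mode(tokenizer_obj, text, modes):
    """Return {mode name: [surface, ...]} for each split mode in `modes`.

    `tokenizer_obj` is any object with a tokenize(text, mode) method whose
    results expose surface(), such as the tokenizer created in the example.
    """
    return {
        name: [m.surface() for m in tokenizer_obj.tokenize(text, mode)]
        for name, mode in modes.items()
    }


# With SudachiPy and a dictionary installed, this could be called as:
#
#   from sudachipy import dictionary, tokenizer
#   tokenizer_obj = dictionary.Dictionary().create()
#   split = tokenizer.Tokenizer.SplitMode
#   surfaces_by_mode(tokenizer_obj, "国家公務員",
#                    {"A": split.A, "B": split.B, "C": split.C})
```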
**WARNING: `sudachipy link` is no longer available in SudachiPy v0.5.2 and later.**
There are three editions of Sudachi Dictionary, namely, `small`, `core`, and `full`. See WorksApplications/SudachiDict for details.
SudachiPy uses `sudachidict_core` by default.
Dictionaries are installed as Python packages `sudachidict_small`, `sudachidict_core`, and `sudachidict_full`.
The dictionary files are not included in the packages themselves; they are downloaded upon installation.
You can specify the dictionary with the tokenize option `-s`.
$ pip install sudachidict_small
$ echo "外国人参政権" | sudachipy -s small

$ pip install sudachidict_full
$ echo "外国人参政権" | sudachipy -s full
You can specify the dictionary with the `Dictionary()` arguments `config_path` or `dict_type`.
class Dictionary(config_path=None, resource_dir=None, dict_type=None)
- `config_path`
  - You can specify the file path to the setting file with `config_path` (see [Dictionary in The Setting File](#Dictionary in The Setting File) for details).
  - If the dictionary file is specified in the setting file as `systemDict`, SudachiPy will use that dictionary.
- `dict_type`
  - You can also specify the dictionary type with `dict_type`.
  - The available arguments are `small`, `core`, or `full`.
  - If different dictionaries are specified with `config_path` and `dict_type`, the dictionary defined by `dict_type` overrides the one defined in the config path.
from sudachipy import tokenizer
from sudachipy import dictionary

# default: sudachidict_core
tokenizer_obj = dictionary.Dictionary().create()

# The dictionary given by the `systemDict` key in the config file (/path/to/sudachi.json) will be used
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json").create()

# The dictionary specified by `dict_type` will be set.
tokenizer_obj = dictionary.Dictionary(dict_type="core").create()  # sudachidict_core (same as default)
tokenizer_obj = dictionary.Dictionary(dict_type="small").create()  # sudachidict_small
tokenizer_obj = dictionary.Dictionary(dict_type="full").create()  # sudachidict_full

# The dictionary specified by `dict_type` overrides those defined in the config path.
# In the following code, `sudachidict_full` will be used regardless of a dictionary defined in the config file.
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json", dict_type="full").create()
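The precedence rules above can be summarized as a small decision function. This is an illustrative sketch of the stated rules only, not SudachiPy's actual implementation; the function name `effective_dictionary` and its parameters are hypothetical.

```python
from typing import Optional

VALID_DICT_TYPES = ("small", "core", "full")


def effective_dictionary(
    system_dict: Optional[str] = None,
    dict_type: Optional[str] = None,
) -> str:
    """Name the dictionary that would be used under the rules described above.

    system_dict: the `systemDict` value found in the setting file, if any.
    dict_type:   the `dict_type` argument passed to Dictionary(), if any.
    """
    if dict_type is not None:
        if dict_type not in VALID_DICT_TYPES:
            raise ValueError("dict_type must be small, core, or full")
        # dict_type overrides whatever the setting file specifies.
        return "sudachidict_" + dict_type
    if system_dict is not None:
        # Otherwise the dictionary named in the setting file is used.
        return system_dict
    # With neither given, sudachidict_core is the default.
    return "sudachidict_core"
```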
Alternatively, if the dictionary file is specified in the setting file, `sudachi.json`, SudachiPy will use that file.
{ "systemDict" : "relative/path/to/system.dic", ...}
The default setting file is sudachipy/resources/sudachi.json. You can specify your `sudachi.json` with the `-r` option.
$ sudachipy -r path/to/sudachi.json
To use a user dictionary, `user.dic`, place `sudachi.json` anywhere you like, and add a `userDict` value with the relative path from `sudachi.json` to your `user.dic`.
{"userDict" :["relative/path/to/user.dic"], ...}
Then specify your `sudachi.json` with the `-r` option.
$ sudachipy -r path/to/sudachi.json
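Because `userDict` entries are relative to the location of `sudachi.json`, it can help to compute them rather than type them by hand. The sketch below uses only the Python standard library; the function name `with_user_dictionaries` is hypothetical, and it assumes the remaining keys (elided as `...` above) come from an existing settings dictionary, for example a copy of the default setting file.

```python
import json
import os
from typing import Dict, Iterable


def with_user_dictionaries(
    base_settings: Dict[str, object],
    settings_path: str,
    user_dic_paths: Iterable[str],
) -> Dict[str, object]:
    """Return a copy of `base_settings` with `userDict` set.

    Each path in `user_dic_paths` is rewritten relative to the directory
    that will contain `settings_path`.
    """
    settings_dir = os.path.dirname(os.path.abspath(settings_path))
    settings = dict(base_settings)
    settings["userDict"] = [
        os.path.relpath(os.path.abspath(p), settings_dir)
        for p in user_dic_paths
    ]
    return settings


# Possible usage (paths are placeholders):
#
#   with open("default_copy/sudachi.json", encoding="utf-8") as f:
#       base = json.load(f)
#   settings = with_user_dictionaries(base, "conf/sudachi.json", ["dics/user.dic"])
#   with open("conf/sudachi.json", "w", encoding="utf-8") as f:
#       json.dump(settings, f, ensure_ascii=False, indent=2)
```

The resulting file can then be passed to the CLI with `-r`, as shown above.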
You can build a user dictionary with the subcommand `ubuild`.
WARNING: `ubuild` in v0.3.* contains a bug.
$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]

Build User Dictionary

positional arguments:
  file        source files with CSV format (one or more)

optional arguments:
  -h, --help  show this help message and exit
  -d string   description comment to be embedded on dictionary
  -o file     output file (default: user.dic)
  -s file     system dictionary path (default: system core dictionary path)
For the dictionary file format, please refer to this document (written in Japanese; an English version is not available yet).
$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]

Build Sudachi Dictionary

positional arguments:
  file        source files with CSV format (one of more)

optional arguments:
  -h, --help  show this help message and exit
  -o file     output file (default: system.dic)
  -d string   description comment to be embedded on dictionary

required named arguments:
  -m file     connection matrix file with MeCab's matrix.def format
To use your customized `system.dic`, place `sudachi.json` anywhere you like, and overwrite the `systemDict` value with the relative path from `sudachi.json` to your `system.dic`.
{ "systemDict" : "relative/path/to/system.dic", ...}
Then specify your `sudachi.json` with the `-r` option.
$ sudachipy -r path/to/sudachi.json
$ python setup.py build_ext --inplace
Run `scripts/format.sh` to check if your code is formatted correctly.
You need the packages `flake8`, `flake8-import-order`, and `flake8-builtins` (see `requirements.txt`).
Run `scripts/test.sh` to run the tests.
Sudachi and SudachiPy are developed by WAP Tokushima Laboratory of AI and NLP.
Open an issue, or come to our Slack workspace for questions and discussion.
https://sudachi-dev.slack.com/ (Get an invitation here)
Enjoy tokenization!