Sudachi in Rust 🦀 and new generation of SudachiPy
WorksApplications/sudachi.rs
sudachi.rs is a Rust implementation of Sudachi, a Japanese morphological analyzer.
A Python implementation is also available: SudachiPy Documentation.
Install Python version
pip install --upgrade 'sudachipy>=0.6.10'
or Rust version
$ git clone https://github.com/WorksApplications/sudachi.rs.git
$ cd ./sudachi.rs
$ cargo build --release
$ cargo install --path sudachi-cli/
$ ./fetch_dictionary.sh
$ echo "高輪ゲートウェイ駅" | sudachi
高輪ゲートウェイ駅	名詞,固有名詞,一般,*,*,*	高輪ゲートウェイ駅
EOS
Multi-granular Tokenization
$ echo 選挙管理委員会 | sudachi
選挙管理委員会	名詞,固有名詞,一般,*,*,*	選挙管理委員会
EOS

$ echo 選挙管理委員会 | sudachi --mode A
選挙	名詞,普通名詞,サ変可能,*,*,*	選挙
管理	名詞,普通名詞,サ変可能,*,*,*	管理
委員	名詞,普通名詞,一般,*,*,*	委員
会	名詞,普通名詞,一般,*,*,*	会
EOS
Normalized Form
$ echo 打込む かつ丼 附属 vintage | sudachi
打込む	動詞,一般,*,*,五段-マ行,終止形-一般	打ち込む
 	空白,*,*,*,*,*	 
かつ丼	名詞,普通名詞,一般,*,*,*	カツ丼
 	空白,*,*,*,*,*	 
附属	名詞,普通名詞,サ変可能,*,*,*	付属
 	空白,*,*,*,*,*	 
vintage	名詞,普通名詞,一般,*,*,*	ビンテージ
EOS
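The third column can be used to fold spelling variants into one form. The following is a minimal sketch in Python, assuming the default three-column, tab-separated output shown above has already been captured as a string. `normalize_text` is a hypothetical helper written for illustration; it is not part of sudachi.rs or SudachiPy.

```python
def normalize_text(sudachi_output: str) -> str:
    """Join the normalized forms (third column) of sudachi's default output.

    Assumes one morpheme per line, tab-separated columns, and an "EOS"
    line closing each sentence, as in the example output above.
    """
    parts = []
    for line in sudachi_output.splitlines():
        if line == "EOS":
            continue  # sentence delimiter, carries no morpheme
        columns = line.split("\t")
        if len(columns) < 3:
            continue  # skip anything that is not a morpheme line
        parts.append(columns[2])
    return "".join(parts)


# Sample lines copied from the example above (whitespace tokens are
# assumed to have a single space as surface and normalized form).
sample = "\n".join([
    "打込む\t動詞,一般,*,*,五段-マ行,終止形-一般\t打ち込む",
    " \t空白,*,*,*,*,*\t ",
    "かつ丼\t名詞,普通名詞,一般,*,*,*\tカツ丼",
    "EOS",
])
print(normalize_text(sample))  # 打ち込む カツ丼
```

Because whitespace is emitted as its own token, simply concatenating the third column keeps the original spacing.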
Wakati (space-delimited surface form) Output
$ cat lemon.txt
えたいの知れない不吉な塊が私の心を始終圧えつけていた。焦躁と言おうか、嫌悪と言おうか――酒を飲んだあとに宿酔があるように、酒を毎日飲んでいると宿酔に相当した時期がやって来る。それが来たのだ。これはちょっといけなかった。

$ sudachi --wakati lemon.txt
えたい の 知れ ない 不吉 な 塊 が 私 の 心 を 始終 圧え つけ て い た 。
焦躁 と 言おう か 、 嫌悪 と 言おう か ― ― 酒 を 飲ん だ あと に 宿酔 が ある よう に 、 酒 を 毎日 飲ん で いる と 宿酔 に 相当 し た 時期 が やっ て 来る 。
それ が 来 た の だ 。 これ は ちょっと いけ なかっ た 。
You need sudachi.rs, the default plugins, and a dictionary. (This crate does not include a dictionary.)
git clone https://github.com/WorksApplications/sudachi.rs.git
Sudachi requires a dictionary to operate. You can download a dictionary ZIP file from WorksApplications/SudachiDict (choose one from `small`, `core`, or `full`), unzip it, and place the `system_*.dic` file somewhere. With the default setting file, sudachi.rs assumes that it is placed at `resources/system.dic`.
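The manual steps (unzip, then place the `system_*.dic` file) can be sketched in Python as follows. This is only an illustration using the standard library: `install_dictionary` is a hypothetical helper, and the internal layout of the ZIP is an assumption, so the file is located purely by its name.

```python
import shutil
import zipfile
from fnmatch import fnmatch
from pathlib import Path, PurePosixPath


def install_dictionary(zip_path, dest="resources/system.dic"):
    """Copy the first system_*.dic found in a SudachiDict ZIP to `dest`.

    The archive layout is an assumption: the file is matched by name
    only, wherever it sits inside the ZIP.
    """
    dest = Path(dest)
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if fnmatch(PurePosixPath(name).name, "system_*.dic"):
                dest.parent.mkdir(parents=True, exist_ok=True)
                # Stream the member to disk; an existing file is overwritten.
                with archive.open(name) as src, open(dest, "wb") as out:
                    shutil.copyfileobj(src, out)
                return dest
    raise FileNotFoundError(f"no system_*.dic in {zip_path}")
```

Calling `install_dictionary("path/to/downloaded.zip")` from the sudachi.rs directory would then leave the dictionary at the default location.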
Optionally, you can use the `fetch_dictionary.sh` shell script to download a dictionary and install it to `resources/system.dic` (an existing file is overwritten).
# fetch latest core dictionary
./fetch_dictionary.sh

# fetch dictionary of specified version and type
./fetch_dictionary.sh 20241021 small
cargo build --release
This feature is currently unimplemented and does not work; see #35.
Specify the `bake_dictionary` feature to embed a dictionary into the binary. The `sudachi` executable will contain the dictionary binary. The baked dictionary will be used if none is specified via a CLI option or the setting file.
You must specify the path to the dictionary file in the `SUDACHI_DICT_PATH` environment variable when building. `SUDACHI_DICT_PATH` is relative to the sudachi.rs directory (or absolute).
Example on a Unix-like system:
# Download dictionary to resources/system.dic
$ ./fetch_dictionary.sh

# Build with bake_dictionary feature (relative path)
$ env SUDACHI_DICT_PATH=resources/system.dic cargo build --release --features bake_dictionary

# or

# Build with bake_dictionary feature (absolute path)
$ env SUDACHI_DICT_PATH=/path/to/my-sudachi.dic cargo build --release --features bake_dictionary
$ cd sudachi.rs/
$ cargo install --path sudachi-cli/

$ which sudachi
/Users/<USER>/.cargo/bin/sudachi

$ sudachi -h
sudachi 0.6.0
A Japanese tokenizer
...
$ sudachi -h
A Japanese tokenizer

Usage: sudachi [OPTIONS] [FILE] [COMMAND]

Commands:
  build   Builds system dictionary
  ubuild  Builds user dictionary
  dump
  help    Print this message or the help of the given subcommand(s)

Arguments:
  [FILE]  Input text file: If not present, read from STDIN

Options:
  -r, --config-file <CONFIG_FILE>
          Path to the setting file in JSON format
  -p, --resource_dir <RESOURCE_DIR>
          Path to the root directory of resources
  -m, --mode <MODE>
          Split unit: "A" (short), "B" (middle), or "C" (Named Entity) [default: C]
  -o, --output <OUTPUT_FILE>
          Output text file: If not present, use stdout
  -a, --all
          Prints all fields
  -w, --wakati
          Outputs only surface form
  -d, --debug
          Debug mode: Print the debug information
  -l, --dict <DICTIONARY_PATH>
          Path to sudachi dictionary. If None, it refer config and then baked dictionary
      --split-sentences <SPLIT_SENTENCES>
          How to split sentences [default: yes]
  -h, --help
          Print help (see more with '--help')
  -V, --version
          Print version
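When driving the CLI from a script, the options above map naturally onto an argument list. The sketch below is one possible Python wrapper, not an official interface: `build_sudachi_command` and `run_sudachi` are hypothetical names, and only flags that appear in the help text are used. Running it requires the `sudachi` executable on `PATH` and a usable dictionary.

```python
import subprocess


def build_sudachi_command(input_file=None, mode="C", wakati=False,
                          all_fields=False, dict_path=None, config_file=None):
    """Assemble an argument list from the options listed in `sudachi -h`."""
    if mode not in ("A", "B", "C"):
        raise ValueError(f"mode must be 'A', 'B', or 'C', got {mode!r}")
    command = ["sudachi", "--mode", mode]
    if wakati:
        command.append("--wakati")
    if all_fields:
        command.append("--all")
    if dict_path is not None:
        command += ["--dict", str(dict_path)]
    if config_file is not None:
        command += ["--config-file", str(config_file)]
    if input_file is not None:
        command.append(str(input_file))  # omitted: sudachi reads STDIN
    return command


def run_sudachi(text, **options):
    """Feed `text` on STDIN and return sudachi's standard output."""
    result = subprocess.run(
        build_sudachi_command(**options),
        input=text, capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout


print(build_sudachi_command(mode="A", wakati=True))
# ['sudachi', '--mode', 'A', '--wakati']
```

Building the list separately from running it keeps the argument handling easy to check without the executable installed.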
Columns are tab separated.
- Surface
- Part-of-Speech Tags (comma separated)
- Normalized Form
When you add the `-a` (`--all`) flag, it additionally outputs
- Dictionary Form
- Reading Form
- Dictionary ID
  - `0` for the system dictionary
  - `1` and above for the user dictionaries
  - `-1` if a word is Out-of-Vocabulary (not in the dictionary)
- Synonym group IDs
- `(OOV)` if a word is Out-of-Vocabulary (not in the dictionary)
$ echo "外国人参政権" | sudachi -a
外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権	外国人参政権	ガイコクジンサンセイケン	0	[]
EOS

echo "阿quei" | sudachipy -a
阿	名詞,普通名詞,一般,*,*,*	阿	阿		-1	[]	(OOV)
quei	名詞,普通名詞,一般,*,*,*	quei	quei		-1	[]	(OOV)
EOS
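The column layout described above can be read back into structured records. The following Python sketch assumes the columns appear in exactly the listed order and that `EOS` closes each sentence; `Morpheme` and `parse_output` are hypothetical names for illustration and are unrelated to the SudachiPy API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Morpheme:
    surface: str
    pos: List[str]
    normalized_form: str
    # The fields below are only filled for -a / --all output.
    dictionary_form: Optional[str] = None
    reading_form: Optional[str] = None
    dictionary_id: Optional[int] = None
    synonym_group_ids: Optional[str] = None  # kept as raw text, e.g. "[]"
    is_oov: bool = False


def parse_output(text: str) -> List[List[Morpheme]]:
    """Split sudachi output into sentences (closed by EOS) of morphemes."""
    sentences, current = [], []
    for line in text.splitlines():
        if line == "EOS":
            sentences.append(current)
            current = []
            continue
        cols = line.split("\t")
        if len(cols) < 3:
            continue  # not a morpheme line
        morpheme = Morpheme(cols[0], cols[1].split(","), cols[2])
        if len(cols) >= 7:  # assumed column order of the --all output
            morpheme.dictionary_form = cols[3]
            morpheme.reading_form = cols[4]
            morpheme.dictionary_id = int(cols[5])
            morpheme.synonym_group_ids = cols[6]
            morpheme.is_oov = len(cols) > 7 and cols[7] == "(OOV)"
        current.append(morpheme)
    return sentences


sample = (
    "外国人参政権\t名詞,普通名詞,一般,*,*,*\t外国人参政権\t外国人参政権\t"
    "ガイコクジンサンセイケン\t0\t[]\nEOS\n"
)
morpheme = parse_output(sample)[0][0]
print(morpheme.reading_form, morpheme.dictionary_id, morpheme.is_oov)
# ガイコクジンサンセイケン 0 False
```

The same function also accepts the default three-column output, in which case the optional fields stay `None`.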
When you add the `-w` (`--wakati`) flag, it outputs space-delimited surface forms instead.
$ echo "外国人参政権" | sudachi -m A -w
外国 人 参政 権
- Out of Vocabulary handling
- Easy dictionary file install & management, similar to SudachiPy
- Registration to crates.io
- agatan/yoin: A Japanese Morphological Analyzer written in pure Rust
- wareya/notmecab-rs: notmecab-rs is a very basic mecab clone, designed only to do parsing, not training.
- Sudachi Logo
- Crab illustration: Pixabay