Sudachi in Rust 🦀 and new generation of SudachiPy

WorksApplications/sudachi.rs

sudachi.rs is a Rust implementation of Sudachi, a Japanese morphological analyzer.

A Japanese README is also available.

A Python implementation is also available: see the SudachiPy Documentation.

TL;DR

Install Python version

pip install --upgrade 'sudachipy>=0.6.10'

or Rust version

$ git clone https://github.com/WorksApplications/sudachi.rs.git
$ cd ./sudachi.rs
$ cargo build --release
$ cargo install --path sudachi-cli/
$ ./fetch_dictionary.sh
$ echo "高輪ゲートウェイ駅" | sudachi
高輪ゲートウェイ駅  名詞,固有名詞,一般,*,*,*    高輪ゲートウェイ駅
EOS

Example

Multi-granular Tokenization

$ echo 選挙管理委員会 | sudachi
選挙管理委員会  名詞,固有名詞,一般,*,*,*        選挙管理委員会
EOS

$ echo 選挙管理委員会 | sudachi --mode A
選挙    名詞,普通名詞,サ変可能,*,*,*    選挙
管理    名詞,普通名詞,サ変可能,*,*,*    管理
委員    名詞,普通名詞,一般,*,*,*        委員
会      名詞,普通名詞,一般,*,*,*        会
EOS

Normalized Form

$ echo 打込む かつ丼 附属 vintage | sudachi
打込む  動詞,一般,*,*,五段-マ行,終止形-一般     打ち込む
        空白,*,*,*,*,*
かつ丼  名詞,普通名詞,一般,*,*,*        カツ丼
        空白,*,*,*,*,*
附属    名詞,普通名詞,サ変可能,*,*,*    付属
        空白,*,*,*,*,*
vintage 名詞,普通名詞,一般,*,*,*        ビンテージ
EOS
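The normalized form is useful for collapsing spelling variants onto a single key. The following is a minimal Python sketch, assuming the default three-column, tab-separated output has been captured as text. The helper name `normalized_forms` is hypothetical and is not part of sudachi.rs.

```python
def normalized_forms(output: str) -> dict[str, str]:
    """Map each surface form to its normalized form.

    Assumes the default sudachi output: surface, POS tags, and normalized
    form separated by tabs, with an "EOS" line ending each sentence.
    """
    mapping = {}
    for line in output.splitlines():
        if line == "EOS" or not line:
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a token line
        surface, _pos, normalized = fields[:3]
        if surface.strip():  # skip whitespace-only tokens
            mapping[surface] = normalized
    return mapping


sample = (
    "打込む\t動詞,一般,*,*,五段-マ行,終止形-一般\t打ち込む\n"
    " \t空白,*,*,*,*,*\t \n"
    "かつ丼\t名詞,普通名詞,一般,*,*,*\tカツ丼\n"
    "EOS\n"
)
print(normalized_forms(sample))  # {'打込む': '打ち込む', 'かつ丼': 'カツ丼'}
```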

Wakati (space-delimited surface form) Output

$ cat lemon.txt
えたいの知れない不吉な塊が私の心を始終圧えつけていた。
焦躁と言おうか、嫌悪と言おうか――酒を飲んだあとに宿酔があるように、酒を毎日飲んでいると宿酔に相当した時期がやって来る。
それが来たのだ。これはちょっといけなかった。

$ sudachi --wakati lemon.txt
えたい の 知れ ない 不吉 な 塊 が 私 の 心 を 始終 圧え つけ て い た 。
焦躁 と 言おう か 、 嫌悪 と 言おう か ― ― 酒 を 飲ん だ あと に 宿酔 が ある よう に 、 酒 を 毎日 飲ん で いる と 宿酔 に 相当 し た 時期 が やっ て 来る 。
それ が 来 た の だ 。 これ は ちょっと いけ なかっ た 。

Setup

You need sudachi.rs, the default plugins, and a dictionary. (This crate does not include a dictionary.)

1. Get the source code

git clone https://github.com/WorksApplications/sudachi.rs.git

2. Download a Sudachi Dictionary

Sudachi requires a dictionary to operate. You can download a dictionary ZIP file from WorksApplications/SudachiDict (choose one from small, core, or full), unzip it, and place the system_*.dic file somewhere. With the default setting file, sudachi.rs assumes that the dictionary is placed at resources/system.dic.
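The manual placement step can be scripted. The following is a minimal Python sketch, assuming the ZIP file has already been unpacked into a directory containing exactly one system_*.dic file. The function name and the example directory names are hypothetical.

```python
import shutil
from pathlib import Path


def install_system_dictionary(unzipped_dir: Path, repo_dir: Path) -> Path:
    """Copy the unzipped system_*.dic file to <repo_dir>/resources/system.dic."""
    candidates = sorted(unzipped_dir.rglob("system_*.dic"))
    if len(candidates) != 1:
        raise FileNotFoundError(
            f"expected exactly one system_*.dic under {unzipped_dir}, "
            f"found {len(candidates)}"
        )
    target = repo_dir / "resources" / "system.dic"
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(candidates[0], target)  # overwrites an existing file
    return target
```

For example, `install_system_dictionary(Path("unzipped-dict"), Path("sudachi.rs"))` would copy the dictionary found under the (hypothetical) `unzipped-dict` directory to `sudachi.rs/resources/system.dic`.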

Convenience Script

Optionally, you can use the fetch_dictionary.sh shell script to download a dictionary and install it to resources/system.dic (overwriting any existing file).

# fetch latest core dictionary
./fetch_dictionary.sh

# fetch dictionary of specified version and type
./fetch_dictionary.sh 20241021 small

3. Build

cargo build --release

Build (bake dictionary into binary)

This feature is currently not implemented and does not work; see #35.

Specify the bake_dictionary feature to embed a dictionary into the binary. The sudachi executable will contain the dictionary binary. The baked dictionary is used if no dictionary is specified via a CLI option or the setting file.

When building, you must specify the path to the dictionary file in the SUDACHI_DICT_PATH environment variable. SUDACHI_DICT_PATH is either absolute or relative to the sudachi.rs directory.

Example on Unix-like system:

# Download dictionary to resources/system.dic
$ ./fetch_dictionary.sh

# Build with bake_dictionary feature (relative path)
$ env SUDACHI_DICT_PATH=resources/system.dic cargo build --release --features bake_dictionary

# or

# Build with bake_dictionary feature (absolute path)
$ env SUDACHI_DICT_PATH=/path/to/my-sudachi.dic cargo build --release --features bake_dictionary

4. Install

$ cd sudachi.rs/
$ cargo install --path sudachi-cli/

$ which sudachi
/Users/<USER>/.cargo/bin/sudachi

$ sudachi -h
sudachi 0.6.0
A Japanese tokenizer
...

Usage as a command

$ sudachi -h
A Japanese tokenizer

Usage: sudachi [OPTIONS] [FILE] [COMMAND]

Commands:
  build   Builds system dictionary
  ubuild  Builds user dictionary
  dump
  help    Print this message or the help of the given subcommand(s)

Arguments:
  [FILE]
          Input text file: If not present, read from STDIN

Options:
  -r, --config-file <CONFIG_FILE>
          Path to the setting file in JSON format
  -p, --resource_dir <RESOURCE_DIR>
          Path to the root directory of resources
  -m, --mode <MODE>
          Split unit: "A" (short), "B" (middle), or "C" (Named Entity) [default: C]
  -o, --output <OUTPUT_FILE>
          Output text file: If not present, use stdout
  -a, --all
          Prints all fields
  -w, --wakati
          Outputs only surface form
  -d, --debug
          Debug mode: Print the debug information
  -l, --dict <DICTIONARY_PATH>
          Path to sudachi dictionary. If None, it refer config and then baked dictionary
      --split-sentences <SPLIT_SENTENCES>
          How to split sentences [default: yes]
  -h, --help
          Print help (see more with '--help')
  -V, --version
          Print version
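When calling the command from another program, the options above can be assembled programmatically. The following is a minimal Python sketch that uses only flags listed in the help text. The function names `sudachi_command` and `run_sudachi` are hypothetical wrappers, and `run_sudachi` assumes that `sudachi` is on PATH and that a dictionary is configured.

```python
import subprocess


def sudachi_command(mode="C", wakati=False, all_fields=False,
                    dict_path=None, input_file=None):
    """Build an argv list for the sudachi CLI from options in `sudachi -h`."""
    if mode not in ("A", "B", "C"):
        raise ValueError(f"mode must be 'A', 'B', or 'C', got {mode!r}")
    argv = ["sudachi", "--mode", mode]
    if wakati:
        argv.append("--wakati")
    if all_fields:
        argv.append("--all")
    if dict_path is not None:
        argv += ["--dict", str(dict_path)]
    if input_file is not None:
        argv.append(str(input_file))  # omitted: sudachi reads STDIN
    return argv


def run_sudachi(text, **options):
    """Pipe `text` to sudachi on STDIN and return what it writes to stdout."""
    result = subprocess.run(
        sudachi_command(**options),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

For instance, `run_sudachi("選挙管理委員会", mode="A")` would run the equivalent of `echo 選挙管理委員会 | sudachi --mode A`.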

Output

Columns are tab separated.

  • Surface
  • Part-of-Speech Tags (comma separated)
  • Normalized Form

When you add the -a (--all) flag, it additionally outputs

  • Dictionary Form
  • Reading Form
  • Dictionary ID
    • 0 for the system dictionary
    • 1 and above for the user dictionaries
    • -1 if a word is Out-of-Vocabulary (not in the dictionary)
  • Synonym group IDs
  • (OOV) if a word is Out-of-Vocabulary (not in the dictionary)

$ echo "外国人参政権" | sudachi -a
外国人参政権    名詞,普通名詞,一般,*,*,*        外国人参政権    外国人参政権    ガイコクジンサンセイケン      0       []
EOS

$ echo "阿quei" | sudachipy -a
阿      名詞,普通名詞,一般,*,*,*        阿      阿              -1      []      (OOV)
quei    名詞,普通名詞,一般,*,*,*        quei    quei            -1      []      (OOV)
EOS
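The column layout above maps naturally onto a small record type. The following is a minimal Python sketch of a parser for `-a` output, assuming that every column (including the trailing `(OOV)` marker) is separated by a tab and that each sentence ends with an `EOS` line. The `Token` class and `parse_all_fields` function are hypothetical names. The synonym group IDs are kept as the raw string because their exact list formatting is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Token:
    surface: str
    pos: list[str]
    normalized_form: str
    dictionary_form: str
    reading_form: str
    dictionary_id: int       # 0: system, 1 and above: user, -1: OOV
    synonym_group_ids: str   # raw column text, e.g. "[]"
    is_oov: bool


def parse_all_fields(output: str) -> list[list[Token]]:
    """Parse `sudachi -a` output into a list of sentences of tokens."""
    sentences, current = [], []
    for line in output.splitlines():
        if line == "EOS":
            sentences.append(current)
            current = []
            continue
        fields = line.split("\t")
        if len(fields) < 7:
            continue  # not a full token line
        current.append(Token(
            surface=fields[0],
            pos=fields[1].split(","),
            normalized_form=fields[2],
            dictionary_form=fields[3],
            reading_form=fields[4],
            dictionary_id=int(fields[5]),
            synonym_group_ids=fields[6],
            is_oov=len(fields) > 7 and fields[7] == "(OOV)",
        ))
    return sentences
```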

When you add the -w (--wakati) flag, it outputs space-delimited surface forms instead.

$ echo "外国人参政権" | sudachi -m A -w
外国 人 参政 権
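Wakati output is convenient for simple downstream processing such as frequency counts. The following is a minimal Python sketch, assuming the output has been captured as text with one line per sentence or input line. The function names are hypothetical. Note that `str.split()` discards tokens that consist only of whitespace.

```python
from collections import Counter


def wakati_tokens(output: str) -> list[list[str]]:
    """Split --wakati output into one token list per non-empty line."""
    return [line.split() for line in output.splitlines() if line.strip()]


def surface_counts(output: str) -> Counter:
    """Count how often each surface form occurs across all lines."""
    return Counter(tok for line in wakati_tokens(output) for tok in line)


sample = "外国 人 参政 権\n"
print(wakati_tokens(sample))  # [['外国', '人', '参政', '権']]
```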

API

See the API reference page.

ToDo

  • Out of Vocabulary handling
  • Easy dictionary file install & management, similar to SudachiPy
  • Registration to crates.io
