Viterbi-based accelerated tokenizer (Python wrapper)
daac-tools/python-vibrato
Vibrato is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm. This is a Python wrapper for Vibrato.
Run the following command:
$ pip install vibrato
To build from source, you need to install the Rust compiler beforehand, following the documentation. vibrato uses pyproject.toml, so you also need to upgrade pip to version 19 or later.
$ pip install --upgrade pip
After setting up the environment, you can install vibrato as follows:
$ pip install git+https://github.com/daac-tools/python-vibrato
python-vibrato does not contain model files. To perform tokenization, follow the Vibrato documentation to download distributed models or train your own models beforehand.
Check the version number as shown below to use compatible models:
>>> import vibrato
>>> vibrato.VIBRATO_VERSION
'0.5.1'
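The version check can also be done in code. The following is a minimal sketch, not part of python-vibrato: `parse_version` and `same_release_series` are hypothetical helpers, and the rule "same major and minor number" is only an illustrative assumption. Consult the Vibrato documentation for which model versions actually work with a given library version.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '0.5.1' into (0, 5, 1)."""
    return tuple(int(part) for part in version.split("."))


def same_release_series(library_version: str, model_version: str) -> bool:
    """Return True when both versions share the same major and minor numbers.

    Assumption: this rule is illustrative only and is not taken from the
    Vibrato documentation.
    """
    return parse_version(library_version)[:2] == parse_version(model_version)[:2]
```

With vibrato installed, the first argument would be `vibrato.VIBRATO_VERSION`, and the second the version the model was built for.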
Examples:
>>> import vibrato
>>> with open('tests/data/system.dic', 'rb') as fp:
...     tokenizer = vibrato.Vibrato(fp.read())
>>> tokens = tokenizer.tokenize('社長は火星猫だ')
>>> len(tokens)
5
>>> tokens[0]
Token { surface: "社長", feature: "名詞,普通名詞,一般,*" }
>>> tokens[0].surface()
'社長'
>>> tokens[0].feature()
'名詞,普通名詞,一般,*'
>>> tokens[0].start()
0
>>> tokens[0].end()
2
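For downstream processing it can be convenient to copy tokens into plain Python data. The sketch below uses only the accessors shown in the example (`surface()`, `feature()`, `start()`, `end()`, `len(tokens)`, and indexing). The helper names `token_to_record` and `tokens_to_records` are hypothetical, and splitting the feature string on commas assumes that no field contains a quoted comma.

```python
from typing import Any, Dict, List


def token_to_record(token: Any) -> Dict[str, Any]:
    """Copy one token into a plain dict using the accessors shown above."""
    return {
        "surface": token.surface(),
        # Assumption: feature fields are comma-separated with no quoted commas.
        "features": token.feature().split(","),
        "start": token.start(),
        "end": token.end(),
    }


def tokens_to_records(tokens: Any) -> List[Dict[str, Any]]:
    """Convert the result of tokenizer.tokenize(text) into a list of dicts.

    Only len() and integer indexing are used, because those are the
    operations the example demonstrates.
    """
    return [token_to_record(tokens[i]) for i in range(len(tokens))]
```

Calling `tokens_to_records(tokenizer.tokenize('社長は火星猫だ'))` would then give one dict per token. In the example, the first token has `start` 0 and `end` 2 for the two-character surface `社長`, which suggests that positions are counted in characters.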
The distributed models are compressed in zstd format. If you want to load these compressed models, you must decompress them outside the API.
>>> import vibrato
>>> import zstandard  # zstandard package in PyPI
>>> dctx = zstandard.ZstdDecompressor()
>>> with open('tests/data/system.dic.zst', 'rb') as fp:
...     with dctx.stream_reader(fp) as dict_reader:
...         tokenizer = vibrato.Vibrato(dict_reader.read())
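If the same code has to handle both compressed and uncompressed dictionaries, one possible approach is to inspect the first four bytes of the file for the Zstandard frame magic number. This is a sketch under that assumption, not part of python-vibrato: `is_zstd` and `read_dictionary_bytes` are hypothetical helper names, and files that begin with a skippable frame are not detected.

```python
# First four bytes of a Zstandard frame (magic number 0xFD2FB528, little-endian).
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def is_zstd(head: bytes) -> bool:
    """Return True if the given bytes start with the Zstandard frame magic."""
    return head[:4] == ZSTD_MAGIC


def read_dictionary_bytes(path: str) -> bytes:
    """Return raw dictionary bytes, decompressing first if the file is zstd."""
    with open(path, "rb") as fp:
        head = fp.read(4)
        fp.seek(0)  # rewind so the full file is read below
        if not is_zstd(head):
            return fp.read()
        # Third-party zstandard package from PyPI, imported only when needed.
        import zstandard

        with zstandard.ZstdDecompressor().stream_reader(fp) as dict_reader:
            return dict_reader.read()
```

The returned bytes can be passed straight to the constructor, for example `vibrato.Vibrato(read_dictionary_bytes('tests/data/system.dic.zst'))`.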
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.