🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
daac-tools/python-vaporetto
To install vaporetto from PyPI, run the following command:
$ pip install vaporetto
To build from source instead, you need to install the Rust compiler following the documentation beforehand. vaporetto uses pyproject.toml, so you also need to upgrade pip to version 19 or later.
$ pip install --upgrade pip
After setting up the environment, you can install vaporetto as follows:
$ pip install git+https://github.com/daac-tools/python-vaporetto
python-vaporetto does not contain model files. To perform tokenization, follow the Vaporetto documentation to download distribution models or train your own models beforehand.
Check the version number as shown below to use compatible models:
>>> import vaporetto
>>> vaporetto.VAPORETTO_VERSION
'0.6.5'
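The version string can also be compared in code before a model is loaded. The sketch below is only an illustration: the helper name `same_major_minor` is hypothetical, and the rule it applies (a model is usable when the major and minor numbers match) is an assumption, not something this README states. Check the Vaporetto documentation for the actual compatibility policy.

```python
def same_major_minor(installed: str, expected: str) -> bool:
    """Return True when two 'X.Y.Z' version strings share major and minor parts.

    Assumption: matching major.minor is treated as "compatible" here;
    the real rule may differ.
    """
    return installed.split(".")[:2] == expected.split(".")[:2]


# With the package installed, the check might be used like this:
#   import vaporetto
#   same_major_minor(vaporetto.VAPORETTO_VERSION, "0.6.5")
print(same_major_minor("0.6.5", "0.6.0"))
print(same_major_minor("0.6.5", "0.5.4"))
```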
Examples:
# Import vaporetto module
>>> import vaporetto

# Load the model file
>>> with open('tests/data/vaporetto.model', 'rb') as fp:
...     model = fp.read()

# Create an instance of the Vaporetto
>>> tokenizer = vaporetto.Vaporetto(model, predict_tags=True)

# Tokenize
>>> tokenizer.tokenize_to_string('まぁ社長は火星猫だ')
'まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ'

>>> tokens = tokenizer.tokenize('まぁ社長は火星猫だ')
>>> len(tokens)
6
>>> tokens[0].surface()
'まぁ'
>>> tokens[0].tag(0)
'名詞'
>>> tokens[0].tag(1)
'マー'

>>> [token.surface() for token in tokens]
['まぁ', '社長', 'は', '火星', '猫', 'だ']
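If only the string form is at hand, it can be split back into surfaces and tags. The following is a minimal sketch, not part of the vaporetto API: `parse_tagged` is a hypothetical helper, and it assumes that tokens are separated by a single space and that fields within a token are separated by `/`, as in the output shown above. It will misparse text whose surfaces or tags themselves contain those characters, in which case `tokenize()` is the safer route.

```python
from typing import List, Tuple


def parse_tagged(line: str) -> List[Tuple[str, List[str]]]:
    """Split 'surface/tag/tag surface/tag/tag ...' into (surface, tags) pairs."""
    result = []
    for chunk in line.split(" "):
        if not chunk:
            continue  # skip empty pieces produced by repeated spaces
        surface, *tags = chunk.split("/")
        result.append((surface, tags))
    return result


# The string below is the tokenize_to_string() output from the example above.
line = 'まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ'
pairs = parse_tagged(line)
surfaces = [surface for surface, _ in pairs]
print(surfaces)
```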
The distributed models are compressed in zstd format. If you want to load these compressed models, you must decompress them outside the API.
>>> import vaporetto
>>> import zstandard  # zstandard package in PyPI

>>> dctx = zstandard.ZstdDecompressor()
>>> with open('tests/data/vaporetto.model.zst', 'rb') as fp:
...     with dctx.stream_reader(fp) as dict_reader:
...         tokenizer = vaporetto.Vaporetto(dict_reader.read(), predict_tags=True)
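When a script has to accept both raw and compressed model files, the decompression step can be wrapped in a small loader. This is a sketch under stated assumptions: `is_zstd` and `read_model_bytes` are hypothetical helper names, and detection relies on the standard zstd frame magic number (bytes `28 B5 2F FD`) appearing at the start of the file. The `zstandard` import is deferred so that uncompressed models load without the package installed.

```python
import io

# Magic number that starts every standard zstd frame.
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def is_zstd(data: bytes) -> bool:
    """Return True when the data begins with the zstd frame magic number."""
    return data[:4] == ZSTD_MAGIC


def read_model_bytes(path: str) -> bytes:
    """Read a model file, decompressing it first if it looks like zstd."""
    with open(path, "rb") as fp:
        data = fp.read()
    if not is_zstd(data):
        return data  # assume the file is already an uncompressed model
    import zstandard  # zstandard package in PyPI, needed only for .zst input

    dctx = zstandard.ZstdDecompressor()
    with dctx.stream_reader(io.BytesIO(data)) as reader:
        return reader.read()


# The result can then be passed to the constructor shown above, e.g.:
#   tokenizer = vaporetto.Vaporetto(read_model_bytes(path), predict_tags=True)
```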
You can also use KyTea's models as follows:
>>> with open('path/to/jp-0.4.7-5.mod', 'rb') as fp:  # doctest: +SKIP
...     tokenizer = vaporetto.Vaporetto.create_from_kytea_model(fp.read())
Note: Vaporetto does not support tag prediction with KyTea's models.
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
See the guidelines.