daac-tools/python-vaporetto

🐍 python-vaporetto 🛥

Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.

Installation

Install pre-built package from PyPI

Run the following command:

$ pip install vaporetto

Build from source

You need to install the Rust compiler beforehand, following the documentation. vaporetto uses pyproject.toml, so you also need to upgrade pip to version 19 or later.

$ pip install --upgrade pip

After setting up the environment, you can install vaporetto as follows:

$ pip install git+https://github.com/daac-tools/python-vaporetto

Example Usage

python-vaporetto does not contain model files. To perform tokenization, follow the documentation of Vaporetto to download distribution models or train your own models beforehand.

Check the version number as shown below to use compatible models:

>>> import vaporetto
>>> vaporetto.VAPORETTO_VERSION
'0.6.5'
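Because models are tied to a Vaporetto version, a loading script might guard itself with a check along these lines. This is only a sketch: the helper names `parse_version` and `same_minor_series` are hypothetical, and the rule that matching major and minor numbers means "compatible" is an assumption made for illustration, not something stated by Vaporetto. Consult the Vaporetto documentation for the actual compatibility rules.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Split a dotted version string such as '0.6.5' into integers."""
    return tuple(int(part) for part in version.split("."))


def same_minor_series(runtime: str, expected: str) -> bool:
    """Return True when both versions share their major and minor numbers.

    ASSUMPTION: treating 'same major.minor' as compatible is a hypothetical
    rule used for illustration only.
    """
    return parse_version(runtime)[:2] == parse_version(expected)[:2]


# In real use, pass vaporetto.VAPORETTO_VERSION as the first argument.
print(same_minor_series("0.6.5", "0.6.0"))
print(same_minor_series("0.6.5", "0.5.4"))
```

Plain strings are used here so the snippet runs without the vaporetto package installed.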

Examples:

# Import vaporetto module
>>> import vaporetto

# Load the model file
>>> with open('tests/data/vaporetto.model', 'rb') as fp:
...     model = fp.read()

# Create an instance of the Vaporetto
>>> tokenizer = vaporetto.Vaporetto(model, predict_tags=True)

# Tokenize
>>> tokenizer.tokenize_to_string('まぁ社長は火星猫だ')
'まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ'

>>> tokens = tokenizer.tokenize('まぁ社長は火星猫だ')
>>> len(tokens)
6
>>> tokens[0].surface()
'まぁ'
>>> tokens[0].tag(0)
'名詞'
>>> tokens[0].tag(1)
'マー'
>>> [token.surface() for token in tokens]
['まぁ', '社長', 'は', '火星', '猫', 'だ']
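If only the string produced by tokenize_to_string is available (for example, from a log), it can be split back into surfaces and tags. The following is a minimal sketch, assuming the format seen in the example output: tokens separated by single spaces, and surface and tags separated by '/'. The helper name `parse_tokenized` is hypothetical, and the approach breaks if a surface or tag itself contains '/' or a space; the token objects returned by tokenize() are the more robust route.

```python
from typing import List, Tuple


def parse_tokenized(line: str) -> List[Tuple[str, List[str]]]:
    """Turn 'surface/tag/tag surface/tag ...' into (surface, tags) pairs.

    ASSUMPTION: surfaces and tags contain no '/' or space characters.
    """
    result = []
    for chunk in line.split(" "):
        if not chunk:
            continue  # ignore empty pieces caused by repeated spaces
        surface, *tags = chunk.split("/")
        result.append((surface, tags))
    return result


# Output string copied from the tokenize_to_string example.
output = "まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ"
pairs = parse_tokenized(output)
print([surface for surface, _ in pairs])
```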

Note for distributed models

The distributed models are compressed in zstd format. If you want to load these compressed models, you must decompress them outside the API.

>>> import vaporetto
>>> import zstandard  # zstandard package in PyPI

>>> dctx = zstandard.ZstdDecompressor()
>>> with open('tests/data/vaporetto.model.zst', 'rb') as fp:
...     with dctx.stream_reader(fp) as dict_reader:
...         tokenizer = vaporetto.Vaporetto(dict_reader.read(), predict_tags=True)
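When a script has to accept both compressed and uncompressed model files, the decompression step can be made conditional. The sketch below is one possible approach, not part of the vaporetto API: the helper name `read_model_bytes` is hypothetical, and it decides by looking at the four-byte zstd frame magic number rather than the file extension. The zstandard package is imported only when a compressed file is actually encountered.

```python
import os
import tempfile

ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # first four bytes of a zstd frame


def read_model_bytes(path: str) -> bytes:
    """Read a model file, decompressing it first if it starts with a zstd frame."""
    with open(path, "rb") as fp:
        is_zstd = fp.read(4) == ZSTD_MAGIC
        fp.seek(0)
        if not is_zstd:
            return fp.read()
        import zstandard  # third-party; only needed for compressed models

        with zstandard.ZstdDecompressor().stream_reader(fp) as reader:
            return reader.read()


# Demonstration with an uncompressed placeholder file (not a real model).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.model")
    with open(demo_path, "wb") as fp:
        fp.write(b"not a real model")
    data = read_model_bytes(demo_path)
    print(data)
```

The returned bytes could then be passed to vaporetto.Vaporetto in place of dict_reader.read().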

Note for KyTea's models

You can also use KyTea's models as follows:

>>> with open('path/to/jp-0.4.7-5.mod', 'rb') as fp:  # doctest: +SKIP
...     tokenizer = vaporetto.Vaporetto.create_from_kytea_model(fp.read())

Note: Vaporetto does not support tag prediction with KyTea's models.

Speed Comparison

License

Licensed under either of

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contribution

See the guidelines.

Releases

0.3.2 (latest), Jun 1, 2025; 7 releases in total.
