daac-tools/python-vaporetto


Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.


Installation

Install pre-built package from PyPI

Run the following command:

$ pip install vaporetto

Build from source

You need to install the Rust compiler beforehand by following the documentation. vaporetto uses pyproject.toml, so you also need to upgrade pip to version 19 or later.

$ pip install --upgrade pip

After setting up the environment, you can install vaporetto as follows:

$ pip install git+https://github.com/daac-tools/python-vaporetto

Example Usage

python-vaporetto does not contain model files. To perform tokenization, follow the Vaporetto documentation to download distribution models or train your own models beforehand.

Check the version number as shown below to use compatible models:

>>> import vaporetto
>>> vaporetto.VAPORETTO_VERSION
'0.6.5'
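If you want to automate this check, a minimal sketch follows. The helper names `parse_version` and `same_release_series` are hypothetical and not part of python-vaporetto. The rule that a model is usable when the first two version components match is an assumption made for illustration only; consult the Vaporetto documentation for the actual compatibility policy.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Split a dotted version string such as '0.6.5' into integers.

    Only plain numeric components are handled; suffixes such as
    '-rc1' would raise ValueError.
    """
    return tuple(int(part) for part in version.split("."))


def same_release_series(runtime: str, expected: str, parts: int = 2) -> bool:
    """Return True when the first `parts` components of both versions match.

    ASSUMPTION: comparing the leading two components is an illustrative
    rule, not a documented guarantee of Vaporetto.
    """
    return parse_version(runtime)[:parts] == parse_version(expected)[:parts]


if __name__ == "__main__":
    # In real use, pass vaporetto.VAPORETTO_VERSION as the first argument.
    print(same_release_series("0.6.5", "0.6.0"))
```

Pass `vaporetto.VAPORETTO_VERSION` as `runtime` and the version your model was built for as `expected`.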

Examples:

# Import vaporetto module
>>> import vaporetto

# Load the model file
>>> with open('tests/data/vaporetto.model', 'rb') as fp:
...     model = fp.read()

# Create an instance of the Vaporetto
>>> tokenizer = vaporetto.Vaporetto(model, predict_tags=True)

# Tokenize
>>> tokenizer.tokenize_to_string('まぁ社長は火星猫だ')
'まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ'

>>> tokens = tokenizer.tokenize('まぁ社長は火星猫だ')
>>> len(tokens)
6
>>> tokens[0].surface()
'まぁ'
>>> tokens[0].tag(0)
'名詞'
>>> tokens[0].tag(1)
'マー'
>>> [token.surface() for token in tokens]
['まぁ', '社長', 'は', '火星', '猫', 'だ']
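The string returned by `tokenize_to_string` separates tokens with spaces and joins each surface to its tags with `/`. If you only have that string (for example, from a log), it can be split back into structured data. The sketch below is not part of python-vaporetto; `parse_tagged_line` is a hypothetical helper, and it assumes that surfaces and tags contain no literal space or `/` characters.

```python
def parse_tagged_line(line: str) -> list[tuple[str, list[str]]]:
    """Split 'surface/tag/tag surface/tag/tag ...' into (surface, tags) pairs.

    ASSUMPTION: neither surfaces nor tags contain a literal ' ' or '/'.
    """
    pairs = []
    for chunk in line.split(" "):
        if not chunk:
            continue  # skip empty pieces, e.g. from an empty input line
        surface, *tags = chunk.split("/")
        pairs.append((surface, tags))
    return pairs


if __name__ == "__main__":
    # Output string taken from the example above.
    line = "まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ"
    for surface, tags in parse_tagged_line(line):
        print(surface, tags)
```

When the tokenizer itself is available, prefer `tokenize()` with `surface()` and `tag()`, which avoids any ambiguity about separators.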

Note for distributed models

The distributed models are compressed in zstd format. To load a compressed model, you must decompress it yourself before passing it to the API.

>>> import vaporetto
>>> import zstandard  # zstandard package in PyPI

>>> dctx = zstandard.ZstdDecompressor()
>>> with open('tests/data/vaporetto.model.zst', 'rb') as fp:
...     with dctx.stream_reader(fp) as dict_reader:
...         tokenizer = vaporetto.Vaporetto(dict_reader.read(), predict_tags=True)
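If you handle both compressed and uncompressed model files, the decompression step can be wrapped in a small loader. This is a sketch only: `read_model_bytes` is a hypothetical helper, not part of python-vaporetto, and it assumes that compressed models can be recognized by a `.zst` file extension.

```python
from pathlib import Path


def read_model_bytes(path: str) -> bytes:
    """Return raw model bytes, decompressing zstd files when needed.

    ASSUMPTION: compressed models are identified by a '.zst' suffix.
    """
    model_path = Path(path)
    if model_path.suffix == ".zst":
        # Imported lazily so that uncompressed models work without
        # the third-party zstandard package installed.
        import zstandard

        dctx = zstandard.ZstdDecompressor()
        with model_path.open("rb") as fp, dctx.stream_reader(fp) as reader:
            return reader.read()
    return model_path.read_bytes()
```

The returned bytes can then be passed to the constructor as in the examples above, e.g. `vaporetto.Vaporetto(read_model_bytes('tests/data/vaporetto.model.zst'), predict_tags=True)`.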

Note for KyTea's models

You can also use KyTea's models as follows:

>>> with open('path/to/jp-0.4.7-5.mod', 'rb') as fp:  # doctest: +SKIP
...     tokenizer = vaporetto.Vaporetto.create_from_kytea_model(fp.read())

Note: Vaporetto does not support tag prediction with KyTea's models.

License

Licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENSE-MIT)

at your option.

Contribution

See the guidelines.
