lighttransport/jagger-python


Python binding for Jagger (C++ implementation of Pattern-based Japanese Morphological Analyzer)

License

BSD-2-Clause license

Folders and files

.github/workflows
benchmark
cmake
cpp_cli
data
example
jagger
train
.cirrus.yml
.clang-format
CMakeLists.txt
LICENSE
MANIFEST.in
README.md
bootstrap-cpp-llvm-mingw-cross.sh
bootstrap-cpp-python.sh
jagger.BSD
jagger.GPL
jagger.LGPL
jagger.png
pyproject.toml
python-binding-train-jagger.cc
setup.py


jagger-python

Python binding for Jagger (C++ implementation of Pattern-based Japanese Morphological Analyzer): https://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jagger/index.en.html

Install

$ python -m pip install jagger

This does not install model files.

You can download a precompiled KWDLC model from https://github.com/lighttransport/jagger-python/releases/download/v0.1.0/model_kwdlc.tar.gz (Note that KWDLC has an unclear license/terms of use. Use it at your own risk.)

Example

    import jagger

    model_path = "model/kwdlc/patterns"

    tokenizer = jagger.Jagger()
    tokenizer.load_model(model_path)

    text = "吾輩は猫である。名前はまだない。"
    toks = tokenizer.tokenize(text)

    for tok in toks:
        print(tok.surface(), tok.feature())
    print("EOS")

    """
    吾輩    名詞,普通名詞,*,*,吾輩,わがはい,代表表記:我が輩/わがはい カテゴリ:人
    は      助詞,副助詞,*,*,は,は,*
    猫      名詞,普通名詞,*,*,猫,ねこ,*
    である  判定詞,*,判定詞,デアル列基本形,だ,である,*
    。      特殊,句点,*,*,。,。,*
    名前    名詞,普通名詞,*,*,名前,なまえ,*
    は      助詞,副助詞,*,*,は,は,*
    まだ    副詞,*,*,*,まだ,まだ,*
    ない    形容詞,*,イ形容詞アウオ段,基本形,ない,ない,*
    。      特殊,句点,*,*,。,。,*
    EOS
    """

    # print tags
    for tok in toks:
        # print tag(split feature() by comma)
        print(tok.surface())
        for i in range(tok.n_tags()):
            print("  tag[{}] = {}".format(i, tok.tag(i)))
    print("EOS")
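The feature strings in the sample output above have seven comma-separated columns, with `*` as a placeholder. As a rough sketch, they could be mapped to a dict in plain Python. The field names below are hypothetical labels chosen for illustration; jagger itself does not define them, and the column layout depends on the model in use.

```python
# Hypothetical field names for the 7 comma-separated columns seen in the
# KWDLC sample output; they are NOT defined by jagger itself.
FIELD_NAMES = (
    "pos", "pos_sub", "conj_type", "conj_form",
    "base_form", "reading", "semantic",
)


def feature_to_dict(feature):
    """Map a feature string to a dict, turning the '*' placeholder into None.

    Uses a naive split on ',' and therefore ignores quoted commas.
    """
    values = feature.split(",")
    return {
        name: (None if value == "*" else value)
        for name, value in zip(FIELD_NAMES, values)
    }


fields = feature_to_dict("名詞,普通名詞,*,*,猫,ねこ,*")
print(fields["pos"], fields["base_form"], fields["reading"])  # 名詞 猫 ねこ
```

With a loaded tokenizer, the same helper could be applied to `tok.feature()` for each token.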

Batch processing(experimental)

tokenize_batch tokenizes multiple lines (delimited by a newline: '\n', '\r', or '\r\n') at once. Line splitting is done on the C++ side.

    import jagger

    model_path = "model/kwdlc/patterns"

    tokenizer = jagger.Jagger()
    tokenizer.load_model(model_path)

    text = """
    吾輩は猫である。
    名前はまだない。
    明日の天気は晴れです。
    """

    # optional: set C++ threads(CPU cores) to use
    # default: Use all CPU cores.
    # tokenizer.set_threads(4)

    toks_list = tokenizer.tokenize_batch(text)

    for toks in toks_list:
        for tok in toks:
            print(tok.surface(), tok.feature())
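To make the line delimiting concrete, here is a pure-Python sketch of splitting on the three newline forms named above. It only illustrates the rule and is not the binding's C++ code. Dropping empty lines is an assumption; how `tokenize_batch` treats them is not stated here.

```python
import re

# Delimiters named in the text above: '\r\n', '\r', and '\n'.
# '\r\n' is listed first so it is consumed as a single delimiter.
_NEWLINE = re.compile(r"\r\n|\r|\n")


def split_lines(text):
    """Split text on the three newline forms, dropping empty lines (assumption)."""
    return [line for line in _NEWLINE.split(text) if line]


lines = split_lines("吾輩は猫である。\r\n名前はまだない。\r明日の天気は晴れです。\n")
print(len(lines))  # 3
```

Python's built-in `str.splitlines()` also splits on extra characters such as form feed, so an explicit pattern stays closer to the three delimiters listed.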

Train a model.

A Python interface for training a model is not provided yet. For now, you can build the C++ trainer CLI using CMake (Windows supported). See train/ for details.

Limitation

A single line must be less than 262,144 bytes (roughly 87,000 Japanese characters in UTF-8).
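One way to guard against this limit is to check the UTF-8 byte length of each line before tokenizing. The helper below is a minimal sketch. `oversized_lines` is a hypothetical name, not part of jagger, and the constant simply restates the figure quoted above.

```python
import re

MAX_LINE_BYTES = 262_144  # limit quoted in the text above


def oversized_lines(text):
    """Return (line_number, byte_length) for each line at or over the limit."""
    result = []
    for number, line in enumerate(re.split(r"\r\n|\r|\n", text), start=1):
        size = len(line.encode("utf-8"))  # bytes, not characters
        if size >= MAX_LINE_BYTES:
            result.append((number, size))
    return result


# "あ" takes 3 bytes in UTF-8, so 90,000 of them exceed the limit.
sample = "短い行\n" + "あ" * 90_000
print(oversized_lines(sample))  # [(2, 270000)]
```

Lines reported by the helper could then be split further or skipped before being passed to the tokenizer.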

Jagger version

The Jagger version used in this Python binding is 2023-02-18.

For developer

Set dev_mode=True to enable an ASan + debug build.

Run the Python script with

$ LD_PRELOAD=$(gcc -print-file-name=libasan.so) python FILE.py

or

$ LD_PRELOAD=$(clang -print-file-name=libclang_rt.asan-x86_64.so) python FILE.py

Releasing

The version is created automatically using setuptools_scm.

tag it: git tag vX.Y.Z
push tag: git push --tags

TODO

- Provide a model file trained from Wikipedia, UniDic, etc. (with a clearer and more permissive license and terms of use).
  - Use GiNZA for morphological analysis.
- Split the feature vector (CSV) considering the quote char when extracting tags.
  - e.g. 'a,b,"c,d",e' => ["a", "b", "c,d", "e"]
- Optimize the C++ <-> Python interface
  - string_view (or read-only string literal) for tag str.
  - pickle support (for exchanging Python objects when using multiprocessing)
    - https://pybind11.readthedocs.io/en/latest/advanced/classes.html#pickling-support
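Until quote-aware splitting lands in the binding, the TODO item about quoted fields can be approximated on the Python side with the standard `csv` module. This is a sketch of the intended behavior only, not the C++ implementation. `split_feature` is a hypothetical helper name.

```python
import csv


def split_feature(feature):
    """Split a comma-separated feature string, honoring double-quoted fields."""
    return next(csv.reader([feature]))


print(split_feature('a,b,"c,d",e'))  # ['a', 'b', 'c,d', 'e']
```

The helper could be applied to the string returned by `tok.feature()` when a model's features may contain quoted commas.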

License

Python binding is available under 2-clause BSD licence.

Jagger and ccedar_core.h are licensed under GPLv2/LGPLv2.1/BSD triple licenses.

Third party licences

stack_container.h: BSD-like license.
nanocsv.h: MIT license.


Releases

4 releases. Latest: v0.1.19 (Jan 20, 2024).
