Python binding for Jagger(C++ implementation of Pattern-based Japanese Morphological Analyzer)
lighttransport/jagger-python
Python binding for Jagger (C++ implementation of Pattern-based Japanese Morphological Analyzer): https://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jagger/index.en.html
$ python -m pip install jagger
This does not install model files.
You can download a precompiled KWDLC model from https://github.com/lighttransport/jagger-python/releases/download/v0.1.0/model_kwdlc.tar.gz (Note that KWDLC has an unclear license and terms of use. Use it at your own risk.)
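The download step can also be scripted. The following is a minimal sketch using only the standard library; the function names `download_model` and `extract_model` are made up for illustration, and the layout inside the archive is not verified here (the examples in this README use `model/kwdlc/patterns`, so inspect the returned member names to locate the patterns file).

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the text above.
MODEL_URL = (
    "https://github.com/lighttransport/jagger-python/"
    "releases/download/v0.1.0/model_kwdlc.tar.gz"
)


def download_model(url=MODEL_URL, archive_path="model_kwdlc.tar.gz"):
    """Download the archive unless it already exists; return its local path."""
    archive = Path(archive_path)
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))
    return archive


def extract_model(archive_path, dest_dir="."):
    """Unpack a .tar.gz archive into dest_dir and return its member names."""
    with tarfile.open(str(archive_path), "r:gz") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Safer extraction on Python versions that support filters.
            tar.extractall(str(dest_dir), filter="data")
        else:
            tar.extractall(str(dest_dir))
    return names


# Typical use (requires network access):
#   names = extract_model(download_model())
#   print(names)  # inspect the layout to find the patterns file
```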
    import jagger

    model_path = "model/kwdlc/patterns"

    tokenizer = jagger.Jagger()
    tokenizer.load_model(model_path)

    text = "吾輩は猫である。名前はまだない。"
    toks = tokenizer.tokenize(text)

    for tok in toks:
        print(tok.surface(), tok.feature())
    print("EOS")

    """
    吾輩 名詞,普通名詞,*,*,吾輩,わがはい,代表表記:我が輩/わがはい カテゴリ:人
    は 助詞,副助詞,*,*,は,は,*
    猫 名詞,普通名詞,*,*,猫,ねこ,*
    である 判定詞,*,判定詞,デアル列基本形,だ,である,*
    。 特殊,句点,*,*,。,。,*
    名前 名詞,普通名詞,*,*,名前,なまえ,*
    は 助詞,副助詞,*,*,は,は,*
    まだ 副詞,*,*,*,まだ,まだ,*
    ない 形容詞,*,イ形容詞アウオ段,基本形,ない,ない,*
    。 特殊,句点,*,*,。,。,*
    EOS
    """

    # print tags
    for tok in toks:
        # print tag (split feature() by comma)
        print(tok.surface())
        for i in range(tok.n_tags()):
            print(" tag[{}] = {}".format(i, tok.tag(i)))
    print("EOS")
tokenize_batch tokenizes multiple lines (delimited by a newline: '\n', '\r', or '\r\n') at once. Line splitting is done on the C++ side.
    import jagger

    model_path = "model/kwdlc/patterns"

    tokenizer = jagger.Jagger()
    tokenizer.load_model(model_path)

    text = """
    吾輩は猫である。
    名前はまだない。
    明日の天気は晴れです。
    """

    # optional: set C++ threads (CPU cores) to use
    # default: Use all CPU cores.
    # tokenizer.set_threads(4)

    toks_list = tokenizer.tokenize_batch(text)

    for toks in toks_list:
        for tok in toks:
            print(tok.surface(), tok.feature())
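If you need to relate batch results back to input lines, it can help to split the text in Python using the same three delimiters. This is a sketch of the delimiter rule only, not the binding's C++ code; `split_lines` is a hypothetical helper, and whether tokenize_batch returns an entry for empty lines is not stated here, so check that before pairing lines with results.

```python
import re

# The three newline delimiters named above. "\r\n" is listed first so that
# a CRLF pair counts as one line break rather than two.
_NEWLINE = re.compile(r"\r\n|\r|\n")


def split_lines(text):
    """Split text on CRLF, CR, or LF, keeping empty lines."""
    return _NEWLINE.split(text)


lines = split_lines("吾輩は猫である。\r\n名前はまだない。\n明日の天気は晴れです。")
print(len(lines))  # 3
```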
A Python interface for training a model is not provided yet. For the time being, you can build the C++ trainer CLI using CMake (Windows is supported). See train/ for details.
A single line must be shorter than 262,144 bytes (approximately 87,000 Japanese characters in UTF-8).
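Because the limit is counted in bytes rather than characters, it may be worth checking input before tokenizing. The sketch below is a plain-Python pre-check; `find_too_long_lines` is a hypothetical helper and not part of the jagger API, and it assumes the limit applies to the UTF-8 encoding of each line.

```python
MAX_LINE_BYTES = 262_144  # limit stated above: each line must be shorter


def find_too_long_lines(lines, limit=MAX_LINE_BYTES):
    """Return (index, byte_length) for every line at or over the limit."""
    too_long = []
    for i, line in enumerate(lines):
        n_bytes = len(line.encode("utf-8"))
        if n_bytes >= limit:
            too_long.append((i, n_bytes))
    return too_long


# Most Japanese characters take 3 bytes in UTF-8, hence the ~87,000 figure.
print(find_too_long_lines(["吾輩は猫である。", "あ" * 90_000]))
```

A caller could skip or truncate the reported lines before passing the remaining text to the tokenizer.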
The Jagger version used in this Python binding is 2023-02-18.
Set dev_mode=True to enable an ASan + debug build.
Run the Python script with
$ LD_PRELOAD=$(gcc -print-file-name=libasan.so) python FILE.py
or
$ LD_PRELOAD=$(clang -print-file-name=libclang_rt.asan-x86_64.so) python FILE.py
The version is created automatically using setuptools_scm.
- tag it:
git tag vX.Y.Z
- push tag:
git push --tags
- Provide a model file trained from Wikipedia, UniDic, etc. (clearer and more permissive licensing and terms of use).
- Use GiNZA for morphological analysis.
- Split the feature vector (CSV) taking the quote character into account when extracting tags.
- e.g. 'a,b,"c,d",e' => ["a", "b", "c,d", "e"]
- Optimize C++ <-> Python interface
- string_view (or read-only string literal) for tag str.
- pickle support (for exchanging Python objects when using multiprocessing)
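Until quote-aware tag splitting from the list above is available, the feature string can be split on the Python side. A minimal sketch using the standard csv module follows; `split_feature` is a hypothetical helper, not part of the binding, and it assumes features use double quotes as in the 'a,b,"c,d",e' example.

```python
import csv


def split_feature(feature):
    """Split a comma-separated feature string, honoring double quotes."""
    # csv.reader expects an iterable of lines, so wrap the string in a list.
    return next(csv.reader([feature]))


print(split_feature('a,b,"c,d",e'))  # ['a', 'b', 'c,d', 'e']
```

In practice the argument would be the string returned by tok.feature().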
Python binding is available under 2-clause BSD licence.
Jagger and ccedar_core.h are licensed under the GPLv2/LGPLv2.1/BSD triple license.
- stack_container.h: BSD-like license.
- nanocsv.h: MIT license.