Juman++ (a Morphological Analyzer Toolkit)
A new morphological analyser that considers semantic plausibility of word sequences by using a recurrent neural network language model (RNNLM). Version 2 has better accuracy and greatly (>250x) improved analysis speed than the original Juman++.
- OS: Linux, MacOS X or Windows.
- Compiler: C++14 compatible
- For example gcc 5.1+, clang 3.4+, MSVC 2017
- We test on GCC and clang on Linux/MacOS, mingw64-gcc and MSVC2017 on Windows
- CMake v3.1 or later
- For Ubuntu 22.04, you need to install additional packages as follows:
    sudo apt install libprotobuf-dev protobuf-compiler
Read this document for CentOS and RHEL derivatives or non-CMake alternatives.
Download the package from Releases.
Important: The download should be around 300 MB. If it is not, you have probably downloaded a source snapshot, which does not contain a model.
    $ tar xf jumanpp-<version>.tar.xz  # decompress the package
    $ cd jumanpp-<version>             # move into the directory
    $ mkdir bld                        # make a subdirectory for build
    $ cd bld
    # Release build type: you want to do this for performance
    # CMAKE_INSTALL_PREFIX: where to install Juman++
    $ cmake .. \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=<prefix>
    $ make install -j<parallelism>
Important: Only the package distribution contains a pretrained model and can be used for analysis. The current git version is not compatible with the models of 2.0-rc1 and 2.0-rc2.
    $ mkdir cmake-build-dir  # in-source builds are not supported
    $ cd cmake-build-dir
    $ cmake ..
    $ make  # -j
    % echo "魅力がたっぷりと詰まっている" | jumanpp
    魅力 みりょく 魅力 名詞 6 普通名詞 1 * 0 * 0 "代表表記:魅力/みりょく カテゴリ:抽象物"
    が が が 助詞 9 格助詞 1 * 0 * 0 NIL
    たっぷり たっぷり たっぷり 副詞 8 * 0 * 0 * 0 "自動認識"
    と と と 助詞 9 格助詞 1 * 0 * 0 NIL
    詰まって つまって 詰まる 動詞 2 * 0 子音動詞ラ行 10 タ系連用テ形 14 "代表表記:詰まる/つまる ドメイン:料理・食事 自他動詞:他:詰める/つめる"
    いる いる いる 接尾辞 14 動詞性接尾辞 7 母音動詞 1 基本形 2 "代表表記:いる/いる"
    EOS
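The output above can be consumed from a script. The following is a minimal Python sketch, not part of Juman++ itself: it assumes each morpheme line holds 12 space-separated fields in the order shown in the example, and the field names in `Morpheme` are chosen here for illustration. Lines that do not have exactly 12 fields are skipped.

```python
from typing import List, NamedTuple


class Morpheme(NamedTuple):
    # Field names are assumptions made for this sketch.
    surface: str
    reading: str
    baseform: str
    pos: str
    pos_id: int
    subpos: str
    subpos_id: int
    conjtype: str
    conjtype_id: int
    conjform: str
    conjform_id: int
    features: str  # empty string when the output says NIL


def parse_output(text: str) -> List[List[Morpheme]]:
    """Split analyzer output into sentences (lists of morphemes)."""
    sentences: List[List[Morpheme]] = []
    current: List[Morpheme] = []
    for line in text.splitlines():
        if line == "EOS":  # sentence terminator
            sentences.append(current)
            current = []
            continue
        # The last field may contain spaces, so split at most 11 times.
        parts = line.split(" ", 11)
        if len(parts) != 12:
            continue  # not a plain morpheme line; ignored in this sketch
        features = parts[11]
        if features == "NIL":
            features = ""
        elif len(features) >= 2 and features[0] == features[-1] == '"':
            features = features[1:-1]  # drop the surrounding quotes
        current.append(Morpheme(
            parts[0], parts[1], parts[2],
            parts[3], int(parts[4]),
            parts[5], int(parts[6]),
            parts[7], int(parts[8]),
            parts[9], int(parts[10]),
            features,
        ))
    if current:  # output that ended without a final EOS
        sentences.append(current)
    return sentences
```

Calling `parse_output` on the captured standard output of `jumanpp` yields one list per input sentence, so `[m.surface for m in sentences[0]]` gives the segmentation of the first sentence.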
    usage: jumanpp [options]
      -s, --specifics     lattice format output (unsigned int [=5])
      --beam <int>        set local beam width used in analysis (unsigned int [=5])
      -v, --version       print version
      -h, --help          print this message
      --model <file>      specify a model location
Use --help to see more options.
JUMAN++ can handle only UTF-8 encoded text as input. Lines beginning with # will be interpreted as comments.
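As an illustration of these input rules, here is a minimal Python sketch for calling the analyzer from a script. It uses only the `--model` and `--beam` options listed in the usage text; the helper names (`build_command`, `prepare_input`, `analyze`) are hypothetical, and feeding one sentence per line is an assumption based on the `echo` example.

```python
import shutil
import subprocess
from typing import Iterable, List, Optional


def build_command(model: Optional[str] = None,
                  beam: Optional[int] = None,
                  executable: str = "jumanpp") -> List[str]:
    """Build the argument list using options from the usage text."""
    cmd = [executable]
    if model is not None:
        cmd += ["--model", str(model)]
    if beam is not None:
        cmd += ["--beam", str(beam)]
    return cmd


def prepare_input(sentences: Iterable[str]) -> bytes:
    """Encode sentences as UTF-8, one per line (an assumption of this sketch).

    Sentences starting with '#' are rejected, because the analyzer
    would interpret such lines as comments.
    """
    lines = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip blank entries
        if sentence.startswith("#"):
            raise ValueError("line would be read as a comment: " + sentence)
        lines.append(sentence)
    if not lines:
        return b""
    return ("\n".join(lines) + "\n").encode("utf-8")


def analyze(sentences: Iterable[str], **options) -> str:
    """Run the analyzer and return its standard output as text."""
    cmd = build_command(**options)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(cmd[0] + " was not found on PATH")
    result = subprocess.run(cmd, input=prepare_input(sentences),
                            stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8")
```

Passing bytes rather than text keeps the encoding explicit, so the result does not depend on the locale of the calling process.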
A set of scripts for training the Jumandic model is available in this repository. It is possible to modify the system dictionary to add other entries to the trained model.
Attention: You need to have access to Mainichi Shinbun for Year 1995 to be able to use the Kyoto University corpus for training.
You can try our web demo, which displays a subset of the whole lattice. The demo still uses v1, but it will be updated to v2 soon.
You can see sentences in which two different beam configurations produce different analyses. The src/jumandic/jpp_jumandic_pathdiff binary (source) (path relative to the compilation root) does this. The only Jumandic-specific thing here is the use of code-generated linear model inference.
Use the binary as jpp_jumandic_pathdiff <model> <input> > <output>.
The output is in the partial annotation format: the full beam results are the actual tags, and the trimmed beam results are written as comments.
Example:
    # scores: -0.602687 -1.20004
    # 子がい pos:名詞 subpos:普通名詞 <------- trimmed beam result
    # S-ID:w201007-0080605751-6 COUNT:2
    熊本選抜にはマリノス、アントラーズのユースに行く
        子 pos:名詞 subpos:普通名詞 <------- full beam result
        が pos:助詞 subpos:格助詞
        い baseform:いる conjtype:母音動詞 pos:動詞 conjform:基本連用形
    ます
We also have a partial annotation tool. Please see https://github.com/eiennohito/nlp-tools-demo for details.
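To skim a large pathdiff output, it can help to pull out only the score lines. The Python sketch below assumes that each differing sentence is preceded by a line starting with `# scores:` followed by whitespace-separated numbers, as in the example; it does not assume which number belongs to which beam configuration. The function name `read_scores` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple

SCORE_PREFIX = "# scores:"


def read_scores(lines: Iterable[str]) -> Iterator[Tuple[float, ...]]:
    """Yield the numbers found on every '# scores:' comment line."""
    for line in lines:
        line = line.strip()
        if line.startswith(SCORE_PREFIX):
            values = line[len(SCORE_PREFIX):].split()
            yield tuple(float(v) for v in values)
```

For example, `sum(1 for _ in read_scores(open(path, encoding="utf-8")))` counts how many score lines a given output file contains.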
To get the best performance, you need to build with extended instruction sets. If you are planning to use Juman++ only locally, specify -DCMAKE_CXX_FLAGS="-march=native".
Works best on Intel Haswell and newer processors (because of FMA and BMI instruction set extensions).
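Whether a given machine offers these extensions can be checked before building. The Python sketch below assumes a Linux x86 system, where `/proc/cpuinfo` contains a `flags` line; the flag names `fma`, `bmi1` and `bmi2` are the ones the Linux kernel reports, and checking both BMI variants is an assumption of this sketch, since the text above only says "BMI".

```python
from typing import Iterable, List, Set


def cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line (for example, a non-x86 system)


def missing_extensions(cpuinfo_text: str,
                       wanted: Iterable[str] = ("fma", "bmi1", "bmi2")) -> List[str]:
    """List the wanted extensions that the CPU does not report."""
    flags = cpu_flags(cpuinfo_text)
    return [name for name in wanted if name not in flags]
```

On Linux, `missing_extensions(open("/proc/cpuinfo").read())` returns an empty list when all three flags are present.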
Juman++ is a general tool. It does not depend on Jumandic or the Japanese language (although it has some Japanese-specific functionality). See this tutorial project, which shows how to implement something similar to a T9 text input for the case when there are no word boundaries in the input text.
About the model itself: Morphological Analysis for Unsegmented Languages using Recurrent Neural Network Language Model. Hajime Morita, Daisuke Kawahara, Sadao Kurohashi. EMNLP 2015. link, bibtex.
V2 improvements: Juman++ v2: A Practical and Modern Morphological Analyzer. Arseny Tolmachev and Kurohashi Sadao. The Proceedings of the Twenty-fourth Annual Meeting of the Association for Natural Language Processing. March 2018, Okayama, Japan. (pdf, slides)
Morphological Analysis Workshop in ANLP2018 Slides: 形態素解析システムJuman++. 河原 大輔, Arseny Tolmachev. (in Japanese) slides.
Juman++: A Morphological Analysis Toolkit for Scriptio Continua. Arseny Tolmachev, Daisuke Kawahara and Sadao Kurohashi. EMNLP 2018, Brussels. pdf, poster, bibtex.
Design and Structure of The Juman++ Morphological Analyzer Toolkit. Arseny Tolmachev, Daisuke Kawahara, Sadao Kurohashi. Journal of Natural Language Processing. (paper, bibtex)
If you use Juman++ V1 in an academic setting, please cite the first work (EMNLP 2015). If you use Juman++ V2, please cite both the first and the fourth (EMNLP 2018) papers.
- Arseny Tolmachev <arseny at kotonoha.ws>
- Hajime Morita <hmorita at nlp.ist.i.kyoto-u.ac.jp>
- Daisuke Kawahara <dk at i.kyoto-u.ac.jp>
- Sadao Kurohashi <kuro at i.kyoto-u.ac.jp>
The list of all libraries used by JUMAN++ is here.
This is a branch for the Juman++ rewrite. The original version lives in the legacy branch.