ku-nlp/jumanpp

Juman++ (a Morphological Analyzer Toolkit)

nlp.ist.i.kyoto-u.ac.jp/index.php?JUMAN%2B%2B

License

Apache-2.0 license

What is Juman++

A new morphological analyser that considers the semantic plausibility of word sequences by using a recurrent neural network language model (RNNLM). Version 2 has better accuracy and greatly (>250x) improved analysis speed compared with the original Juman++.

Installation

System Requirements

OS: Linux, MacOS X or Windows.
Compiler: C++14 compatible
- For example gcc 5.1+, clang 3.4+, MSVC 2017
- We test on GCC and clang on Linux/MacOS, mingw64-gcc and MSVC2017 on Windows

CMake v3.1 or later
For Ubuntu 22.04, you need to install additional packages as follows:
$ sudo apt install libprotobuf-dev protobuf-compiler

Read this document for CentOS and RHEL derivatives or non-CMake alternatives.

Building from a package

Download the package from Releases.

Important: The download should be around 300 MB. If it is not, you have probably downloaded a source snapshot, which does not contain a model.

$ tar xf jumanpp-<version>.tar.xz  # decompress the package
$ cd jumanpp-<version>             # move into the directory
$ mkdir bld                        # make a subdirectory for build
$ cd bld
# CMAKE_BUILD_TYPE=Release: you want to do this for performance
# CMAKE_INSTALL_PREFIX: where to install Juman++
$ cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=<prefix>
$ make install -j<parallelism>

Building from git

Important: Only the package distribution contains a pretrained model and can be used for analysis. The current git version is not compatible with the models of 2.0-rc1 and 2.0-rc2.

$ mkdir cmake-build-dir  # CMake does not support in-source builds
$ cd cmake-build-dir
$ cmake ..
$ make                   # optionally add -j<parallelism>

Usage

Quick start

% echo "魅力がたっぷりと詰まっている" | jumanpp
魅力 みりょく 魅力 名詞 6 普通名詞 1 * 0 * 0 "代表表記:魅力/みりょく カテゴリ:抽象物"
が が が 助詞 9 格助詞 1 * 0 * 0 NIL
たっぷり たっぷり たっぷり 副詞 8 * 0 * 0 * 0 "自動認識"
と と と 助詞 9 格助詞 1 * 0 * 0 NIL
詰まって つまって 詰まる 動詞 2 * 0 子音動詞ラ行 10 タ系連用テ形 14 "代表表記:詰まる/つまる ドメイン:料理・食事 自他動詞:他:詰める/つめる"
いる いる いる 接尾辞 14 動詞性接尾辞 7 母音動詞 1 基本形 2 "代表表記:いる/いる"
EOS
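The quick-start output above can be post-processed with a few lines of Python. The following is a minimal sketch, not part of Juman++ itself: the `Morpheme` record and `parse_juman_output` function are hypothetical names, and the column meanings (surface, reading, base form, part of speech, sub-part of speech, conjugation type, conjugation form, features, with numeric ids in between) are read off the sample output. It also assumes that lines starting with `@ ` are alternative candidates, which this sketch simply skips.

```python
from typing import List, NamedTuple


class Morpheme(NamedTuple):
    """One analyzed morpheme (hypothetical record type for this sketch)."""
    surface: str
    reading: str
    baseform: str
    pos: str
    subpos: str
    conjtype: str
    conjform: str
    features: str


def parse_juman_output(text: str) -> List[Morpheme]:
    """Parse Juman-style output lines into Morpheme records."""
    morphemes = []
    for line in text.splitlines():
        # "EOS" marks the end of a sentence. Lines starting with "@ " are
        # assumed to be alternative candidates and are skipped here.
        if not line or line == "EOS" or line.startswith("@ "):
            continue
        # The last column may contain spaces, so split at most 11 times.
        cols = line.split(" ", 11)
        if len(cols) != 12:
            continue  # not a 12-column morpheme line; ignore it
        morphemes.append(Morpheme(
            surface=cols[0], reading=cols[1], baseform=cols[2],
            pos=cols[3], subpos=cols[5],
            conjtype=cols[7], conjform=cols[9], features=cols[11],
        ))
    return morphemes
```

Calling `parse_juman_output` on the text captured from `jumanpp` returns one record per morpheme line; the numeric id columns are dropped because the string labels carry the same information in the sample.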

Main options

usage: jumanpp [options]
  -s, --specifics              lattice format output (unsigned int [=5])
  --beam <int>                 set local beam width used in analysis (unsigned int [=5])
  -v, --version                print version
  -h, --help                   print this message
  --model <file>               specify a model location

Use --help to see more options.
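When calling Juman++ from a script, the options listed above can be assembled programmatically. The sketch below is an illustration under assumptions: `build_jumanpp_command` and `analyze` are hypothetical helper names, the executable is assumed to be reachable as `jumanpp` on the PATH, and only the `--beam` and `--model` options from the usage text are covered.

```python
import subprocess
from typing import List, Optional


def build_jumanpp_command(beam: Optional[int] = None,
                          model: Optional[str] = None,
                          executable: str = "jumanpp") -> List[str]:
    """Build an argument list using options shown in the usage text."""
    cmd = [executable]
    if beam is not None:
        if beam < 0:
            # The usage text describes the beam width as an unsigned int.
            raise ValueError("beam width must not be negative")
        cmd += ["--beam", str(beam)]
    if model is not None:
        cmd += ["--model", model]
    return cmd


def analyze(text: str, **options) -> str:
    """Pipe UTF-8 text to jumanpp and return its standard output."""
    if not text.endswith("\n"):
        text += "\n"
    result = subprocess.run(
        build_jumanpp_command(**options),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

`analyze` needs an installed Juman++ with a model, while `build_jumanpp_command` can be checked on its own, which keeps the option handling separate from process execution.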

Input

Juman++ can handle only UTF-8 encoded text as input. Lines beginning with # will be interpreted as comments.
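Both input rules can be checked before text is handed to the analyzer. The following is a minimal sketch with hypothetical helper names (`to_utf8`, `split_comments`): it re-encodes text from a known source encoding into UTF-8 and reports which lines would be treated as comments because they begin with `#`. Dropping empty lines is a choice made by this sketch, not documented Juman++ behaviour.

```python
from typing import Iterable, List, Tuple


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding into UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def split_comments(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate sentences from lines that begin with '#'.

    Returns (sentences, comments). Empty lines are dropped, which is
    a simplification of this sketch.
    """
    sentences: List[str] = []
    comments: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            comments.append(line)
        elif line:
            sentences.append(line)
    return sentences, comments
```

A sentence that really starts with `#` ends up in the comment list here, which is a useful warning that the analyzer will not treat it as text to analyze.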

Training Jumandic Model

A set of scripts for training a Jumandic model is available in this repository. It is possible to modify the system dictionary to add other entries to the trained model.

Attention: You need to have access to Mainichi Shinbun for the year 1995 to be able to use the Kyoto University corpus for training.

Other

DEMO

You can play around with our web demo, which displays a subset of the whole lattice. The demo still uses v1, but it will be updated to v2 soon.

Extracting diffs caused by beam configurations

You can extract the sentences for which two different beam configurations produce different analyses. The src/jumandic/jpp_jumandic_pathdiff binary (source), located relative to the compilation root, does this. The only Jumandic-specific thing here is the usage of code-generated linear model inference.

Use the binary as jpp_jumandic_pathdiff <model> <input> > <output>.

The output is in the partial annotation format: the full-beam results are written as the actual tags, and the trimmed-beam results are written as comments.

Example:

# scores: -0.602687 -1.20004
# 子がい        pos:名詞        subpos:普通名詞 <------- trimmed beam result
# S-ID:w201007-0080605751-6 COUNT:2
熊本選抜にはマリノス、アントラーズのユースに行く
        子      pos:名詞        subpos:普通名詞 <------- full beam result
        が      pos:助詞        subpos:格助詞
        い      baseform:いる   conjtype:母音動詞       pos:動詞        conjform:基本連用形
ます
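To rank or filter the extracted diffs, the `# scores:` comments can be collected from the output file. This is a rough sketch under assumptions: it expects each `# scores:` comment to occupy its own line, as in the example, and `iter_score_lines` is a hypothetical helper name. The example does not say which score belongs to which beam configuration, so the values are returned in file order without interpretation.

```python
from typing import Iterable, Iterator, List

SCORES_PREFIX = "# scores:"


def iter_score_lines(lines: Iterable[str]) -> Iterator[List[float]]:
    """Yield the numeric values of every '# scores:' comment line."""
    for line in lines:
        if line.startswith(SCORES_PREFIX):
            # Everything after the prefix is whitespace-separated numbers.
            yield [float(tok) for tok in line[len(SCORES_PREFIX):].split()]
```

Passing an open text file (opened with `encoding="utf-8"`) works directly, since iterating over a file yields its lines.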

Partial Annotation Tool

We also have a partial annotation tool. Please see https://github.com/eiennohito/nlp-tools-demo for details.

Performance Notes

To get the best performance, you need to build with extended instruction sets. If you are planning to use Juman++ only locally, specify -DCMAKE_CXX_FLAGS="-march=native".

Juman++ works best on Intel Haswell and newer processors (because of the FMA and BMI instruction set extensions).

Using Juman++ to create your own Morphological Analyzer

Juman++ is a general tool. It does not depend on Jumandic or the Japanese language (although there is some Japanese-specific functionality). See this tutorial project, which shows how to implement something similar to a T9 text input for the case when there are no word boundaries in the input text.

Publications and Slides

About the model itself: Morphological Analysis for Unsegmented Languages using Recurrent Neural Network Language Model. Hajime Morita, Daisuke Kawahara, Sadao Kurohashi. EMNLP 2015. (link, bibtex)
V2 Improvements: Juman++ v2: A Practical and Modern Morphological Analyzer. Arseny Tolmachev and Sadao Kurohashi. The Proceedings of the Twenty-fourth Annual Meeting of the Association for Natural Language Processing. March 2018, Okayama, Japan. (pdf, slides)
Morphological Analysis Workshop in ANLP2018 Slides: 形態素解析システムJuman++ (Morphological Analysis System Juman++). Daisuke Kawahara, Arseny Tolmachev. In Japanese. (slides)
Juman++: A Morphological Analysis Toolkit for Scriptio Continua. Arseny Tolmachev, Daisuke Kawahara and Sadao Kurohashi. EMNLP 2018, Brussels. (pdf, poster, bibtex)
Design and Structure of The Juman++ Morphological Analyzer Toolkit. Arseny Tolmachev, Daisuke Kawahara, Sadao Kurohashi. Journal of Natural Language Processing. (paper, bibtex)

If you use Juman++ V1 in an academic setting, please cite the first work (EMNLP 2015). If you use Juman++ V2, please cite both the first and the fourth (EMNLP 2018) papers.

Authors

Arseny Tolmachev <arseny at kotonoha.ws>
Hajime Morita <hmorita at nlp.ist.i.kyoto-u.ac.jp>
Daisuke Kawahara <dk at i.kyoto-u.ac.jp>
Sadao Kurohashi <kuro at i.kyoto-u.ac.jp>

Acknowledgement

The list of all libraries used by Juman++ is here.

Notice

This is a branch for the Juman++ rewrite. The original version lives in the legacy branch.
