ku-nlp/kwja

An integrated Japanese analyzer based on foundation models

KWJA: Kyoto-Waseda Japanese Analyzer [1]

[Paper (ja)] [Paper (en)] [Slides]

KWJA is an integrated Japanese text analyzer based on foundation models. KWJA performs many text analysis tasks, including:

Typo correction
Sentence segmentation
Word segmentation
Word normalization
Morphological analysis
Word feature tagging
Base phrase feature tagging
NER (Named Entity Recognition)
Dependency parsing
Predicate-argument structure (PAS) analysis
Bridging reference resolution
Coreference resolution
Discourse relation analysis

Requirements

Python: 3.9+
Dependencies: See pyproject.toml.
GPUs with CUDA (optional)
GPUs with MPS (optional)

Getting Started

Install KWJA with pip:

$ pip install kwja

Perform language analysis with the kwja command (the result is in the KNP format):

# Analyze a text
$ kwja --text "KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。"

# Analyze text files and write the result to a file
$ kwja --filename path/to/file1.txt --filename path/to/file2.txt > path/to/analyzed.knp

# Analyze texts interactively
$ kwja
Please end your input with a new line and type "EOD"
KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。
EOD

If you use Windows and PowerShell, you need to set the PYTHONUTF8 environment variable to 1:

> $env:PYTHONUTF8 = "1"
> kwja ...

The output is in the KNP format, which looks like the following:

# S-ID:202210010000-0-0 kwja:1.0.2
* 2D
+ 5D <rel type="=" sid="202210011918-0-0"/><体言><NE:ARTIFACT:KWJA>
KWJA ＫWＪＡ KWJA 名詞 6 固有名詞 3 * 0 * 0 <基本句-主辞>
は は は 助詞 9 副助詞 2 * 0 * 0 "代表表記:は/は" <代表表記:は/は>
* 2D
+ 2D <体言>
日本 にほん 日本 名詞 6 地名 4 * 0 * 0 "代表表記:日本/にほん 地名:国" <代表表記:日本/にほん><地名:国><基本句-主辞>
+ 4D <体言><係:ノ格>
語 ご 語 名詞 6 普通名詞 1 * 0 * 0 "代表表記:語/ご 漢字読み:音 カテゴリ:抽象物" <代表表記:語/ご><漢字読み:音><カテゴリ:抽象物><基本句-主辞>
の の の 助詞 9 接続助詞 3 * 0 * 0 "代表表記:の/の" <代表表記:の/の>
...
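To make the layout of the sample above concrete, here is a minimal sketch of a line classifier for KNP-format output. It is an illustration only, not how KWJA or rhoknp parse the format. The function name `classify_knp_line` is hypothetical, and the sketch assumes that comment lines start with `# `, phrase lines with `*`, base phrase lines with `+`, that a sentence ends with `EOS`, and that every other line is a morpheme whose first four space-separated fields are surface form, reading, lemma, and POS. For real use, prefer rhoknp.

```python
import re

# Phrase ("*") and base phrase ("+") lines carry the index of their head and a
# dependency type letter, e.g. "* 2D" or "+ 5D <...>". The letters D, P, A, I
# are assumed here to be the only dependency types.
_HEADER = re.compile(r"^([*+]) (-?\d+)([DPAI])(?: (.*))?$")


def classify_knp_line(line: str) -> dict:
    """Return a small dict describing one line of KNP-format output."""
    line = line.rstrip("\n")
    if line == "EOS":
        return {"kind": "eos"}
    if line.startswith("# "):
        return {"kind": "comment", "text": line[2:]}
    match = _HEADER.match(line)
    if match:
        marker, head, dep_type, features = match.groups()
        return {
            "kind": "phrase" if marker == "*" else "base_phrase",
            "head": int(head),
            "dep_type": dep_type,
            "features": features or "",
        }
    # Anything else is treated as a morpheme line. Pad with empty strings so
    # that short or malformed lines do not raise an IndexError.
    surface, reading, lemma, pos = (line.split(" ") + [""] * 4)[:4]
    return {
        "kind": "morpheme",
        "surface": surface,
        "reading": reading,
        "lemma": lemma,
        "pos": pos,
    }
```

A morpheme whose surface form is itself `*` or `+` could in principle be confused with a header line; the regular expression reduces that risk by requiring a head index and a dependency type letter, but this sketch makes no stronger guarantee.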

Here are the options for the kwja command:

--text: Text to be analyzed.
--filename: Path to a text file to be analyzed. You can specify this option multiple times.
--model-size: Model size to be used. Specify one of tiny, base (default), and large.
--device: Device to be used. Specify cpu, cuda, or mps. If not specified, the device is automatically selected.
--typo-batch-size: Batch size for typo module.
--char-batch-size: Batch size for character module.
--seq2seq-batch-size: Batch size for seq2seq module.
--word-batch-size: Batch size for word module.
--tasks: Tasks to be performed. Specify one or more of the following values separated by commas:
- typo: Typo correction
- char: Sentence segmentation, Word segmentation, and Word normalization
- seq2seq: Word segmentation, Word normalization, Reading prediction, Lemmatization, and Canonicalization
- word: Morphological analysis, Named entity recognition, Word feature tagging, Dependency parsing, PAS analysis, Bridging reference resolution, and Coreference resolution

--config-file: Path to a custom configuration file.
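When kwja is driven from a script, it can help to assemble the argument list in one place and validate the values before launching the process. The sketch below is one hedged way to do that, using only the flags and allowed values listed above. The helper name `build_kwja_command` is hypothetical and is not part of KWJA.

```python
from typing import List, Optional, Sequence

# Allowed values as listed in the option descriptions above.
MODEL_SIZES = ("tiny", "base", "large")
DEVICES = ("cpu", "cuda", "mps")
TASKS = ("typo", "char", "seq2seq", "word")


def build_kwja_command(
    text: Optional[str] = None,
    filenames: Sequence[str] = (),
    model_size: Optional[str] = None,
    device: Optional[str] = None,
    tasks: Sequence[str] = (),
) -> List[str]:
    """Build an argument list for the kwja command (hypothetical helper)."""
    if model_size is not None and model_size not in MODEL_SIZES:
        raise ValueError(f"unknown model size: {model_size}")
    if device is not None and device not in DEVICES:
        raise ValueError(f"unknown device: {device}")
    unknown = [task for task in tasks if task not in TASKS]
    if unknown:
        raise ValueError(f"unknown tasks: {unknown}")

    command = ["kwja"]
    if text is not None:
        command += ["--text", text]
    for filename in filenames:
        # --filename may be given multiple times.
        command += ["--filename", filename]
    if model_size is not None:
        command += ["--model-size", model_size]
    if device is not None:
        command += ["--device", device]
    if tasks:
        # --tasks takes comma-separated values.
        command += ["--tasks", ",".join(tasks)]
    return command
```

The resulting list can be passed to `subprocess.run(command, capture_output=True, text=True, check=True)`, whose `stdout` would then hold the analysis result, assuming kwja is installed and on the path.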

You can read a KNP format file with rhoknp.

from rhoknp import Document

with open("analyzed.knp") as f:
    parsed_document = Document.from_knp(f.read())

For more details about the KNP format, see Reference.

Usage from Python

Make sure the kwja command is in your path:

$ which kwja
/path/to/kwja

Install rhoknp:

$ pip install rhoknp

Perform language analysis with the kwja instance:

from rhoknp import KWJA

kwja = KWJA()
analyzed_document = kwja.apply(
    "KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。"
)
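Once a document has been analyzed, a common next step is to walk its morphemes. The following is a minimal sketch that assumes the analyzed document exposes a `morphemes` sequence whose items have `text`, `reading`, `lemma`, and `pos` attributes, which is how rhoknp's object model is understood to work. The helper names `morpheme_rows` and `format_rows` are hypothetical, and the code is duck-typed so that it does not import rhoknp itself.

```python
from typing import Any, List, Tuple

Row = Tuple[str, str, str, str]


def morpheme_rows(document: Any) -> List[Row]:
    """Collect (surface, reading, lemma, POS) for every morpheme.

    `document` is assumed to behave like an analyzed rhoknp document:
    it must provide `morphemes`, each with text/reading/lemma/pos.
    """
    return [(m.text, m.reading, m.lemma, m.pos) for m in document.morphemes]


def format_rows(rows: List[Row]) -> str:
    """Render the rows as tab-separated lines, one morpheme per line."""
    return "\n".join("\t".join(row) for row in rows)
```

With the example above, `print(format_rows(morpheme_rows(analyzed_document)))` would list one morpheme per line, provided the assumed attributes are present.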

Configuration

kwja can be configured with a configuration file to set the default options. See Config file example for details.

Config file location

On non-Windows systems, kwja follows the XDG Base Directory Specification convention for the location of the configuration file. The configuration directory kwja uses is itself named kwja, and in that directory it looks for a file named config.yaml. For most people, it should be enough to put the config file at ~/.config/kwja/config.yaml. You can also provide a configuration file in a non-standard location with the environment variable KWJA_CONFIG_FILE or the command-line option --config-file.
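The lookup described above can be sketched as a small function. This is not KWJA's own code: the name `resolve_config_path` is hypothetical, and the precedence used here (command-line option first, then KWJA_CONFIG_FILE, then the XDG location) is an assumption, since the text does not state which wins when both are given. The XDG part follows the specification's rule that `XDG_CONFIG_HOME` defaults to `~/.config` when unset or empty.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_config_path(
    cli_option: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Path:
    """Return the config file path a non-Windows setup would likely use."""
    env = os.environ if env is None else env

    # Assumed precedence: explicit --config-file beats the environment variable.
    if cli_option:
        return Path(cli_option)
    if env.get("KWJA_CONFIG_FILE"):
        return Path(env["KWJA_CONFIG_FILE"])

    # XDG Base Directory: fall back to ~/.config if XDG_CONFIG_HOME is unset/empty.
    home = Path.home() if home is None else home
    xdg_config_home = env.get("XDG_CONFIG_HOME")
    base = Path(xdg_config_home) if xdg_config_home else home / ".config"
    return base / "kwja" / "config.yaml"
```

Passing `env` and `home` explicitly keeps the function easy to check without touching the real environment.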

Config file example

model_size: base
device: cpu
num_workers: 0
torch_compile: false
typo_batch_size: 1
char_batch_size: 1
seq2seq_batch_size: 1
word_batch_size: 1
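If the config file is generated by a script, the flat key-value layout of the example is simple enough to write without a YAML library. The sketch below is a hedged illustration: `EXAMPLE_CONFIG` copies the keys and values from the example above, `render_config` is a hypothetical helper, and rejecting unknown keys is a choice made here rather than documented KWJA behavior.

```python
from typing import Any, Dict, Optional

# Keys and values copied from the example config file above.
EXAMPLE_CONFIG: Dict[str, Any] = {
    "model_size": "base",
    "device": "cpu",
    "num_workers": 0,
    "torch_compile": False,
    "typo_batch_size": 1,
    "char_batch_size": 1,
    "seq2seq_batch_size": 1,
    "word_batch_size": 1,
}


def render_config(overrides: Optional[Dict[str, Any]] = None) -> str:
    """Render a flat YAML config, starting from the example values."""
    options = dict(EXAMPLE_CONFIG)
    for key, value in (overrides or {}).items():
        if key not in EXAMPLE_CONFIG:
            # Only keys that appear in the example are accepted in this sketch.
            raise KeyError(f"unknown config key: {key}")
        options[key] = value

    lines = []
    for key, value in options.items():
        # YAML booleans are lowercase; check bool before other types
        # because bool is a subclass of int in Python.
        rendered = ("true" if value else "false") if isinstance(value, bool) else str(value)
        lines.append(f"{key}: {rendered}")
    return "\n".join(lines) + "\n"
```

For instance, `render_config({"model_size": "large", "device": "cuda"})` yields text that could be written to `~/.config/kwja/config.yaml`.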

Performance Table

typo, character, seq2seq, and word modules
- The performance on each task except typo correction and discourse relation analysis is the mean over all the corpora (KC, KWDLC, Fuman, and WAC) and over three runs with different random seeds.
- We set the learning rate of RoBERTa_LARGE (word) to 2e-5 because we failed to fine-tune it with a higher learning rate. Other hyperparameters are the same as those described in configs, which are tuned for DeBERTa_BASE.
seq2seq module
- The performance on each task is the mean over all the corpora (KC, KWDLC, Fuman, and WAC).
  - * denotes results of a single run
- Scores are calculated using a separate script from the character and word modules.

| Task | | v1.x base (char, word) | v2.x base (char, word / seq2seq) | v1.x large (char, word) | v2.x large (char, word / seq2seq) |
|---|---|---|---|---|---|
| Typo Correction | | 79.0 | 76.7 | 80.8 | 83.1 |
| Sentence Segmentation | | - | 98.4 | - | 98.6 |
| Word Segmentation | | 98.5 | 98.1 / 98.2* | 98.7 | 98.4 / 98.4* |
| Word Normalization | | 44.0 | 15.3 | 39.8 | 48.6 |
| Morphological Analysis | POS | 99.3 | 99.4 | 99.3 | 99.4 |
| | sub-POS | 98.1 | 98.5 | 98.2 | 98.5 |
| | conjtype | 99.4 | 99.6 | 99.2 | 99.6 |
| | conjform | 99.5 | 99.7 | 99.4 | 99.7 |
| | reading | 95.5 | 95.4 / 96.2* | 90.8 | 95.6 / 96.8* |
| | lemma | - | - / 97.8* | - | - / 98.1* |
| | canon | - | - / 95.2* | - | - / 95.9* |
| Named Entity Recognition | | 83.0 | 84.6 | 82.1 | 85.9 |
| Linguistic Feature Tagging | word | 98.3 | 98.6 | 98.5 | 98.6 |
| | base phrase | 86.6 | 93.6 | 86.4 | 93.4 |
| Dependency Parsing | | 92.9 | 93.5 | 93.8 | 93.6 |
| PAS Analysis | | 74.2 | 76.9 | 75.3 | 77.5 |
| Bridging Reference Resolution | | 66.5 | 67.3 | 65.2 | 67.5 |
| Coreference Resolution | | 74.9 | 78.6 | 75.9 | 79.2 |
| Discourse Relation Analysis | | 42.2 | 39.2 | 41.3 | 44.3 |

Citation

@InProceedings{Ueda2023a,
  author    = {Nobuhiro Ueda and Kazumasa Omura and Takashi Kodama and Hirokazu Kiyomaru and Yugo Murawaki and Daisuke Kawahara and Sadao Kurohashi},
  title     = {KWJA: A Unified Japanese Analyzer Based on Foundation Models},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
  year      = {2023},
  address   = {Toronto, Canada},
}

@InProceedings{植田2022,
  author    = {植田 暢大 and 大村 和正 and 児玉 貴志 and 清丸 寛一 and 村脇 有吾 and 河原 大輔 and 黒橋 禎夫},
  title     = {KWJA：汎用言語モデルに基づく日本語解析器},
  booktitle = {第253回自然言語処理研究会},
  year      = {2022},
  address   = {京都},
}

@InProceedings{児玉2023,
  author    = {児玉 貴志 and 植田 暢大 and 大村 和正 and 清丸 寛一 and 村脇 有吾 and 河原 大輔 and 黒橋 禎夫},
  title     = {テキスト生成モデルによる日本語形態素解析},
  booktitle = {言語処理学会 第29回年次大会},
  year      = {2023},
  address   = {沖縄},
}

License

This software is released under the MIT License, see LICENSE.

Reference

KNP format

Footnotes

1. Pronunciation: /kuʒa/
