Self-contained Japanese Morphological Analyzer written in pure Go
Kagome is an open source Japanese morphological analyzer written in pure Go. It can tokenize Japanese text into words and analyze parts of speech, with dictionaries embedded in the binary for easy deployment.
Note
Key features (improvements from v1):
- Self-contained binaries with embedded dictionaries (MeCab-IPADIC, UniDic)
- Multiple segmentation modes for different use cases
- RESTful API server mode for production use
- WebAssembly support for browser environments
```
% kagome -h
Japanese Morphological Analyzer -- github.com/ikawaha/kagome/v2
usage: kagome <command>
The commands are:
   [tokenize] - command line tokenize (*default)
   server     - run tokenize server
   lattice    - lattice viewer
   sentence   - tiny sentence splitter
   version    - show version

tokenize [-file input_file] [-dict dic_file] [-userdict user_dic_file] [-sysdict (ipa|uni)] [-simple false] [-mode (normal|search|extended)] [-split] [-json]
  -dict string
        dict
  -file string
        input file
  -json
        outputs in JSON format
  -mode string
        tokenize mode (normal|search|extended) (default "normal")
  -simple
        display abbreviated dictionary contents
  -split
        use tiny sentence splitter
  -sysdict string
        system dict type (ipa|uni) (default "ipa")
  -udict string
        user dict
```
```
% # piped standard input
% echo "すもももももももものうち" | kagome
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
```
- For more details, see the Commands section.
```
# Install Kagome module
go get github.com/ikawaha/kagome/v2
```

```go
package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// wakati (simple word splitting/segmentation)
	fmt.Println("---wakati---")
	seg := t.Wakati("すもももももももものうち")
	fmt.Println(seg)

	// tokenize w/ morphological analysis
	fmt.Println("---tokenize---")
	tokens := t.Tokenize("すもももももももものうち")
	for _, token := range tokens {
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}
```
output:
```
---wakati---
[すもも も もも も もも の うち]
---tokenize---
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
```
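Individual feature fields can also be read through accessor methods on each token instead of splitting Features() yourself. The sketch below assumes the POS, BaseForm, and Reading accessors available on tokens in recent v2 releases; check the package documentation for the exact API.

```go
package main

import (
	"fmt"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	for _, token := range t.Tokenize("すもももももももものうち") {
		// BaseForm and Reading return (value, ok) pairs because
		// unknown words may not carry these features.
		base, _ := token.BaseForm()
		reading, _ := token.Reading()
		fmt.Printf("%s\tPOS=%v\tbase=%s\treading=%s\n", token.Surface, token.POS(), base, reading)
	}
}
```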
- For more examples, see the ./_examples directory.
To get the kagome command line tool, choose your preferred installation method below:
Go (recommended)
```
go install github.com/ikawaha/kagome/v2@latest
```

Homebrew
```
# macOS and Linux (for both AMD64 and Arm64)
brew install ikawaha/kagome/kagome
```
Manual Install
- For manual installation, download and extract the appropriate archived file for your OS and architecture from the releases page.
- Note that the extracted binary must be placed in an accessible directory with execution permission.
Docker/Docker Compose
- See the Docker section below.
Major sub-commands of the kagome command line tool:
```
% # interactive/REPL mode
% kagome
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
```
```
% # piped standard input
% echo "すもももももももものうち" | kagome
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
```
```
% # JSON output
% # (For jq command see https://jqlang.org/)
% echo "猫" | kagome -json | jq .
[
  {
    "id": 286994,
    "start": 0,
    "end": 1,
    "surface": "猫",
    "class": "KNOWN",
    "pos": [
      "名詞",
      "一般",
      "*",
      "*"
    ],
    "base_form": "猫",
    "reading": "ネコ",
    "pronunciation": "ネコ",
    "features": [
      "名詞",
      "一般",
      "*",
      "*",
      "*",
      "*",
      "猫",
      "ネコ",
      "ネコ"
    ]
  }
]
```
```
% # word splitting/segmentation only (equivalent to "wakati" functionality)
% echo "すもももももももものうち" | kagome -json | jq -r '[.[].surface] | join("/")'
すもも/も/もも/も/もも/の/うち
```
```
% # Extract only pronunciations using jq (for Text-to-Speech purposes, etc.)
% echo "私ははにわよわわわんわん" | kagome -json | jq -r '.[].pronunciation'
ワタシ
ワ
ハニワ
ヨ
ワ
ワ
ワンワン
```
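If you prefer to post-process the JSON output in Go rather than with jq, a small struct that mirrors the fields shown above is enough. This is a sketch based on the example output, not an official schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// tokenJSON mirrors the fields shown in the `kagome -json` example above.
type tokenJSON struct {
	ID            int      `json:"id"`
	Start         int      `json:"start"`
	End           int      `json:"end"`
	Surface       string   `json:"surface"`
	Class         string   `json:"class"`
	POS           []string `json:"pos"`
	BaseForm      string   `json:"base_form"`
	Reading       string   `json:"reading"`
	Pronunciation string   `json:"pronunciation"`
	Features      []string `json:"features"`
}

func main() {
	// Usage: echo "すもももももももものうち" | kagome -json | go run main.go
	var tokens []tokenJSON
	if err := json.NewDecoder(os.Stdin).Decode(&tokens); err != nil {
		panic(err)
	}
	surfaces := make([]string, 0, len(tokens))
	for _, tok := range tokens {
		surfaces = append(surfaces, tok.Surface)
	}
	// Same result as the jq join("/") example above.
	fmt.Println(strings.Join(surfaces, "/"))
}
```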
For continuous usage, kagome provides a server mode so that the tokenizer's startup cost is paid only once instead of on every invocation.
Start a server and try to access the "/tokenize" endpoint.
```
% kagome server &
% curl -XPUT localhost:6060/tokenize -d'{"sentence":"すもももももももものうち", "mode":"normal"}' | jq .
```
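The same endpoint can be called from Go using only the standard library. This is a minimal sketch that mirrors the curl call above and simply prints the raw JSON response.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// PUT the sentence and mode to the running kagome server.
	body := []byte(`{"sentence":"すもももももももものうち", "mode":"normal"}`)
	req, err := http.NewRequest(http.MethodPut, "http://localhost:6060/tokenize", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Print the raw JSON response; pipe through jq or decode it as needed.
	b, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```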
Start a server and access http://localhost:6060 in your browser.
```
% kagome server &
```

Important
The demo web application uses graphviz to draw a lattice. You need graphviz to be installed on your system.
Tip
Kagome can be compiled to WebAssembly (wasm) and run locally in a web browser as well. For details, see the WebAssembly section.
- Wasm Demo: https://ikawaha.github.io/kagome/
The lattice sub-command is a debugging tool for the tokenization process; it outputs the lattice in graphviz dot format.
```
% kagome lattice 私は鰻 | dot -Tpng -o lattice.png
```
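The dot output can also be produced from library code. The sketch below assumes the tokenizer's AnalyzeGraph method available in recent v2 releases; verify the exact signature against the package documentation.

```go
package main

import (
	"os"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict())
	if err != nil {
		panic(err)
	}
	// Write the lattice in graphviz dot format to stdout, e.g.
	//   go run main.go | dot -Tpng -o lattice.png
	t.AnalyzeGraph(os.Stdout, "私は鰻", tokenizer.Normal)
}
```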
%echo"吾輩は猫である。名前はまだ無い。"| kagome sentence吾輩は猫である。名前はまだ無い。
This command is useful if a single line of data is too lengthy and you want to avoid errors such as `bufio.Scanner: token too long`.
%echo"吾輩は猫である。名前はまだ無い。"| kagome -json| jq -r'[.[].surface] | join("/")'吾輩/は/猫/で/ある/。/名前/は/まだ/無い/。%echo"吾輩は猫である。名前はまだ無い。"| kagome sentence| kagome -json| jq -r'[.[].surface] | join("/")'吾輩/は/猫/で/ある/。名前/は/まだ/無い/。
This command is equivalent to the -split option of the tokenize command.
%echo"吾輩は猫である。名前はまだ無い。"| kagome -split -json| jq -r'[.[].surface] | join("/")'吾輩/は/猫/で/ある/。名前/は/まだ/無い/。
Dictionaries supported by default:
| dict | source | package |
|---|---|---|
| MeCab IPADIC | mecab-ipadic-2.7.0-20070801 | github.com/ikawaha/kagome-dict/ipa |
| UniDic | unidic-mecab-2.1.2_src | github.com/ikawaha/kagome-dict/uni |

Experimental Features
| dict | source | package |
|---|---|---|
| mecab-ipadic-NEologd | mecab-ipadic-neologd | github.com/ikawaha/kagome-ipa-neologd |
| Korean MeCab | mecab-ko-dic-2.1.1-20180720 | github.com/ikawaha/kagome-dict-ko |
Note
For more details and the differences between the dictionaries, see the wiki.
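Switching dictionaries in library code is a one-line change: pass a different dictionary to tokenizer.New. The sketch below assumes the uni package exposes a Dict() constructor analogous to ipa.Dict().

```go
package main

import (
	"fmt"

	"github.com/ikawaha/kagome-dict/uni"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	// Build a tokenizer backed by UniDic instead of the default IPADIC.
	t, err := tokenizer.New(uni.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	for _, token := range t.Tokenize("すもももももももものうち") {
		fmt.Printf("%s\t%v\n", token.Surface, token.Features())
	}
}
```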
Similar to Kuromoji, Kagome also supports various segmentation modes (splitting strategies) to tokenize the input text.
- Normal: Regular segmentation
- Search: Uses a heuristic to perform additional segmentation that is useful for search purposes
- Extended: Similar to search mode, but also segments unknown words into uni-grams
| Untokenized | Normal | Search | Extended |
|---|---|---|---|
| 関西国際空港 | 関西国際空港 | 関西 国際 空港 | 関西 国際 空港 |
| 日本経済新聞 | 日本経済新聞 | 日本 経済 新聞 | 日本 経済 新聞 |
| シニアソフトウェアエンジニア | シニアソフトウェアエンジニア | シニア ソフトウェア エンジニア | シニア ソフトウェア エンジニア |
| デジカメを買った | デジカメ を 買っ た | デジカメ を 買っ た | デ ジ カ メ を 買っ た |
Note
If your purpose is search, try changing the segmentation mode before switching to another dictionary; see the sketch below.
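In library code, the segmentation mode is selected per call via Analyze. This sketch assumes the Analyze method and the Normal/Search/Extended mode constants of the v2 tokenizer package.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// Tokenize the same input with each mode and compare the segmentation.
	for _, mode := range []tokenizer.TokenizeMode{tokenizer.Normal, tokenizer.Search, tokenizer.Extended} {
		tokens := t.Analyze("関西国際空港", mode)
		surfaces := make([]string, 0, len(tokens))
		for _, tok := range tokens {
			surfaces = append(surfaces, tok.Surface)
		}
		fmt.Printf("%v:\t%s\n", mode, strings.Join(surfaces, " "))
	}
}
```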
We provide scratch-based Docker images that simply run the kagome command line tool on various architectures: AMD64, Arm64, and Arm32 (Arm v5, v6 and v7).
Pull the image
```
docker pull ikawaha/kagome:latest

# Alternatively, you can pull from GitHub Container Registry
docker pull ghcr.io/ikawaha/kagome:latest
```

Run the command via Docker
```
# Interactive/REPL mode
docker run --rm -it ikawaha/kagome:latest

# If pulling from GitHub Container Registry
docker run --rm -it ghcr.io/ikawaha/kagome:latest
```

Run the server via Docker
```
# Server mode (http://localhost:6060)
docker run --rm -p 6060:6060 ikawaha/kagome:latest server

# If pulling from GitHub Container Registry
docker run --rm -p 6060:6060 ghcr.io/ikawaha/kagome:latest server
```

docker-compose.yml example

```yaml
services:
  kagome:
    image: ikawaha/kagome:latest
    ports: ["6060:6060"]
    command: server
    restart: unless-stopped
```
Note: The base image doesn't include Graphviz. For lattice visualization, see the examples.
Kagome compiles to WebAssembly for browser use.
- Live demo: https://ikawaha.github.io/kagome/
- Source code: ./_examples/wasm
Detailed Reference Manual in Japanese:
Community Wiki in English:
License: MIT