Joint source channel model for Japanese Kana Kanji conversion, Chinese pinyin input and CJE mixed input.


JSC: Joint Source Channel Model and Decoder

JSC is an implementation of the joint source channel model (also called the joint n-gram model) with a monotonic decoder.

A Joint Source-Channel Model for Machine Transliteration, Li Haizhou, Zhang Min, Su Jian.
http://acl.ldc.upenn.edu/acl2004/main/pdf/121_pdf_2-col.pdf
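
As a rough reminder of what the cited model computes, the joint n-gram factorization can be written as below. The notation is an assumption chosen for this README (the paper uses its own symbols): S is the source string, T is the target string, and the pair is segmented into K aligned units, each coupling a source piece s with a target piece t.

```latex
P(S, T) = \prod_{k=1}^{K} P\left( \langle s, t \rangle_k \;\middle|\; \langle s, t \rangle_{k-n+1}^{k-1} \right)
```

Each factor is an ordinary n-gram probability whose "words" are joint source/target units, which is why a standard n-gram toolkit can be used to train the model.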

It can be used for machine transliteration, Japanese kana-kanji conversion, Chinese pinyin input, English word segmentation or pronunciation inference.
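
To illustrate what a monotonic decoder does, here is a minimal Python sketch. It is not JSC's C++ implementation: the unit table `TOY_UNITS`, its scores, and the function `decode` are hypothetical, and the sketch scores units independently (a unigram model) where JSC uses n-gram context. It only shows the core idea of consuming the source left to right, without reordering, and keeping the best-scoring segmentation at each position.

```python
import math

# Hypothetical toy table: source substring -> [(target, log-probability)].
# A real model would condition each unit on the preceding units.
TOY_UNITS = {
    "this": [("This", -1.0)],
    "is": [("is", -1.0)],
    "a": [("a", -1.5)],
    "pen": [("pen", -2.0)],
    "pencil": [("pencil", -2.5)],
    "cil": [("cil", -6.0)],
}


def decode(source, units=TOY_UNITS, max_len=8):
    """Return the best-scoring list of target units, or None if none exists."""
    n = len(source)
    # best[i] = (score, previous position, target unit) for the prefix source[:i]
    best = [(-math.inf, None, None)] * (n + 1)
    best[0] = (0.0, None, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prev_score = best[start][0]
            if prev_score == -math.inf:
                continue  # no way to reach this start position
            for target, logp in units.get(source[start:end], []):
                score = prev_score + logp
                if score > best[end][0]:
                    best[end] = (score, start, target)
    if best[n][0] == -math.inf:
        return None
    # Follow back-pointers from the end of the string to recover the units.
    output = []
    pos = n
    while pos > 0:
        _, prev, target = best[pos]
        output.append(target)
        pos = prev
    return list(reversed(output))
```

With this toy table, `decode("thisisapencil")` prefers the segmentation ending in "pencil" over the one ending in "pen" plus "cil", because its total log-probability is higher.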

Requirement

JSC requires Unix, gcc, Python, and marisa-trie. To use the RPC server, you also need libevent.

marisa-trie 0.2.0 or later: http://code.google.com/p/marisa-trie/
libevent 2.0 or later: http://libevent.org/

Install

To install JSC, type these commands into your console.

$ ./waf configure [--prefix=INSTALL_DIRECTORY]
$ ./waf build
$ sudo ./waf install

Usage

jsc-decode

The jsc-decode command converts a source string into a target string using the joint source channel model. Queries are read from standard input, one per line.

options:
-d directory: specify data directory or prefix (default: ./)
-f format: specify format (segment [default], plain, debug)
-t table: specify table [romaji] mode (both [default], on, off)
-l: turn off sentence-beginning/ending label
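
If you want to call jsc-decode from a script, the options above map directly onto an argument list. The following Python sketch is one way to do that. The helper names `build_decode_command` and `decode_lines` are hypothetical, the binary path `./build/jsc-decode` is taken from the sample commands in this README, and the sketch assumes the decoder prints one output line per input line.

```python
import subprocess


def build_decode_command(data_dir="./", fmt="segment", table="both",
                         label=True, binary="./build/jsc-decode"):
    """Build the argument list for jsc-decode from the documented options."""
    command = [binary, "-d", data_dir, "-f", fmt, "-t", table]
    if not label:
        # -l turns off the sentence-beginning/ending label.
        command.append("-l")
    return command


def decode_lines(queries, **options):
    """Feed queries to jsc-decode on standard input and return its output lines.

    Assumption: one output line per query line.
    """
    result = subprocess.run(
        build_decode_command(**options),
        input="\n".join(queries) + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout.splitlines()

# Example (requires a built jsc-decode and the bundled model):
# decode_lines(["woaiziranyuyanchuli"], data_dir="data/chinese/")
```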

jsc-build

The jsc-build command builds binary model files from an N-gram file in text format.

options:
-d directory: specify data directory or prefix (default: ./)
-m model: specify model file name (default: ngram)
-t trie_num: specify trie number in marisa-trie (default: 3)
-r: build reverse model

jsc-server

The jsc-server command provides an RPC server over a simple TCP protocol. You can send queries line by line, for example with the telnet command.

options:
-d directory: specify data directory or prefix (default: ./)
-f format: specify format (segment [default], plain, debug)
-t table: specify table [romaji] mode (both [default], on, off)
-p port: specify port number (default: 40714)
-l: turn off sentence-beginning/ending label
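
Because the server speaks a line-based TCP protocol, a small client can stand in for telnet. The Python sketch below is a guess at a minimal client, not part of JSC: the function name `query_jsc_server` is hypothetical, and it assumes queries and responses are UTF-8 text terminated by a newline, with one response line per query. Only the default port 40714 comes from the option list above.

```python
import socket


def query_jsc_server(query, host="localhost", port=40714, timeout=5.0):
    """Send one query line to a running jsc-server and return one response line.

    Assumptions: UTF-8 text, newline-terminated, one response line per query.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(query.encode("utf-8") + b"\n")
        # Read back a single line of text from the connection.
        with sock.makefile("r", encoding="utf-8", newline="\n") as reader:
            return reader.readline().rstrip("\r\n")

# Example (requires jsc-server to be running locally):
# print(query_jsc_server("kyouhayoitenkidesune"))
```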

Sample Applications

Japanese Kana Kanji Conversion

For Japanese Kana Kanji conversion, a model is provided in the data/japanese directory. By default, both romaji and hiragana input are accepted.

$ ./build/jsc-decode -d data/japanese/
わたしのなまえはなかのです。
わたし の 名前 は 中野 です 。
arayurugenjitsuwosubetejibunnnohouhenejimagetanoda
あらゆる 現実 を 全て 自分 の ほう へ ネジ 曲げ た の だ

Japanese Pronunciation Inference

For Japanese pronunciation inference, a model is provided in the data/japanese-reverse directory.

$ ./build/jsc-decode -d data/japanese-reverse/
魔理沙は大変なものを盗んでいきました
ま りさ は たいへん な もの を ぬす ん で い き ま し た

Chinese Pinyin Input

For Chinese Pinyin input, a model is provided in the data/chinese/ directory.

$ ./build/jsc-decode -d data/chinese/
woaiziranyuyanchuli
我 爱 自然 语言 处理
zhejianshitagegehaibuzhidaone
这 件 事 她 哥哥 海部 知道 呢

Chinese Hanzi-to-Pinyin Conversion

For Chinese Hanzi-to-Pinyin conversion, a model is provided in the data/chinese-reverse/ directory.

$ ./build/jsc-decode -d data/chinese-reverse/
汉字拼音转换
hanzi pinyin zhuanhuan

English word segmentation / automatic capitalization

For English input, a model is provided in the data/english/ directory.

$ ./build/jsc-decode -d data/english/
alicewasbeginningtogetverytiredofsittingbyhersisteronthebank
Alice was beginning to get very tired of sitting by her sister on the bank
istandheretodayhumbledbythetaskbeforeusgratefulforthetrustyouhavebestowedmindfulofthesacrificesbornebyourancestors
I Stand here today humbled by the task before us grateful for the trust you have bestowed mindful of the sacrifices borne by our ancestors

Mixed Input

For English/Japanese/Chinese mixed input, a model is provided in the data/mixed/ directory. The language is detected automatically.

$ ./build/jsc-decode -d data/mixed/
thisisapencil
This is a pencil
kyouhayoitenkidesune
今日 は 良い 天気 です ね
woshizhongguoren
我 是 中国 人
thisistotemohaochi!
This is とても 好吃 ！

Data Structure

Directories

The top directory contains these files and directories:

README.md   this file
build/      built by waf automatically
data/       model files
src/        C++ source and header files
tools/      command-line tools written in C++
waf         waf build script
wscript     waf settings

File format

The N-gram file should be in SRILM format.

http://www.speech.sri.com/projects/srilm/

Each target string and its source string should be joined with the character '/', with the target first, e.g. "私/わたし".
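
To make the coupling concrete, the following Python sketch parses one n-gram entry whose tokens are target/source pairs. It is illustrative only: the function names are hypothetical, and it assumes the tab-separated ARPA-style line layout that SRILM writes (log-probability, space-separated tokens, optional back-off weight). It also assumes the last '/' in a token is the separator, and that sentence markers such as `<s>` and `</s>` appear uncoupled.

```python
SPECIAL_TOKENS = {"<s>", "</s>", "<unk>"}


def split_joint_unit(token):
    """Split a 'target/source' token into (target, source)."""
    if token in SPECIAL_TOKENS:
        # Sentence markers are kept as-is; note that "</s>" contains '/'.
        return token, token
    target, separator, source = token.rpartition("/")
    if not separator:
        raise ValueError(f"token has no '/' separator: {token!r}")
    return target, source


def parse_ngram_line(line):
    """Parse 'logprob<TAB>tok1 tok2 ...[<TAB>backoff]' into its parts."""
    fields = line.rstrip("\n").split("\t")
    logprob = float(fields[0])
    units = [split_joint_unit(token) for token in fields[1].split(" ")]
    backoff = float(fields[2]) if len(fields) > 2 and fields[2] else None
    return logprob, units, backoff
```

For example, a hypothetical bigram line `-1.5<TAB>私/わたし の/の<TAB>-0.25` would parse into the log-probability -1.5, the units (私, わたし) and (の, の), and the back-off weight -0.25.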

Reference

Accuracy

Language	F-score	Size
Japanese	0.937	10MB
Chinese	0.895	9MB
English	not ready	9MB
Mixed	not ready	27MB

Paper

Please refer to this paper if needed.

Yoh Okuno and Shinsuke Mori, An Ensemble Model of Word-based and Character-based Models for Japanese and Chinese Input Method, Workshop on Advances in Text Input Methods, 2012.
http://yoh.okuno.name/pdf/wtim2012.pdf
