Joint source channel model for Japanese Kana Kanji conversion, Chinese pinyin input and CJE mixed input.


yohokuno/jsc


JSC is an implementation of the joint source-channel (joint n-gram) model with a monotonic decoder.

A Joint Source-Channel Model for Machine Transliteration, Li Haizhou, Zhang Min, Su Jian. http://acl.ldc.upenn.edu/acl2004/main/pdf/121_pdf_2-col.pdf

It can be used for machine transliteration, Japanese kana-kanji conversion, Chinese pinyin input, English word segmentation or pronunciation inference.
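The idea of monotonic decoding over joint source/target units can be illustrated with a toy sketch. This is not JSC's implementation: it uses a joint unigram model (no n-gram context, no smoothing), and the table `TOY_MODEL`, its probabilities, and the function name `decode` are all hypothetical. It only shows how a source string is consumed left to right while the best-scoring sequence of (source, target) pairs is tracked by dynamic programming.

```python
import math

# Hypothetical joint unigram model: (source, target) -> probability.
# The values are made up for illustration; JSC loads a real n-gram model.
TOY_MODEL = {
    ("this", "This"): 0.4,
    ("is", "is"): 0.3,
    ("a", "a"): 0.2,
    ("pencil", "pencil"): 0.05,
    ("pen", "pen"): 0.03,
    ("cil", "cil"): 0.001,
    ("わたし", "私"): 0.1,
    ("わたし", "渡し"): 0.01,
    ("の", "の"): 0.3,
    ("なまえ", "名前"): 0.05,
}


def decode(source, model=TOY_MODEL, max_len=8):
    """Return the most probable list of target units, or None if no path exists."""
    # Index candidate targets by their source side.
    by_source = {}
    for (src, tgt), prob in model.items():
        by_source.setdefault(src, []).append((tgt, math.log(prob)))

    n = len(source)
    # best[i] = (log score, previous position, target unit) for the prefix source[:i].
    best = [(-math.inf, None, None)] * (n + 1)
    best[0] = (0.0, None, None)

    # Monotonic: each unit covers source[start:end] and units never reorder.
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prev_score = best[start][0]
            if prev_score == -math.inf:
                continue
            for tgt, logp in by_source.get(source[start:end], []):
                score = prev_score + logp
                if score > best[end][0]:
                    best[end] = (score, start, tgt)

    if best[n][0] == -math.inf:
        return None

    # Follow the back-pointers from the end of the string.
    units = []
    pos = n
    while pos > 0:
        _, prev, tgt = best[pos]
        units.append(tgt)
        pos = prev
    return units[::-1]


if __name__ == "__main__":
    print(" ".join(decode("thisisapencil")))
    print(" ".join(decode("わたしのなまえ")))
```

A real joint n-gram decoder would condition each pair on the preceding pairs, so the search state would also have to carry that history.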

Requirement

JSC requires Unix, gcc, Python, and marisa-trie. If you want to use the RPC server, you also need libevent.

marisa-trie 0.2.0 or later http://code.google.com/p/marisa-trie/
libevent 2.0 or later http://libevent.org/

Install

To install JSC, type these commands into your console.

    $ ./waf configure [--prefix=INSTALL_DIRECTORY]
    $ ./waf build
    $ sudo ./waf install

Usage

jsc-decode

The jsc-decode command converts a source string into a target string using the joint source-channel model. Queries are read from standard input, one per line.

options:
    -d directory: specify data directory or prefix (default: ./)
    -f format: specify format (segment [default], plain, debug)
    -t table: specify table [romaji] mode (both [default], on, off)
    -l: turn off sentence-beginning/ending label
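If you want to call jsc-decode from a script, the options above can be assembled as in the following sketch. The helper names `build_decode_command` and `decode_lines` are hypothetical, and the sketch assumes that the binary lives at `./build/jsc-decode` (as in the samples below), that input and output are UTF-8, and that the command prints one output line per query.

```python
import subprocess

FORMATS = ("segment", "plain", "debug")
TABLE_MODES = ("both", "on", "off")


def build_decode_command(data_dir="./", fmt="segment", table="both",
                         labels=True, binary="./build/jsc-decode"):
    """Build the argument list for jsc-decode from the documented options."""
    if fmt not in FORMATS:
        raise ValueError("unknown format: %r" % fmt)
    if table not in TABLE_MODES:
        raise ValueError("unknown table mode: %r" % table)
    cmd = [binary, "-d", data_dir, "-f", fmt, "-t", table]
    if not labels:
        # -l turns off the sentence-beginning/ending label.
        cmd.append("-l")
    return cmd


def decode_lines(queries, **options):
    """Feed queries to jsc-decode on standard input, one per line.

    Assumption: one output line is returned per query.
    """
    result = subprocess.run(
        build_decode_command(**options),
        input="\n".join(queries) + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout.splitlines()
```

For example, `decode_lines(["thisisapencil"], data_dir="data/english/", fmt="plain")` would run the English model in plain format, provided the binary has been built.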

jsc-build

The jsc-build command builds model files in binary format from an n-gram file in text format.

options:
    -d directory: specify data directory or prefix (default: ./)
    -m model: specify model file name (default: ngram)
    -t trie_num: specify trie number in marisa-trie (default: 3)
    -r: build reverse model

jsc-server

The jsc-server command provides an RPC server over a simple TCP protocol. You can send queries line by line, for example with the telnet command.

options:
    -d directory: specify data directory or prefix (default: ./)
    -f format: specify format (segment [default], plain, debug)
    -t table: specify table [romaji] mode (both [default], on, off)
    -p port: specify port number (default: 40714)
    -l: turn off sentence-beginning/ending label
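Instead of telnet, a small client can talk to a running jsc-server. The sketch below is an assumption about the wire format, which is only described here as line-by-line: it sends one UTF-8 query terminated by a newline and reads a single response line. The function name `query_server` is hypothetical; the default port 40714 comes from the option list above.

```python
import socket


def query_server(text, host="localhost", port=40714, timeout=5.0):
    """Send one query line to jsc-server and return one response line.

    Assumptions: UTF-8 encoding, newline-terminated request,
    and exactly one response line per query.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(text.encode("utf-8") + b"\n")
        with sock.makefile("r", encoding="utf-8", newline="\n") as reader:
            return reader.readline().rstrip("\r\n")


if __name__ == "__main__":
    # Requires a server started separately, e.g. with -d data/japanese/.
    print(query_server("わたしのなまえはなかのです。"))
```

Opening a new connection per query keeps the sketch simple; a long-lived connection would avoid the repeated setup cost if the server keeps connections open.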

Sample Applications

Japanese Kana Kanji Conversion

For Japanese Kana Kanji conversion, a model is provided in the data/japanese directory. By default, both romaji and hiragana input are accepted.

    $ ./build/jsc-decode -d data/japanese/
    わたしのなまえはなかのです。
    わたし の 名前 は 中野 です 。
    arayurugenjitsuwosubetejibunnnohouhenejimagetanoda
    あらゆる 現実 を 全て 自分 の ほう へ ネジ 曲げ た の だ

Japanese Pronunciation Inference

For Japanese pronunciation inference, a model is provided in the data/japanese-reverse directory.

    $ ./build/jsc-decode -d data/japanese-reverse/
    魔理沙は大変なものを盗んでいきました
    ま りさ は たいへん な もの を ぬす ん で い き ま し た

Chinese Pinyin Input

For Chinese Pinyin input, a model is provided in the data/chinese/ directory.

    $ ./build/jsc-decode -d data/chinese/
    woaiziranyuyanchuli
    我 爱 自然 语言 处理
    zhejianshitagegehaibuzhidaone
    这 件 事 她 哥哥 海部 知道 呢

Chinese Hanzi-to-Pinyin Conversion

For Chinese Hanzi-to-Pinyin conversion, a model is provided in the data/chinese-reverse/ directory.

    $ ./build/jsc-decode -d data/chinese-reverse/
    汉字拼音转换
    hanzi pinyin zhuanhuan

English word segmentation / automatic capitalization

For English input, a model is provided in the data/english/ directory.

    $ ./build/jsc-decode -d data/english/
    alicewasbeginningtogetverytiredofsittingbyhersisteronthebank
    Alice was beginning to get very tired of sitting by her sister on the bank
    istandheretodayhumbledbythetaskbeforeusgratefulforthetrustyouhavebestowedmindfulofthesacrificesbornebyourancestors
    I Stand here today humbled by the task before us grateful for the trust you have bestowed mindful of the sacrifices borne by our ancestors

Mixed Input

For English/Japanese/Chinese mixed input, a model is provided in the data/mixed/ directory. The language is detected automatically.

    $ ./build/jsc-decode -d data/mixed/
    thisisapencil
    This is a pencil
    kyouhayoitenkidesune
    今日 は 良い 天気 です ね
    woshizhongguoren
    我 是 中国 人
    thisistotemohaochi!
    This is とても 好吃 !

Data Structure

Directories

The top directory contains these files and directories:

    README.md   this file
    build/      built by waf automatically
    data/       model files
    src/        C++ source and header files
    tools/      command-line tools written in C++
    waf         waf build script
    wscript     waf settings

File format

The n-gram file should be in SRILM format.

http://www.speech.sri.com/projects/srilm/

The target string and the source string should be joined with the character '/', e.g. "私/わたし".
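The pairing convention can be made concrete with a small parser. This sketch assumes the ARPA-style text layout that SRILM writes (log10 probability, a tab, the space-separated n-gram, and an optional tab plus backoff weight). The function names `split_pair` and `parse_ngram_line` are hypothetical, and treating labels such as `<s>` and `</s>` as unpaired tokens, or splitting on the first '/', are assumptions rather than documented JSC behavior.

```python
def split_pair(token):
    """Split a "target/source" token such as "私/わたし" into (target, source)."""
    # Assumption: labels like <s> and </s> are not pairs, so they are
    # returned unchanged on both sides ("</s>" itself contains a slash).
    if token.startswith("<") and token.endswith(">"):
        return token, token
    target, sep, source = token.partition("/")
    if not sep:
        raise ValueError("token has no '/' separator: %r" % token)
    return target, source


def parse_ngram_line(line):
    """Parse one n-gram entry: logprob <TAB> w1 w2 ... [<TAB> backoff]."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (2, 3):
        raise ValueError("unexpected number of fields: %r" % line)
    logprob = float(fields[0])
    pairs = [split_pair(token) for token in fields[1].split()]
    backoff = float(fields[2]) if len(fields) == 3 else None
    return logprob, pairs, backoff


if __name__ == "__main__":
    # Made-up probabilities, for illustration only.
    print(parse_ngram_line("-1.5\t私/わたし の/の\t-0.25"))
```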

Reference

Accuracy

| Language | F-score   | Size |
|----------|-----------|------|
| Japanese | 0.937     | 10MB |
| Chinese  | 0.895     | 9MB  |
| English  | not ready | 9MB  |
| Mixed    | not ready | 27MB |

Paper

Please refer to this paper if needed.

Yoh Okuno and Shinsuke Mori, An Ensemble Model of Word-based and Character-based Models for Japanese and Chinese Input Method, Workshop on Advances in Text Input Methods, 2012. http://yoh.okuno.name/pdf/wtim2012.pdf
