JSC is an implementation of the joint source-channel model (also known as the joint n-gram model) with a monotonic decoder.
A Joint Source-Channel Model for Machine Transliteration, Li Haizhou, Zhang Min, Su Jian.
http://acl.ldc.upenn.edu/acl2004/main/pdf/121_pdf_2-col.pdf
It can be used for machine transliteration, Japanese kana-kanji conversion, Chinese pinyin input, English word segmentation or pronunciation inference.
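To make the idea of joint units and monotonic decoding concrete, here is a minimal sketch in Python. It is not JSC's implementation: it assumes a unigram model over joint (source, target) units instead of a full n-gram, the probabilities are invented for illustration, and the names `UNIT_LOGPROB` and `decode` are hypothetical.

```python
import math

# Hypothetical toy model: log-probabilities of joint (source, target) units.
# The numbers are invented for illustration and come from no real JSC model.
UNIT_LOGPROB = {
    ("this", "This"): math.log(0.4),
    ("is", "is"): math.log(0.3),
    ("a", "a"): math.log(0.2),
    ("pen", "pen"): math.log(0.05),
    ("pencil", "pencil"): math.log(0.04),
    ("cil", "cil"): math.log(0.01),
}


def decode(source, model=UNIT_LOGPROB, max_len=8):
    """Return (target units, log-probability) of the best monotonic segmentation.

    Returns None when the source cannot be covered by known units.
    """
    # Index the joint units by their source side.
    by_source = {}
    for (src, tgt), logprob in model.items():
        by_source.setdefault(src, []).append((tgt, logprob))

    n = len(source)
    # best[i] = (score of the best path covering source[:i], previous position, target unit)
    best = [(-math.inf, None, None)] * (n + 1)
    best[0] = (0.0, None, None)

    # Monotonic decoding: source positions are consumed strictly left to right.
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prev_score = best[start][0]
            if prev_score == -math.inf:
                continue  # source[:start] is not reachable
            for tgt, logprob in by_source.get(source[start:end], []):
                score = prev_score + logprob
                if score > best[end][0]:
                    best[end] = (score, start, tgt)

    if best[n][0] == -math.inf:
        return None

    # Follow the back-pointers from the end to recover the target units.
    units = []
    pos = n
    while pos > 0:
        _, prev, tgt = best[pos]
        units.append(tgt)
        pos = prev
    units.reverse()
    return units, best[n][0]


print(" ".join(decode("thisisapencil")[0]))  # prints: This is a pencil
```

A real joint n-gram model would condition each unit on the preceding units, which turns the table lookup above into an n-gram query but keeps the same left-to-right search.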
Requirement
JSC requires Unix, gcc, Python, and marisa-trie. To use the RPC server, you also need libevent.
marisa-trie 0.2.0 or later: http://code.google.com/p/marisa-trie/
libevent 2.0 or later: http://libevent.org/
Install
To install JSC, type these commands into your console.
jsc-decode
The jsc-decode command converts a source string into a target string using the joint source-channel model. Queries are read from standard input, one per line.
options:
    -d directory: specify data directory or prefix (default: ./)
    -f format: specify format (segment [default], plain, debug)
    -t table: specify table [romaji] mode (both [default], on, off)
    -l: turn off sentence-beginning/ending label
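Because jsc-decode reads queries from standard input, it can be driven from a script. The following Python sketch shows one way to do that. It assumes the binary has been built at ./build/jsc-decode, that the Japanese model is in data/japanese/, and that the output is UTF-8 with one result line per query. The helper name `decode_lines` is hypothetical.

```python
import subprocess
import sys


def decode_lines(
    queries,
    command=("./build/jsc-decode", "-d", "data/japanese/", "-f", "plain"),
):
    """Send each query on its own line to the command's stdin.

    Returns the command's output as a list of lines. The one-result-per-query
    correspondence is an assumption about jsc-decode, not a documented guarantee.
    """
    payload = "".join(query + "\n" for query in queries)
    result = subprocess.run(
        list(command),
        input=payload,
        capture_output=True,
        encoding="utf-8",  # assumed encoding of both input and output
        check=True,        # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Example: python decode_lines.py kyouhayoitenkidesune
    for line in decode_lines(sys.argv[1:]):
        print(line)
```

Passing the command as a parameter keeps the helper usable with other model directories or output formats.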
jsc-build
The jsc-build command builds model files in binary format from an n-gram file in text format.
options:
    -d directory: specify data directory or prefix (default: ./)
    -m model: specify model file name (default: ngram)
    -t trie_num: specify trie number in marisa-trie (default: 3)
    -r: build reverse model
jsc-server
The jsc-server command provides an RPC server over a simple TCP protocol. You can send queries line by line, for example with telnet.
options:
    -d directory: specify data directory or prefix (default: ./)
    -f format: specify format (segment [default], plain, debug)
    -t table: specify table [romaji] mode (both [default], on, off)
    -p port: specify port number (default: 40714)
    -l: turn off sentence-beginning/ending label
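Since the server accepts queries line by line over TCP, a small client can replace telnet. The Python sketch below assumes a jsc-server is listening on the default port 40714, that text is UTF-8, and that the server answers each query with a single newline-terminated line. The wire format is not documented here, so treat that last point as an assumption. The function name `query_server` is hypothetical.

```python
import socket


def query_server(text, host="localhost", port=40714, timeout=5.0):
    """Send one query line and return the first line of the response.

    Assumes a line-oriented UTF-8 protocol: one query line in, one result line out.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall((text + "\n").encode("utf-8"))
        # makefile() wraps the socket in a text reader so we can read one line.
        with sock.makefile("r", encoding="utf-8", newline="\n") as reader:
            return reader.readline().rstrip("\r\n")


if __name__ == "__main__":
    print(query_server("kyouhayoitenkidesune"))
```

Opening a new connection per query keeps the sketch simple. A long-running client would more likely keep one connection open and reuse it.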
Sample Applications
Japanese Kana Kanji Conversion
For Japanese Kana Kanji conversion, a model is provided in the data/japanese/ directory. By default, both romaji and hiragana input are accepted.
English word segmentation / automatic capitalization
For English input, a model is provided in the data/english/ directory.
    $ ./build/jsc-decode -d data/english/
    alicewasbeginningtogetverytiredofsittingbyhersisteronthebank
    Alice was beginning to get very tired of sitting by her sister on the bank
    istandheretodayhumbledbythetaskbeforeusgratefulforthetrustyouhavebestowedmindfulofthesacrificesbornebyourancestors
    I Stand here today humbled by the task before us grateful for the trust you have bestowed mindful of the sacrifices borne by our ancestors
Mixed Input
For mixed English/Japanese/Chinese input, a model is provided in the data/mixed/ directory. The language is detected automatically.
    $ ./build/jsc-decode -d data/mixed/
    thisisapencil
    This is a pencil
    kyouhayoitenkidesune
    今日 は 良い 天気 です ね
    woshizhongguoren
    我 是 中国 人
    thisistotemohaochi!
    This is とても 好吃 !
Data Structure
Directories
The top directory contains these files and directories:
    README.md   this file
    build/      built by waf automatically
    data/       model files
    src/        source and header files for C++
    tools/      command tools by C++
    waf         waf build script
    wscript     waf settings
File format
The n-gram file should be in SRILM format.
http://www.speech.sri.com/projects/srilm/
Each token couples a target string and a source string with the character '/', target first; e.g. "私/わたし".
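The coupled token format above can be sketched in Python as follows. The helpers `split_pair`, `join_pair`, and `to_training_line` are hypothetical names. Splitting at the first '/' and joining tokens with spaces (the usual whitespace-separated form for language-model training text) are assumptions. How JSC handles a literal '/' inside a string is not covered.

```python
def split_pair(token, sep="/"):
    """Split a coupled token 'target/source' into (target, source).

    Splits at the first separator; escaping of a literal '/' is not handled.
    """
    target, found, source = token.partition(sep)
    if not found or not target or not source:
        raise ValueError(f"not a target{sep}source token: {token!r}")
    return target, source


def join_pair(target, source, sep="/"):
    """Couple a target string and a source string into one token."""
    return f"{target}{sep}{source}"


def to_training_line(pairs, sep="/"):
    """Render (target, source) pairs as one whitespace-separated line of tokens."""
    return " ".join(join_pair(target, source, sep) for target, source in pairs)


print(split_pair("私/わたし"))  # prints: ('私', 'わたし')
print(to_training_line([("私", "わたし"), ("は", "は")]))  # prints: 私/わたし は/は
```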
Reference
Accuracy
| Language | F-score | Size |
| --- | --- | --- |
| Japanese | 0.937 | 10MB |
| Chinese | 0.895 | 9MB |
| English | not ready | 9MB |
| Mixed | not ready | 27MB |
Paper
Please refer to this paper if needed.
Yoh Okuno and Shinsuke Mori, An Ensemble Model of Word-based and Character-based Models for Japanese and Chinese Input Method, Workshop on Advances in Text Input Methods, 2012.
http://yoh.okuno.name/pdf/wtim2012.pdf
About
Joint source channel model for Japanese Kana Kanji conversion, Chinese pinyin input and CJE mixed input.