You probably want a modern GPU to train faster; the developer uses a p2.xlarge instance on AWS.
Preparing your data
Training data is a UTF-8 text file in which each line corresponds to a sentence. A sentence is segmented into words by space characters, and a word is a pair of target (i.e. Kanji, Hiragana or Katakana) and source (Hiragana), concatenated by a slash character.
私/わたし の/の 名前/なまえ は/は 中野/なかの です/です 。/。
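For illustration, the following minimal Python sketch (not part of the repository) parses one such line into (target, source) word pairs:

```python
# Sketch only: parse a training line of the form "target/source target/source ..."
line = "私/わたし の/の 名前/なまえ は/は 中野/なかの です/です 。/。"

pairs = []
for word in line.split(" "):
    target, source = word.split("/")
    pairs.append((target, source))

print(pairs)  # [('私', 'わたし'), ('の', 'の'), ('名前', 'なまえ'), ...]
```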
Test data should contain sentences that do not appear in the training data, but ideally from the same domain as the training data. The source file should contain source sentences without spaces.
きょうのてんきははれです。
The target file should contain target sentences without spaces.
今日の天気は晴れです。
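If your test sentences are already annotated in the training format, you can derive both files with a small script. The sketch below is only an illustration (it is not a repository script, and the output file names are examples):

```python
# Sketch only: derive plain source/target test files from sentences written
# in the "target/source" training format. Output file names are examples.
annotated = ["今日/きょう の/の 天気/てんき は/は 晴れ/はれ です/です 。/。"]

with open("test.source.txt", "w", encoding="utf-8") as src, \
     open("test.target.txt", "w", encoding="utf-8") as tgt:
    for sentence in annotated:
        words = [w.split("/") for w in sentence.split(" ")]
        src.write("".join(source for _, source in words) + "\n")
        tgt.write("".join(target for target, _ in words) + "\n")
```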
Pre-processing BCCWJ
The developer uses the human-annotated part of BCCWJ as the training and testing corpus. You can use the scripts in this repository to pre-process the XML files after they have been extracted from the compressed archive. For example, the following commands parse the data and split it into training, test source and test target files.
Training a model
Now you can train your own model with default parameters.
train.py train.txt model
See the help message for optional parameters such as the number of hidden units and the dropout probability.
train.py --help
Decoding sentences
Once you have trained your model, you can decode sentences with it.
decode.py model
Type a source sentence into your console, and it will show the decoded sentence like this:
きょうのてんきははれです。
今日の天気は晴れです。
きょじんにせんせい
巨人に先制
Alternatively, you can give file names for input and output.
decode.py model --input_file test.source.txt --output_file model/test.decode.txt
You can trade accuracy for decoding speed by tuning pruning parameters such as the beam size and Viterbi size. For example, the following option is faster than the default beam size of 5, but less accurate.
decode.py model --beam_size 1
Evaluating results
You can evaluate the decoded results if you have target sentences as a reference.
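The exact evaluation command depends on the repository's scripts; as an illustration only, the following sketch computes sentence-level exact-match accuracy between a decoded file and a reference file (the reference file name is an example):

```python
# Sketch only (not a repository script): sentence-level exact-match accuracy
# between decoded output and reference target sentences.
with open("model/test.decode.txt", encoding="utf-8") as decoded_file, \
     open("test.target.txt", encoding="utf-8") as target_file:
    decoded = [line.rstrip("\n") for line in decoded_file]
    targets = [line.rstrip("\n") for line in target_file]

matches = sum(1 for d, t in zip(decoded, targets) if d == t)
print("sentence accuracy: {:.3f}".format(matches / len(targets)))
```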