You probably want a modern GPU to train faster; the developer uses a p2.xlarge instance on AWS.
Preparing your data
Training data is a UTF-8 text file in which each line corresponds to a sentence. A sentence is segmented into words by space characters, and a word is a pair of target (i.e. Kanji, Hiragana or Katakana) and source (Hiragana), concatenated by a slash character.
私/わたし の/の 名前/なまえ は/は 中野/なかの です/です 。/。
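For illustration, the following minimal Python sketch (not part of the repository) parses one such line into (target, source) word pairs:

```python
# Sketch only: parse a training line of the form "target/source target/source ..."
line = "私/わたし の/の 名前/なまえ は/は 中野/なかの です/です 。/。"

pairs = []
for word in line.split(" "):
    target, source = word.split("/")
    pairs.append((target, source))

print(pairs)  # [('私', 'わたし'), ('の', 'の'), ('名前', 'なまえ'), ...]
```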
Test data should contain sentences that do not appear in the training data, but ideally from the same domain as the training data. The source file should contain source sentences without spaces.
きょうのてんきははれです。
The target file should contain target sentences without spaces.
今日の天気は晴れです。
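If your test sentences are already annotated in the training format, you can derive both files with a small script. The sketch below is only an illustration (it is not a repository script, and the output file names are examples):

```python
# Sketch only: derive plain source/target test files from sentences written
# in the "target/source" training format. Output file names are examples.
annotated = ["今日/きょう の/の 天気/てんき は/は 晴れ/はれ です/です 。/。"]

with open("test.source.txt", "w", encoding="utf-8") as src, \
     open("test.target.txt", "w", encoding="utf-8") as tgt:
    for sentence in annotated:
        words = [w.split("/") for w in sentence.split(" ")]
        src.write("".join(source for _, source in words) + "\n")
        tgt.write("".join(target for target, _ in words) + "\n")
```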
Pre-processing BCCWJ
The developer uses the human-annotated part of BCCWJ as the training and testing corpus. You can use the scripts in this repository to pre-process the XML files after they have been extracted from the compressed archive. For example, the following commands parse the data and split it into training, test source and test target files.
Training a model
Now you can train your own model with default parameters.
train.py train.txt model
See the help message for optional parameters such as the number of hidden units and the dropout probability.
train.py --help
Decoding sentences
Once you have trained your model, you can decode sentences with it.
decode.py model
Type a source sentence into your console, and it will show the decoded sentence like this:
きょうのてんきははれです。
今日の天気は晴れです。
きょじんにせんせい
巨人に先制
Alternatively, you can give file names for input and output.
decode.py model --input_file test.source.txt --output_file model/test.decode.txt
You can trade accuracy for decoding speed by tuning pruning parameters such as the beam size and Viterbi size. For example, the following option is faster than the default beam size of 5, but less accurate.
decode.py model --beam_size 1
Evaluating results
You can evaluate the decoded results if you have target sentences as a reference.
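The exact evaluation command depends on the repository's scripts; as an illustration only, the following sketch computes sentence-level exact-match accuracy between a decoded file and a reference file (the reference file name is an example):

```python
# Sketch only (not a repository script): sentence-level exact-match accuracy
# between decoded output and reference target sentences.
with open("model/test.decode.txt", encoding="utf-8") as decoded_file, \
     open("test.target.txt", encoding="utf-8") as target_file:
    decoded = [line.rstrip("\n") for line in decoded_file]
    targets = [line.rstrip("\n") for line in target_file]

matches = sum(1 for d, t in zip(decoded, targets) if d == t)
print("sentence accuracy: {:.3f}".format(matches / len(targets)))
```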