50k English-Japanese Parallel Corpus for Machine Translation Benchmark.

odashi/small_parallel_enja

This directory includes a small parallel corpus for the English-Japanese translation task. The data are extracted from the TANAKA Corpus by restricting sentence length to 4 to 16 words.
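A length filter of this kind can be sketched as follows. This is only an illustration, not the script used to build the corpus: the function names are hypothetical, and both the inclusive bounds and the choice to apply them to both sides of each pair are assumptions, since the description above does not specify either.

```python
def within_length(tokens, min_len=4, max_len=16):
    """Return True if the token count lies in [min_len, max_len] (inclusive; an assumption)."""
    return min_len <= len(tokens) <= max_len


def filter_pairs(pairs, min_len=4, max_len=16):
    """Keep only sentence pairs whose two sides both pass the length check.

    `pairs` is an iterable of (source_tokens, target_tokens) tuples.
    Checking both sides is an assumption made for this sketch.
    """
    return [
        (src, trg)
        for src, trg in pairs
        if within_length(src, min_len, max_len) and within_length(trg, min_len, max_len)
    ]
```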

English sentences are tokenized using Stanford Tokenizer and lowercased. Japanese sentences are tokenized using KyTea.

All texts are encoded in UTF-8. The sentence separator is '\n' and the word separator is ' '.
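Given this format, the files can be read with plain string splitting. The following is a minimal sketch; the helper names `read_tokenized` and `read_parallel` are hypothetical and not part of this repository.

```python
from pathlib import Path


def read_tokenized(path):
    """Read one corpus file: UTF-8, one sentence per line, tokens separated by ' '."""
    text = Path(path).read_text(encoding="utf-8")
    # Split on '\n' only; str.splitlines() would also split on other line-break characters.
    lines = text.split("\n")
    # A trailing newline leaves one empty string at the end; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return [line.split(" ") for line in lines]


def read_parallel(src_path, trg_path):
    """Pair source and target sentences by line number."""
    src = read_tokenized(src_path)
    trg = read_tokenized(trg_path)
    if len(src) != len(trg):
        raise ValueError(f"line count mismatch: {len(src)} vs {len(trg)}")
    return list(zip(src, trg))
```

For example, `read_parallel("train.en", "train.ja")` would return a list of (English tokens, Japanese tokens) pairs, assuming both files are in the current directory.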

Attention: some English words are tokenized differently from Stanford Tokenizer output, e.g., "don't" -> "don" "'t", which may come from preprocessing errors. Please take care when using this dataset for token-level evaluation.
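One possible mitigation, shown here only as a heuristic sketch, is to glue clitic tokens back onto the preceding token before comparing hypotheses and references, so that "don" "'t" and "do" "n't" both become "don't". The function name `merge_clitics` is hypothetical, and the rule (any token longer than one character that starts with an apostrophe, plus "n't") is an assumption that may also merge tokens it should not, such as an opening quote fused to a word.

```python
def merge_clitics(tokens):
    """Glue clitic tokens such as "'t", "'s", or "n't" onto the preceding token.

    Heuristic only: a lone "'" is left untouched, and a clitic at the
    start of a sentence has nothing to attach to, so it stays as-is.
    """
    merged = []
    for tok in tokens:
        is_clitic = (tok.startswith("'") and len(tok) > 1) or tok == "n't"
        if merged and is_clitic:
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged
```

Because this changes the token sequence, scores computed after merging are not directly comparable to scores computed on the raw tokens.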

Corpus Statistics

| File | #sentences | #words | #vocabulary |
|---|---|---|---|
| train.en | 50,000 | 391,047 | 6,634 |
| - train.en.000 | 10,000 | 78,049 | 3,447 |
| - train.en.001 | 10,000 | 78,223 | 3,418 |
| - train.en.002 | 10,000 | 78,427 | 3,430 |
| - train.en.003 | 10,000 | 78,118 | 3,402 |
| - train.en.004 | 10,000 | 78,230 | 3,405 |
| train.ja | 50,000 | 565,618 | 8,774 |
| - train.ja.000 | 10,000 | 113,209 | 4,181 |
| - train.ja.001 | 10,000 | 112,852 | 4,102 |
| - train.ja.002 | 10,000 | 113,044 | 4,105 |
| - train.ja.003 | 10,000 | 113,346 | 4,183 |
| - train.ja.004 | 10,000 | 113,167 | 4,174 |
| dev.en | 500 | 3,931 | 816 |
| dev.ja | 500 | 5,668 | 894 |
| test.en | 500 | 3,998 | 839 |
| test.ja | 500 | 5,635 | 884 |
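Counts of this kind can be recomputed with a few lines of Python. The sketch below assumes that #words is the total number of space-separated tokens and #vocabulary is the number of distinct tokens; the exact counting rules behind the table are not stated here, so the results may not match it exactly. The function name `corpus_stats` is hypothetical.

```python
from collections import Counter


def corpus_stats(lines):
    """Return (#sentences, #words, #vocabulary) for an iterable of tokenized lines."""
    n_sentences = 0
    counts = Counter()
    for line in lines:
        # Drop the trailing newline, then split on the single-space word separator.
        tokens = line.rstrip("\n").split(" ")
        n_sentences += 1
        counts.update(tokens)
    n_words = sum(counts.values())
    n_vocab = len(counts)
    return n_sentences, n_words, n_vocab
```

An open file handle can be passed directly, for example `corpus_stats(open("train.en", encoding="utf-8"))`.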
