odashi/small_parallel_enja
This directory includes a small parallel corpus for the English-Japanese translation task. The data are extracted from the TANAKA Corpus by keeping only sentences that are 4 to 16 words long.
English sentences are tokenized using Stanford Tokenizer and lowercased. Japanese sentences are tokenized using KyTea.
All texts are encoded in UTF-8. The sentence separator is `'\n'` and the word separator is `' '`.
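The format described above (UTF-8, one sentence per line, tokens separated by a single space) can be read with a few lines of Python. The following is a minimal sketch, not part of the repository: the function names `read_tokenized` and `read_parallel` are hypothetical, and the code assumes that line *i* of the English file is the translation of line *i* of the Japanese file.

```python
from pathlib import Path
from typing import List, Tuple


def read_tokenized(path: str) -> List[List[str]]:
    """Read one sentence per line and split each line into tokens on ' '."""
    text = Path(path).read_text(encoding="utf-8")
    lines = text.split("\n")
    # A file ending in '\n' yields one empty trailing element; drop only that
    # one so that line numbers stay aligned between the two languages.
    if lines and lines[-1] == "":
        lines.pop()
    return [line.split(" ") for line in lines]


def read_parallel(en_path: str, ja_path: str) -> List[Tuple[List[str], List[str]]]:
    """Pair English and Japanese sentences by line number (assumed alignment)."""
    en = read_tokenized(en_path)
    ja = read_tokenized(ja_path)
    if len(en) != len(ja):
        raise ValueError(f"line count mismatch: {len(en)} vs {len(ja)}")
    return list(zip(en, ja))
```

For example, `read_parallel("train.en", "train.ja")` would load the training pairs, assuming both files are in the working directory.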
Attention: some English words are tokenized differently from what Stanford Tokenizer would produce, e.g., "don't" -> "don" "'t", which may have come from preprocessing errors. Please take care when using this dataset for token-level evaluation.
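To gauge how often this affects a given file, one option is to look for the pattern from the example above. The sketch below is an assumption-laden illustration: the helper name `find_split_contractions` is hypothetical, and it only detects a standalone `'t` token, the single case this README mentions. Other irregular tokenizations, if any exist, would go unnoticed.

```python
from typing import List


def find_split_contractions(tokens: List[str]) -> List[int]:
    """Return indices i where tokens[i + 1] is the standalone token "'t".

    This matches the "don" "'t" pattern only; it is an assumption that
    no other irregular pattern needs to be checked.
    """
    return [i for i in range(len(tokens) - 1) if tokens[i + 1] == "'t"]


# Example on an already tokenized, lowercased sentence.
sentence = "i don 't know .".split(" ")
hits = find_split_contractions(sentence)
```

Here `hits` holds the position of the token preceding each `'t`, so the affected sentences can be inspected or excluded before a token-level evaluation.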
File | #sentences | #words | #vocabulary |
---|---|---|---|
train.en | 50,000 | 391,047 | 6,634 |
- train.en.000 | 10,000 | 78,049 | 3,447 |
- train.en.001 | 10,000 | 78,223 | 3,418 |
- train.en.002 | 10,000 | 78,427 | 3,430 |
- train.en.003 | 10,000 | 78,118 | 3,402 |
- train.en.004 | 10,000 | 78,230 | 3,405 |
train.ja | 50,000 | 565,618 | 8,774 |
- train.ja.000 | 10,000 | 113,209 | 4,181 |
- train.ja.001 | 10,000 | 112,852 | 4,102 |
- train.ja.002 | 10,000 | 113,044 | 4,105 |
- train.ja.003 | 10,000 | 113,346 | 4,183 |
- train.ja.004 | 10,000 | 113,167 | 4,174 |
dev.en | 500 | 3,931 | 816 |
dev.ja | 500 | 5,668 | 894 |
test.en | 500 | 3,998 | 839 |
test.ja | 500 | 5,635 | 884 |
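The three columns in the table can be recomputed from any of the files. The following is a rough sketch with a hypothetical function name, `corpus_stats`; it assumes that `#words` counts all tokens and that `#vocabulary` counts distinct token types, which the README does not state explicitly, so the results may not match the table exactly.

```python
from typing import Tuple


def corpus_stats(path: str) -> Tuple[int, int, int]:
    """Return (#sentences, #words, #vocabulary) for one tokenized file.

    Assumes one sentence per line and ' ' as the word separator;
    vocabulary is taken to mean the number of distinct tokens.
    """
    sentences = 0
    words = 0
    vocabulary = set()
    # newline="\n" makes Python split lines on '\n' only.
    with open(path, encoding="utf-8", newline="\n") as f:
        for line in f:
            tokens = [t for t in line.rstrip("\n").split(" ") if t]
            sentences += 1
            words += len(tokens)
            vocabulary.update(tokens)
    return sentences, words, len(vocabulary)
```

Calling `corpus_stats("dev.en")`, with the file in the working directory, gives one row to compare against the table.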
About
50k English-Japanese Parallel Corpus for Machine Translation Benchmark.