yusugomori/jesc_small
Small Japanese-English Subtitle Corpus. Sentences are extracted from JESC: Japanese-English Subtitle Corpus, and filtered to those with a length of 4 to 16 words.
Both Japanese and English sentences are tokenized with StanfordNLP (v0.2.0).
All texts are encoded in UTF-8. The sentence separator is '\n' and the word separator is ' '.
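Given the format described above, a minimal loading sketch in Python might look like the following. The function names `read_tokenized` and `read_parallel` are hypothetical, and the sketch assumes that line i of an `.en` file is the translation of line i of the matching `.ja` file, which is the usual convention for parallel corpora but is not stated explicitly here.

```python
from pathlib import Path
from typing import List, Tuple


def read_tokenized(path: str) -> List[List[str]]:
    """Read one tokenized file: one sentence per line, tokens split on ' '."""
    text = Path(path).read_text(encoding="utf-8")
    # Split on '\n' and skip the empty string left by a trailing newline.
    return [line.split(" ") for line in text.split("\n") if line]


def read_parallel(en_path: str, ja_path: str) -> List[Tuple[List[str], List[str]]]:
    """Pair English and Japanese sentences by line number (assumed alignment)."""
    en = read_tokenized(en_path)
    ja = read_tokenized(ja_path)
    if len(en) != len(ja):
        raise ValueError(f"line count mismatch: {len(en)} vs {len(ja)}")
    return list(zip(en, ja))
```

For example, `read_parallel("train.en", "train.ja")` would return a list of (English tokens, Japanese tokens) pairs, assuming both files are in the current directory.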
Additionally, all tokenized data can be downloaded from here.
File | #sentences | #words | #vocabulary |
---|---|---|---|
train.en | 100,000 | 809,353 | 29,682 |
train.ja | 100,000 | 808,157 | 46,471 |
dev.en | 1,000 | 8,025 | 1,827 |
dev.ja | 1,000 | 8,163 | 2,340 |
test.en | 1,000 | 8,057 | 1,805 |
test.ja | 1,000 | 8,084 | 2,306 |
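The statistics in the table can be recomputed with a short script. The sketch below is an assumption about how such counts could be obtained, not necessarily how the table was produced: it treats every space-separated token as a word and counts distinct token strings as the vocabulary. The function name `corpus_stats` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (#sentences, #words, #vocabulary) for an iterable of lines."""
    n_sentences = 0
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore empty lines, e.g. after a trailing newline
        n_sentences += 1
        counts.update(line.split(" "))
    n_words = sum(counts.values())
    # Assumption: vocabulary size is the number of distinct token strings.
    return n_sentences, n_words, len(counts)
```

To apply it to a file, open the file with `encoding="utf-8"` and pass the file object directly, for instance `corpus_stats(open("train.en", encoding="utf-8"))`.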
This repo is inspired by small_parallel_enja.