Small Japanese-English Subtitle Corpus


yusugomori/jesc_small


Small Japanese-English Subtitle Corpus. Sentences are extracted from JESC: Japanese-English Subtitle Corpus and filtered to a length of 4 to 16 words.
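
The length filter described above can be sketched roughly as follows. This is not the script used to build the corpus: the function names `within_length` and `filter_pairs` are hypothetical, and it assumes that the bounds are inclusive, that both sides of a pair must satisfy them, and that the input is already whitespace-tokenized.

```python
from typing import Iterable, Iterator, List, Tuple

MIN_WORDS = 4   # lower bound stated in this README
MAX_WORDS = 16  # upper bound stated in this README


def within_length(tokens: List[str], lo: int = MIN_WORDS, hi: int = MAX_WORDS) -> bool:
    """Return True if the token count lies in [lo, hi].

    Inclusive bounds are an assumption; the README only says "4 to 16 words".
    """
    return lo <= len(tokens) <= hi


def filter_pairs(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (english, japanese) pairs where both sides pass the length check.

    Requiring both sides to pass is an assumption, not something the README states.
    """
    for en, ja in pairs:
        if within_length(en.split()) and within_length(ja.split()):
            yield en, ja
```

If only one side needs to satisfy the bounds, the condition in `filter_pairs` would change accordingly.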

Both Japanese and English sentences are tokenized with StanfordNLP (v0.2.0).

All texts are encoded in UTF-8. The sentence separator is '\n' and the word separator is ' '.
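
Given that format, loading a file pair might look like the sketch below. The helper names `read_tokenized` and `read_parallel` are hypothetical, and the sketch assumes that line i of the English file aligns with line i of the Japanese file, which is the usual convention for parallel corpora but is not stated explicitly here.

```python
from pathlib import Path
from typing import List, Tuple


def read_tokenized(path) -> List[List[str]]:
    """Read one tokenized file: one sentence per line, tokens separated by ' '."""
    text = Path(path).read_text(encoding="utf-8")
    # Split on '\n' explicitly; str.splitlines() would also break on other
    # Unicode line boundaries that could occur inside subtitle text.
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty string left by a trailing newline
    return [line.split(" ") for line in lines]


def read_parallel(en_path, ja_path) -> List[Tuple[List[str], List[str]]]:
    """Pair up sentences from two files, assuming line-by-line alignment."""
    en = read_tokenized(en_path)
    ja = read_tokenized(ja_path)
    if len(en) != len(ja):
        raise ValueError(f"line count mismatch: {len(en)} vs {len(ja)}")
    return list(zip(en, ja))
```

For example, `read_parallel("train.en", "train.ja")` would return a list of `(english_tokens, japanese_tokens)` pairs, assuming both files are in the current directory.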

Additionally, all tokenized data can be downloaded from here.

Corpus statistics

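| File | #sentences | #words | #vocabulary |
| --- | --- | --- | --- |
| train.en | 100,000 | 809,353 | 29,682 |
| train.ja | 100,000 | 808,157 | 46,471 |
| dev.en | 1,000 | 8,025 | 1,827 |
| dev.ja | 1,000 | 8,163 | 2,340 |
| test.en | 1,000 | 8,057 | 1,805 |
| test.ja | 1,000 | 8,084 | 2,306 |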
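
Counts of this kind can be recomputed with a short script such as the sketch below. The function name `corpus_stats` is hypothetical, and it assumes that "#vocabulary" means the number of distinct tokens, with no lowercasing or frequency cutoff; the README does not say how the figures were produced, so the results may not match the table exactly.

```python
from collections import Counter
from typing import Iterable, Tuple


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (#sentences, #words, #vocabulary) for space-tokenized lines.

    Vocabulary is counted as distinct token strings (an assumption).
    Empty lines are skipped, so they do not count as sentences.
    """
    n_sentences = 0
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        n_sentences += 1
        counts.update(line.split(" "))
    n_words = sum(counts.values())
    return n_sentences, n_words, len(counts)


if __name__ == "__main__":
    # Tiny in-memory example: 2 sentences, 5 tokens, 4 distinct tokens.
    print(corpus_stats(["a b\n", "a c d\n"]))  # (2, 5, 4)
```

To run it on a corpus file, pass an open file handle, for example `corpus_stats(open("train.en", encoding="utf-8"))`, assuming the file is in the current directory.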

This repo is inspired by small_parallel_enja.
