This directory includes a giant Japanese-English subtitle corpus. The raw data comes from Stanford's JESC project.


DayuanJiang/giant_ja-en_parallel_corpus



Data Example

# test.ja
顔面 パンチ かい ?
お姉ちゃん 、 何で ?
もしくは 実際 の 私 の 要求 を 満たす こと も かのう でしょ う 。
分かっ た 、 リジー 。
夫 を 自分 で 、 けがす こと に なり ます 。
あの 、 それ くらい に 、 し て おい て くれ ない ?
お 掛け 下さい 。

# test.en
so face punch , huh ?
lisa , no !
or you could actually meet my need .
me ! ok , lizzy .
my husband would defile himself .
hey , can you leave it at that ?
we can sit in here .
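
The example above suggests that the two files are aligned line by line and that tokens are separated by single spaces. Under that assumption, a minimal sketch for loading one split as token pairs could look like this. The function name `read_parallel` is hypothetical and not part of this repository.

```python
def read_parallel(ja_path, en_path):
    """Read two line-aligned files and return (ja_tokens, en_tokens) pairs.

    Assumes line i of the Japanese file translates line i of the English
    file, and that tokens are already separated by spaces.
    """
    with open(ja_path, encoding="utf-8") as f_ja:
        ja_lines = [line.rstrip("\n") for line in f_ja]
    with open(en_path, encoding="utf-8") as f_en:
        en_lines = [line.rstrip("\n") for line in f_en]

    # A length mismatch means the files are not aligned as assumed.
    if len(ja_lines) != len(en_lines):
        raise ValueError("files have different numbers of lines")

    return [(ja.split(), en.split()) for ja, en in zip(ja_lines, en_lines)]
```

For instance, `read_parallel("test.ja", "test.en")` would return one tuple of token lists per phrase pair.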

Contents

  • A large corpus of roughly 2.8 million sentence pairs.
  • Translations of casual language, colloquialisms, expository writing, and narrative discourse. These domains are rarely covered by existing JA-EN machine translation data.

Modifications

Several preprocessing steps have been applied to make the dataset easier to use.

Overall:

  • Delete pairs whose Japanese phrase consists of only one word.
  • Split the data into train/dev/test sets with the following sizes:
    • train: 2,795,067 phrase pairs
    • dev: 2,800 phrase pairs
    • test: 2,800 phrase pairs
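
The two steps above can be sketched as follows. This is only an illustration under stated assumptions: the README does not say how "one word" was counted or how the split was drawn, so the sketch counts space-separated tokens and uses a seeded random shuffle. The function name `filter_and_split` and the `seed` parameter are hypothetical.

```python
import random


def filter_and_split(pairs, dev_size=2800, test_size=2800, seed=0):
    """Drop pairs with a one-word Japanese side, then split the rest.

    `pairs` is a list of (japanese, english) strings whose tokens are
    separated by spaces. Random shuffling is an assumption; the original
    split procedure is not documented.
    """
    # Keep only pairs whose Japanese phrase has more than one token.
    kept = [(ja, en) for ja, en in pairs if len(ja.split()) > 1]

    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(kept)

    dev = kept[:dev_size]
    test = kept[dev_size:dev_size + test_size]
    train = kept[dev_size + test_size:]
    return train, dev, test
```

With the default sizes, everything left after taking 2,800 dev pairs and 2,800 test pairs goes to the training set.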

For English text:

  • Add ‘.’ to the end of an English phrase if it does not end with punctuation.
  • Tokenize text with nltk.
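
A rough sketch of the English steps is shown below. To keep it free of dependencies, it uses a small regex tokenizer as a stand-in for nltk; the stand-in will not match nltk's output in every case (contractions, for example). Passing `nltk.word_tokenize` as the `tokenize` argument should be closer to what was done here, though the exact nltk call used is not stated. The names `simple_tokenize` and `normalize_english` are hypothetical.

```python
import re
import string


def simple_tokenize(text):
    """Stand-in tokenizer: split into word runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def normalize_english(phrase, tokenize=simple_tokenize):
    """Append '.' if the phrase lacks final punctuation, then tokenize."""
    phrase = phrase.strip()
    # Treating any ASCII punctuation mark as "ends with punctuation"
    # is an assumption; the original rule is not spelled out.
    if phrase and phrase[-1] not in string.punctuation:
        phrase += "."
    return " ".join(tokenize(phrase))
```

For example, `normalize_english("we can sit in here")` gives `we can sit in here .`, which matches the format of the data example.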

For Japanese text:

  • Add ‘。’ to the end of a Japanese phrase if it does not end with punctuation.
  • Replace spaces inside the phrase with ‘、’.
  • Tokenize text with the MeCab tokenizer and the mecab-ipadic-neologd dictionary.
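
The first two Japanese steps can be sketched in plain Python as below. The set of characters counted as final punctuation is an assumption, and the name `normalize_japanese` is hypothetical. The MeCab tokenization step is left out because it needs MeCab and the mecab-ipadic-neologd dictionary to be installed.

```python
import re

# Characters treated as sentence-final punctuation (an assumed set).
JA_FINAL_PUNCT = "。、？！?!.…"


def normalize_japanese(phrase):
    """Replace inner spaces with '、' and append '。' if needed."""
    phrase = phrase.strip()
    # \s also matches the full-width (ideographic) space U+3000.
    phrase = re.sub(r"\s+", "、", phrase)
    if phrase and phrase[-1] not in JA_FINAL_PUNCT:
        phrase += "。"
    return phrase
```

For example, `normalize_japanese("お掛け下さい")` gives `お掛け下さい。`, and a phrase that already ends with `?` is left without an added `。`.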
