A large parallel corpus of English and Japanese

rpryzant/JESC

Welcome to the JESC code release! This repo contains the crawlers, parsers, aligners, and various tools used to create the Japanese-English Subtitle Corpus (JESC).

Requirements

Use pip:

    pip install -r requirements.txt

Additionally, some of the corpus_processing scripts make use of google/sentencepiece, which has installation instructions on its GitHub page.

Instructions

Each file is a standalone tool with usage instructions given in the comment header. These files are organized into the following categories (subdirectories):

  • corpus_generation: Scripts for downloading, parsing, and aligning subtitles from the internet.

  • corpus_cleaning: Scripts for converting file formats, thresholding on length ratios, and spellchecking.

  • corpus_processing: Scripts for manipulating completed datasets, including tokenization and train/test/dev splitting.
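As a rough illustration of two steps named in the list above (thresholding on length ratios and train/test/dev splitting), here is a minimal Python sketch. It is not the repository's code: the function names `within_length_ratio` and `split_corpus`, the character-based length measure, and the default `max_ratio` of 4.0 are all assumptions made for this example. See the comment header of each script for the actual behavior and options.

```python
import random


def within_length_ratio(en, ja, max_ratio=4.0):
    """Return True if the longer sentence is at most max_ratio times the shorter.

    Lengths are measured in characters (an assumption made for this sketch;
    a real pipeline might compare token counts instead).
    """
    en_len, ja_len = len(en.strip()), len(ja.strip())
    if en_len == 0 or ja_len == 0:
        return False  # drop pairs with an empty side
    return max(en_len, ja_len) / min(en_len, ja_len) <= max_ratio


def split_corpus(pairs, dev_size, test_size, seed=0):
    """Shuffle sentence pairs reproducibly and cut them into train/dev/test."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return train, dev, test


if __name__ == "__main__":
    # Hypothetical sentence pairs, used only to exercise the functions.
    corpus = [
        ("hello there", "こんにちは"),
        ("a", "これはとても長い文です"),
        ("thank you very much", "どうもありがとう"),
    ]
    kept = [pair for pair in corpus if within_length_ratio(*pair)]
    train, dev, test = split_corpus(kept, dev_size=1, test_size=0)
    print(len(kept), len(train), len(dev), len(test))
```

Seeding a private `random.Random` instance keeps the split reproducible without touching the global random state, so the same seed yields the same split on each run.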

Citation

Please give the proper citation or credit if you use these data:

@ARTICLE{pryzant_jesc_2017,
   author = {{Pryzant}, R. and {Chung}, Y. and {Jurafsky}, D. and {Britz}, D.},
    title = "{JESC: Japanese-English Subtitle Corpus}",
  journal = {ArXiv e-prints},
archivePrefix = "arXiv",
   eprint = {1710.10639},
 keywords = {Computer Science - Computation and Language},
     year = 2017,
    month = oct,
}
