- Notifications
You must be signed in to change notification settings - Fork12
rpryzant/JESC
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Welcome to the JESC code release! This repo contains the crawlers, parsers, aligners, and various tools used to create the Japanese-English Subtitle Corpus (JESC).
Use pip:pip install -r requirements.txt
Additionally, some of the corpus_processing scripts make use ofgoogle/sentencepiece, which has installation instructions on its github page.
Each file is a standalone tool with usage instructions given in the comment header. These files are organized into the following categories (subdirectories):
corpus_generation: Scripts for downloading, parsing, and aligning subtitles from the internet.
corpus_cleaning: Scripts for converting file formats, thresholding on length ratios, and spellchecking.
corpus_processing: Scripts for manipulating completed datasets, including tokenization and train/test/dev splitting.
Please give the proper citation or credit if you use these data:
@ARTICLE{pryzant_jesc_2017, author = {{Pryzant}, R. and {Chung}, Y. and {Jurafsky}, D. and {Britz}, D.}, title = "{JESC: Japanese-English Subtitle Corpus}", journal = {ArXiv e-prints},archivePrefix = "arXiv", eprint = {1710.10639}, keywords = {Computer Science - Computation and Language}, year = 2017, month = oct,} ```
About
A large parallel corpus of English and Japanese
Resources
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Releases
Packages0
Uh oh!
There was an error while loading.Please reload this page.