Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation


shyyhs/CourseraParallelCorpusMining


Overview

This repo accompanies our paper "Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation".

It contains both the dataset and the source code described in the paper.

Keywords: Japanese-English parallel dataset, educational domain machine translation, lectures translation, multistage fine-tuning

Dataset

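| Split | #lines | #docs | Description |
|-------|--------|-------|-------------|
| Test  | 2068   | 50    | Human-validated |
| Dev   | 555    | 16    | Human-validated |
| Train | 50543  | 818   | Automatically aligned, high quality |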

Table 1: English-Japanese parallel dataset in educational domain.

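| Split | #lines | #docs | Description |
|-------|--------|-------|-------------|
| Test  | 2009   | 90    | Human-validated |
| Dev   | 865    | 34    | Human-validated |
| Train | 40074  | 997   | Automatically aligned, high quality |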

Table 2: English-Chinese parallel dataset in educational domain.

The dataset contains high-quality English-Japanese parallel sentences and documents from Coursera. Please refer to our paper for details.

Update: We updated the English-Japanese dataset, which now contains more sentences, and added a new English-Chinese dataset.

Source code

The repo also contains the source code described in the paper:

  1. Crawling multilingual subtitle documents from Coursera using youtube-dl.
  2. Extracting the subtitle files of the desired language pair, then normalizing and cleaning the data.
  3. Using machine translation and sentence embeddings combined with dynamic programming (DP) to extract parallel sentence pairs from comparable document pairs.
  4. Multistage fine-tuning techniques that leverage out-of-domain and in-domain data to train an MT system for lecture-domain translation.
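Step 3 can be illustrated with a minimal sketch. It is not the code in this repo: it assumes sentence embeddings are already computed, scores pairs by cosine similarity only (the paper also uses machine translation), and restricts itself to a monotonic one-to-one alignment in which sentences on either side may be skipped. The function names and the `threshold` value are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(src_vecs, tgt_vecs, threshold=0.8):
    """Return monotonic (src_index, tgt_index, similarity) pairs maximizing total similarity.

    Pairs scoring below `threshold` (an assumed cut-off) are never aligned.
    """
    n, m = len(src_vecs), len(tgt_vecs)
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]

    # best[i][j]: best total similarity using the first i source and first j target sentences.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = sim[i - 1][j - 1]
            gain = s if s >= threshold else 0.0
            best[i][j] = max(best[i - 1][j],              # skip source sentence i-1
                             best[i][j - 1],              # skip target sentence j-1
                             best[i - 1][j - 1] + gain)   # align the two sentences

    # Backtrack from the bottom-right corner to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        s = sim[i - 1][j - 1]
        if s >= threshold and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1, s))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


# Toy example: the second target sentence has no good counterpart and is skipped.
src = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
tgt = [[1, 0, 0], [1, 1, 1], [0, 1, 0], [0, 0, 1]]
pairs = align(src, tgt)
```

The threshold trades recall for precision: raising it yields fewer but cleaner sentence pairs from each comparable document pair.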

Experiment results

| Training data | Ja->En | En->Ja | Zh->En | En->Zh |
|---------------|--------|--------|--------|--------|
| Coursera dataset only | 6.2 | 6.4 | 14.8 | 14.5 |
| Combined with OOD datasets | 27.5 | 18.5 | 29.5 | 29.1 |

Table 3: BLEU scores when training on the Coursera dataset only and when combining it with out-of-domain (OOD) datasets through multistage fine-tuning. The OOD datasets are ASPEC and TED Talks for Japanese-English, and News Commentary and TED Talks for Chinese-English. Please refer to our paper for details.
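For readers unfamiliar with the metric, the following is a simplified sketch of corpus-level BLEU. It assumes a single reference per sentence, whitespace tokenization, and no smoothing, so it is not the evaluation script behind the scores above. Japanese and Chinese output would need to be segmented into words before being passed in. The function names are hypothetical.

```python
from collections import Counter
from math import exp, log


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    # Geometric mean of the n-gram precisions.
    log_precision = sum(log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else exp(1 - ref_len / hyp_len)
    return 100 * bp * exp(log_precision)


refs = ["the lecture covers machine translation in detail"]
perfect = corpus_bleu(refs, refs)
shorter = corpus_bleu(["the lecture covers machine translation"], refs)
disjoint = corpus_bleu(["completely unrelated words here now"], refs)
```

An established tool is preferable for reporting results, since tokenization and smoothing choices change the score.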

Reference

Please cite our paper if you use our code or dataset:

@inproceedings{song-etal-2020-coursera,
    title = "{C}oursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation",
    author = "Song, Haiyue  and
      Dabre, Raj  and
      Fujita, Atsushi  and
      Kurohashi, Sadao",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.449",
    pages = "3640--3649",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

@article{Haiyue Song2024,
  title={Bilingual Corpus Mining and Multistage Fine-tuning for Improving Machine Translation of Lecture Transcripts},
  author={Haiyue Song and Raj Dabre and Chenhui Chu and Atsushi Fujita and Sadao Kurohashi},
  journal={Journal of Information Processing},
  volume={32},
  number={ },
  pages={628-640},
  year={2024},
  doi={10.2197/ipsjjip.32.628}
}

Contact

If you have any questions, please contact song@nlp.ist.i.kyoto-u.ac.jp
