shyyhs/CourseraParallelCorpusMiningPublic

Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation

License

Apache-2.0 license

Overview

This repo accompanies our paper, Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation.

It contains both the dataset and all the source code used in the paper.

Keywords: Japanese-English parallel dataset, educational domain machine translation, lectures translation, multistage fine-tuning

Dataset

Split	#lines	#docs	Description
Test	2068	50	Human-validated
Dev	555	16	Human-validated
Train	50543	818	Automatically aligned, high quality

Table 1: English-Japanese parallel dataset in the educational domain.

Split	#lines	#docs	Description
Test	2009	90	Human-validated
Dev	865	34	Human-validated
Train	40074	997	Automatically aligned, high quality

Table 2: English-Chinese parallel dataset in the educational domain.

The dataset contains high-quality English-Japanese parallel sentences and documents from the Coursera site. Please refer to our paper for details.

Update: The English-Japanese dataset has been updated and now contains more sentences. We also added a new English-Chinese dataset.

Source code

The repo also contains the source code described in the paper:

Crawling multi-language subtitle documents from Coursera using youtube-dl.
Extracting the subtitle files of the desired language pair, then normalizing and cleaning the data.
Using machine translation and sentence embeddings combined with dynamic programming (DP) to extract parallel sentence pairs from comparable document pairs.
Multistage fine-tuning techniques that leverage out-of-domain and in-domain data to train an MT system for lecture translation.
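
The DP-based extraction step in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in src: the similarity matrix is assumed to be precomputed (for example, cosine similarity between sentence embeddings of the machine-translated source sentences and the target sentences), and the function name `align_monotonic`, the `min_sim` threshold, and the restriction to monotonic 1-to-1 pairs are all hypothetical choices made for the example.

```python
from typing import List, Tuple


def align_monotonic(sim: List[List[float]], min_sim: float = 0.5) -> List[Tuple[int, int]]:
    """Pick a monotonic set of 1-to-1 sentence pairs maximizing total similarity.

    sim[i][j] is a precomputed similarity between source sentence i and
    target sentence j (hypothetical input). Pairs scoring below min_sim
    are never aligned; unaligned sentences are simply skipped.
    """
    n = len(sim)
    m = len(sim[0]) if n else 0
    # best[i][j]: best total score using the first i source and first j target sentences.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            skip = max(best[i - 1][j], best[i][j - 1])
            s = sim[i - 1][j - 1]
            match = best[i - 1][j - 1] + s if s >= min_sim else float("-inf")
            best[i][j] = max(skip, match)
    # Backtrace from the bottom-right corner to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        s = sim[i - 1][j - 1]
        if s >= min_sim and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


if __name__ == "__main__":
    # Toy 3x3 similarity matrix; the values are made up for illustration.
    toy = [
        [0.9, 0.1, 0.2],
        [0.2, 0.3, 0.8],
        [0.1, 0.7, 0.1],
    ]
    print(align_monotonic(toy))
```

The monotonicity constraint reflects the assumption that subtitles of the same lecture appear in roughly the same order in both languages; the threshold lets low-scoring sentences drop out instead of being forced into a pair.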

Experiment results

Training data	Ja->En	En->Ja
Coursera dataset only	6.2	6.4
Combined with OOD datasets	27.5	18.5

Training data	Zh->En	En->Zh
Coursera dataset only	14.8	14.5
Combined with OOD datasets	29.5	29.1

Table 3: BLEU scores when training on the Coursera dataset only, and when combining it with out-of-domain (OOD) datasets through multistage fine-tuning. The OOD datasets are ASPEC and TED Talks for Japanese-English, and News Commentary and TED Talks for Chinese-English. Please refer to our paper for details.
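
For readers unfamiliar with the metric, the sketch below shows how a corpus-level BLEU score is computed in principle. It is a simplified illustration, not the scoring script used for the numbers above: it assumes a single reference per sentence, whitespace tokenization (Japanese and Chinese output would need word segmentation first), uniform weights over 1- to 4-grams, and no smoothing. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hyps: List[str], refs: List[str], max_n: int = 4) -> float:
    """Simplified corpus-level BLEU on a 0-100 scale (one reference per hypothesis)."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: each hypothesis n-gram counts at most as often as in the reference.
            matched[n - 1] += sum((h_counts & r_counts).values())
            total[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_prec = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_prec)
```

Scores from this sketch are not directly comparable to the table, since tokenization and smoothing choices change BLEU noticeably.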

Reference

Please cite our papers if you use our code or dataset:

@inproceedings{song-etal-2020-coursera,
    title = "{C}oursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation",
    author = "Song, Haiyue  and
      Dabre, Raj  and
      Fujita, Atsushi  and
      Kurohashi, Sadao",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.449",
    pages = "3640--3649",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

@article{Haiyue Song2024,
  title={Bilingual Corpus Mining and Multistage Fine-tuning for Improving Machine Translation of Lecture Transcripts},
  author={Haiyue Song and Raj Dabre and Chenhui Chu and Atsushi Fujita and Sadao Kurohashi},
  journal={Journal of Information Processing},
  volume={32},
  number={ },
  pages={628-640},
  year={2024},
  doi={10.2197/ipsjjip.32.628}
}

Contact

If you have any questions, please contact song@nlp.ist.i.kyoto-u.ac.jp
