AMI Meeting Parallel Corpus
tsuruoka-lab/AMI-Meeting-Parallel-Corpus
©2020, The University of Tokyo
The original AMI Meeting Corpus is a multi-modal dataset containing 100 hours of meeting recordings in English. The parallel version was constructed by asking professional translators to translate utterances from the original corpus into Japanese. Since the original corpus consists of speech transcripts, the English sentences contain many short utterances (e.g., "Yeah", "Okay") and fillers (e.g., "Um"), and these are translated into Japanese as well. As a result, the parallel corpus contains many duplicate sentences.
We provide training, development and evaluation splits from the AMI Meeting Parallel Corpus. In this repository we publicly share the full development and evaluation sets and a part of the training data set.
| | Training | Development | Evaluation |
|---|---|---|---|
| Sentences | 20,000 | 2,000 | 2,000 |
| Scenarios | 30 | 5 | 5 |
The corpus is stored in JSON format as a list of documents, each of which contains a list of sentence pairs. Each sentence pair has a sentence number, a speaker identifier (to distinguish different speakers), and the text in English and Japanese; each document records its original language (always English).
[{"id":"IS1004a","original_language":"en","conversation": [...,{"no":22,"speaker":"A","ja_sentence":"では、このプロジェクトの目的は、あー、新しいリモコンを作ることです。","en_sentence":"So, the goal of this project is to uh developed a new remote control."},...]},...]
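The structure above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the helper names `load_corpus` and `iter_sentence_pairs` are hypothetical, the file path passed to `load_corpus` is whatever split file you downloaded, and the inline sample reuses sentence 22 from the example above together with two invented short utterances (numbers 23 and 24) that only illustrate how duplicates can be counted.

```python
import json
from collections import Counter


def load_corpus(path):
    """Read one split file (path is supplied by the caller) and return its documents."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_sentence_pairs(documents):
    """Yield one flat dict per sentence pair, keeping the document id alongside it."""
    for doc in documents:
        for pair in doc["conversation"]:
            yield {
                "doc_id": doc["id"],
                "no": pair["no"],
                "speaker": pair["speaker"],
                "en": pair["en_sentence"],
                "ja": pair["ja_sentence"],
            }


# Inline sample in the documented format. Sentence 22 is taken from the README
# example; sentences 23 and 24 are invented for illustration only.
SAMPLE = json.loads("""
[{"id": "IS1004a", "original_language": "en", "conversation": [
  {"no": 22, "speaker": "A",
   "ja_sentence": "では、このプロジェクトの目的は、あー、新しいリモコンを作ることです。",
   "en_sentence": "So, the goal of this project is to uh developed a new remote control."},
  {"no": 23, "speaker": "B", "ja_sentence": "はい。", "en_sentence": "Yeah."},
  {"no": 24, "speaker": "C", "ja_sentence": "はい。", "en_sentence": "Yeah."}
]}]
""")

pairs = list(iter_sentence_pairs(SAMPLE))

# Count identical (English, Japanese) pairs to see how many duplicates there are.
counts = Counter((p["en"], p["ja"]) for p in pairs)
duplicates = {pair: n for pair, n in counts.items() if n > 1}

for p in pairs:
    print(p["doc_id"], p["no"], p["speaker"], p["en"], "|", p["ja"])
print("duplicate pairs:", duplicates)
```

To run this on a real split, replace `SAMPLE` with `load_corpus(...)` pointed at the downloaded file. Whether duplicates should be removed before training depends on the use case; because the corpus is document-aligned, dropping them also breaks the conversational context.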
Our dataset is released under the Creative Commons Attribution-ShareAlike (CC BY 4.0) license.
If you use this dataset, please cite the following paper: Matīss Rikters, Ryokan Ri, Tong Li, and Toshiaki Nakazawa (2020). "Document-aligned Japanese-English Conversation Parallel Corpus." In Proceedings of the Fifth Conference on Machine Translation, 2020.
@InProceedings{rikters-EtAl:2020:WMT,
  author    = {Rikters, Matīss and Ri, Ryokan and Li, Tong and Nakazawa, Toshiaki},
  title     = {Document-aligned Japanese-English Conversation Parallel Corpus},
  booktitle = {Proceedings of the Fifth Conference on Machine Translation},
  month     = {November},
  year      = {2020},
  address   = {Online},
  publisher = {Association for Computational Linguistics},
  pages     = {637--643},
  url       = {https://www.aclweb.org/anthology/2020.wmt-1.74}
}
This work was supported by "Research and Development of Deep Learning Technology for Advanced Multilingual Speech Translation", the Commissioned Research of National Institute of Information and Communications Technology (NICT), JAPAN.