An ambiguous subtitles dataset for visual scene-aware machine translation
ku-nlp/VISA
VISA is a dataset of 40k Japanese-English parallel sentence pairs and their corresponding video clips, with the following key features:
- The parallel sentences are subtitles from movies and TV episodes
- The source subtitles are ambiguous, which means they have multiple possible translations with different meanings
- We divide the dataset into Polysemy and Omission according to the cause of ambiguity
Polysemy:
放せ! --> Let me go!
Omission:
銃を持ってる。 --> I have a gun.
Split | Train | Validation | Test |
---|---|---|---|
Polysemy | 18,666 | 1,000 | 1,000 |
Omission | 17,214 | 1,000 | 1,000 |
Combined | 35,880 | 2,000 | 2,000 |
The JSON files map each video clip to its parallel subtitle pair. Each entry has the following form:
video_file_name: { { "ja": Japanese_subtitle }, { "en": English_subtitle } }
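As a minimal sketch of reading this mapping in Python: the schema line above is not strict JSON, so the exact nesting in the released files is an assumption here. The helper accepts either a single object holding both `ja` and `en`, or a list of single-key objects. The function names and the sample clip names are hypothetical and used only for illustration.

```python
import json
from typing import Dict, Iterator, Tuple


def normalize_pair(entry) -> Dict[str, str]:
    """Flatten one mapping entry into {"ja": ..., "en": ...}.

    Assumption: an entry is either one dict with both keys, or a list
    of single-key dicts. Check the released files for the real layout.
    """
    parts = [entry] if isinstance(entry, dict) else list(entry)
    merged: Dict[str, str] = {}
    for part in parts:
        merged.update(part)
    return {"ja": merged["ja"], "en": merged["en"]}


def iter_pairs(mapping: dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (video_file_name, Japanese_subtitle, English_subtitle)."""
    for video_file_name, entry in mapping.items():
        pair = normalize_pair(entry)
        yield video_file_name, pair["ja"], pair["en"]


def load_pairs(json_path: str):
    """Read one mapping file from disk; the path is supplied by the caller."""
    with open(json_path, encoding="utf-8") as f:
        return list(iter_pairs(json.load(f)))


# In-memory example using the two subtitle pairs shown earlier.
# The clip names are made up for this sketch.
sample = {
    "clip_0001.mp4": {"ja": "放せ!", "en": "Let me go!"},
    "clip_0002.mp4": [{"ja": "銃を持ってる。"}, {"en": "I have a gun."}],
}
for name, ja, en in iter_pairs(sample):
    print(f"{name}: {ja} --> {en}")
```

`load_pairs` takes the path to whichever JSON file you want to read and returns a list of `(video, ja, en)` tuples.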
Please note that by downloading the dataset, you agree to the following conditions:
- Do not re-distribute the dataset without our permission.
- The dataset can only be used for research purposes. Any other use is explicitly prohibited.
If you are interested in the video features of VISA, you can download them from the following links:
- The I3D Features of VISA: http://lotus.kuee.kyoto-u.ac.jp/~yihang/dataset/VISA_i3d.zip
- The RCNN Features of VISA: http://lotus.kuee.kyoto-u.ac.jp/~yihang/dataset/VISA_rcnn.zip
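A rough sketch of fetching and unpacking one of these archives with the Python standard library is given below. The two URLs are the ones listed above; the function names and the default `visa_features` directory are hypothetical. The internal layout of the archives is not described here, so the sketch only extracts them as-is.

```python
import os
import urllib.request
import zipfile
from urllib.parse import urlparse

# URLs copied from the list above.
FEATURE_URLS = {
    "i3d": "http://lotus.kuee.kyoto-u.ac.jp/~yihang/dataset/VISA_i3d.zip",
    "rcnn": "http://lotus.kuee.kyoto-u.ac.jp/~yihang/dataset/VISA_rcnn.zip",
}


def archive_name(url: str) -> str:
    """Return the file name component of a URL, e.g. 'VISA_i3d.zip'."""
    return os.path.basename(urlparse(url).path)


def fetch_features(kind: str, dest_dir: str = "visa_features") -> str:
    """Download the chosen archive (if not already present) and extract it.

    `kind` is "i3d" or "rcnn"; `dest_dir` is a hypothetical default.
    """
    url = FEATURE_URLS[kind]
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, archive_name(url))
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)
    return dest_dir
```

Calling `fetch_features("i3d")` would download `VISA_i3d.zip` into `visa_features/` and unpack it there. The usage conditions stated above apply to these downloads as well.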
If you find this dataset helpful, please cite our publication "VISA: An Ambiguous Subtitles Dataset for Visual Scene-Aware Machine Translation":
@inproceedings{li-etal-2022-visa,
    title = "{VISA}: An Ambiguous Subtitles Dataset for Visual Scene-aware Machine Translation",
    author = "Li, Yihang and Shimizu, Shuichiro and Gu, Weiqi and Chu, Chenhui and Kurohashi, Sadao",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.725",
    pages = "6735--6743",
}
If you have any questions about this dataset, please contact liyh@nlp.ist.i.kyoto-u.ac.jp.