cl-tohoku/PheMT
PheMT is a phenomenon-wise dataset designed for evaluating the robustness of Japanese-English machine translation systems. The dataset is based on the MTNT dataset [1], with additional annotations of four linguistic phenomena common in UGC: Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant. (COLING 2020)
See the paper for more information.
New!! Ready-to-use evaluation tools are now available! (Feb. 2021)
This repository contains the following.
```
.
├── README.md
├── mtnt_approp_annotated.tsv   # pre-filtered MTNT dataset with annotated appropriateness (See Appendix A)
├── proper
│   ├── proper.alignment        # translations of targeted expressions
│   ├── proper.en               # references
│   ├── proper.ja               # source sentences
│   └── proper.tsv
├── abbrev
│   ├── abbrev.alignment
│   ├── abbrev.en
│   ├── abbrev.norm.ja          # normalized source sentences
│   ├── abbrev.orig.ja          # original source sentences
│   └── abbrev.tsv
├── colloq
│   ├── colloq.alignment
│   ├── colloq.en
│   ├── colloq.norm.ja
│   ├── colloq.orig.ja
│   └── colloq.tsv
├── variant
│   ├── variant.alignment
│   ├── variant.en
│   ├── variant.norm.ja
│   ├── variant.orig.ja
│   └── variant.tsv
└── src
    └── calc_acc.py             # script for calculating translation accuracy
```
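For orientation, the files within each subset are expected to be line-aligned, so one evaluation instance can be assembled by reading the corresponding lines together. Below is a minimal sketch under that assumption; the `load_subset` helper is ours and not part of the repository.

```python
# Sketch (not part of the official tooling): read one phenomenon subset as
# line-aligned records. Assumes the .orig.ja / .norm.ja / .en / .alignment files
# of a subset are parallel line by line (the proper subset has a single proper.ja).
from pathlib import Path

def load_subset(directory: str, name: str):
    base = Path(directory)
    def lines(suffix: str):
        return (base / f"{name}{suffix}").read_text(encoding="utf-8").splitlines()
    return [
        {"orig": o, "norm": n, "reference": r, "alignment": a}
        for o, n, r, a in zip(lines(".orig.ja"), lines(".norm.ja"),
                              lines(".en"), lines(".alignment"))
    ]

examples = load_subset("colloq", "colloq")
print(examples[0])
```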
Please feed both the original and normalized versions of the source sentences to your model; the difference in an arbitrary metric between the two conditions serves as a robustness measure (a BLEU-based sketch follows below).
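For illustration, here is a minimal sketch of that difference-based measure using BLEU, assuming sacreBLEU is installed; the system-output file names are hypothetical.

```python
# Sketch of the metric-difference robustness measure, using BLEU via sacreBLEU.
# "colloq.orig.out" / "colloq.norm.out" are hypothetical system translations of
# colloq.orig.ja and colloq.norm.ja, respectively.
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

refs = read_lines("colloq/colloq.en")
hyp_orig = read_lines("colloq.orig.out")
hyp_norm = read_lines("colloq.norm.out")

bleu_orig = sacrebleu.corpus_bleu(hyp_orig, [refs]).score
bleu_norm = sacrebleu.corpus_bleu(hyp_norm, [refs]).score

# A larger drop on the original (noisy) input indicates lower robustness
# to the targeted phenomenon.
print(f"BLEU(norm) = {bleu_norm:.2f}, BLEU(orig) = {bleu_orig:.2f}, "
      f"drop = {bleu_norm - bleu_orig:.2f}")
```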
We also extracted the translations of the expressions presenting the targeted phenomena. We recommend using src/calc_acc.py to measure the effect of each phenomenon more directly, with the help of translation accuracy.
USAGE: `python calc_acc.py system_output {proper, abbrev, colloq, variant}.alignment`
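For intuition, accuracy here can be thought of as checking whether the aligned translation of the targeted expression appears in the corresponding output line. The simplified sketch below follows that idea under the assumption that the .alignment file holds one translated expression per line; it may differ from the bundled src/calc_acc.py in details such as tokenization and case handling.

```python
# Simplified sketch of phenomenon-wise translation accuracy: count how often the
# aligned translation of the targeted expression appears in the system output.
# Assumption: the .alignment file is line-aligned with the system output.
import sys

def calc_accuracy(output_path: str, alignment_path: str) -> float:
    with open(output_path, encoding="utf-8") as f_out, \
         open(alignment_path, encoding="utf-8") as f_align:
        hits = total = 0
        for hyp, gold in zip(f_out, f_align):
            total += 1
            if gold.strip().lower() in hyp.strip().lower():
                hits += 1
    return hits / total if total else 0.0

if __name__ == "__main__":
    # e.g. python accuracy_sketch.py system_output colloq/colloq.alignment
    print(f"accuracy: {calc_accuracy(sys.argv[1], sys.argv[2]):.4f}")
```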
- Statistics
Dataset | # sent. | # unique expressions (ratio) | average edit distance
---|---|---|---
Proper Noun | 943 | 747 (79.2%) | (no normalized version)
Abbreviated Noun | 348 | 234 (67.2%) | 5.04
Colloquial Expression | 172 | 153 (89.0%) | 1.77
Variant | 103 | 97 (94.2%) | 3.42
- Examples
- Abbreviated Noun
  - original source: 地味なアプデ (apude, meaning update) だが
  - normalized source: 地味なアップデート (update) だが
  - reference: That’s a plain update though
  - alignment: update
- Colloquial Expression
  - original source: ここまで描いて飽きた、かなちい (kanachii, meaning sad)
  - normalized source: ここまで描いて飽きた、かなしい (kanashii)
  - reference: Drawing this much then getting bored, how sad.
  - alignment: sad
If you use our dataset for your research, please cite the following paper:
```bibtex
@inproceedings{fujii-etal-2020-phemt,
    title = "{P}he{MT}: A Phenomenon-wise Dataset for Machine Translation Robustness on User-Generated Contents",
    author = "Fujii, Ryo and Mita, Masato and Abe, Kaori and Hanawa, Kazuaki and Morishita, Makoto and Suzuki, Jun and Inui, Kentaro",
    booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
    month = dec,
    year = "2020",
    address = "Barcelona, Spain (Online)",
    publisher = "International Committee on Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.coling-main.521",
    pages = "5929--5943",
}
```
[1] Michel and Neubig (2018), MTNT: A Testbed for Machine Translation of Noisy Text.