Recent advances in commonsense reasoning depend on large-scale human-annotated training sets to achieve peak performance. However, manual curation of training sets is expensive and has been shown to introduce annotation artifacts that neural models can readily exploit and overfit to. We propose a novel generative data augmentation technique, G-DAUG^C, that aims to achieve more accurate and robust learning in a low-resource setting. Our approach generates synthetic examples using pretrained language models and selects the most informative and diverse set of examples for data augmentation. In experiments with multiple commonsense reasoning benchmarks, G-DAUG^C consistently outperforms existing data augmentation methods based on back-translation, establishing a new state of the art on WinoGrande, CODAH, and CommonsenseQA. It also enhances out-of-distribution generalization, proving robust against adversarial and perturbed examples. Our analysis demonstrates that G-DAUG^C produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.
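The pipeline the abstract describes (generate synthetic examples with a pretrained language model, then keep only an informative and diverse subset) can be illustrated with a minimal sketch. The sketch below is not the authors' implementation: the off-the-shelf GPT-2 model, the sampling settings, and the token-overlap diversity filter are all assumptions for illustration; the paper itself fine-tunes generators on task data and selects examples with influence- and diversity-based heuristics.

# Minimal sketch of a G-DAUG^C-style generate-then-select loop.
# Assumptions: vanilla "gpt2" as the generator, nucleus sampling, and a
# crude token-overlap filter standing in for the paper's selection methods.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Q: Where would you put a plant so it gets enough light? A:"
inputs = tokenizer(prompt, return_tensors="pt")

# Step 1: sample several candidate synthetic answers.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_p=0.9,
        max_new_tokens=15,
        num_return_sequences=8,
        pad_token_id=tokenizer.eos_token_id,
    )
prompt_len = inputs["input_ids"].shape[1]
candidates = [
    tokenizer.decode(seq[prompt_len:], skip_special_tokens=True).strip()
    for seq in outputs
]

# Step 2: greedy diversity selection. Keep a candidate only if it shares
# less than half of its tokens with every candidate already kept.
selected = []
for cand in candidates:
    tokens = set(cand.lower().split())
    if not tokens:
        continue
    if all(len(tokens & set(s.lower().split())) / len(tokens) < 0.5
           for s in selected):
        selected.append(cand)

print(selected)

In use, the selected examples would be paired with distractor answers and mixed into the task's training set; the greedy filter above shows only the shape of the selection step, not the paper's actual scoring.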
Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha Swayamdipta, Ronan Le Bras, Ji-Ping Wang, Chandra Bhagavatula, Yejin Choi, and Doug Downey. 2020. Generative Data Augmentation for Commonsense Reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1008–1025, Online. Association for Computational Linguistics.
@inproceedings{yang-etal-2020-generative,
    title = "Generative Data Augmentation for Commonsense Reasoning",
    author = "Yang, Yiben  and
      Malaviya, Chaitanya  and
      Fernandez, Jared  and
      Swayamdipta, Swabha  and
      Le Bras, Ronan  and
      Wang, Ji-Ping  and
      Bhagavatula, Chandra  and
      Choi, Yejin  and
      Downey, Doug",
    editor = "Cohn, Trevor  and
      He, Yulan  and
      Liu, Yang",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.findings-emnlp.90/",
    doi = "10.18653/v1/2020.findings-emnlp.90",
    pages = "1008--1025",
    abstract = "Recent advances in commonsense reasoning depend on large-scale human-annotated training sets to achieve peak performance. However, manual curation of training sets is expensive and has been shown to introduce annotation artifacts that neural models can readily exploit and overfit to. We propose a novel generative data augmentation technique, G-DAUG{\textasciicircum}C, that aims to achieve more accurate and robust learning in a low-resource setting. Our approach generates synthetic examples using pretrained language models and selects the most informative and diverse set of examples for data augmentation. On experiments with multiple commonsense reasoning benchmarks, G-DAUG{\textasciicircum}C consistently outperforms existing data augmentation methods based on back-translation, establishing a new state-of-the-art on WinoGrande, CODAH, and CommonsenseQA, as well as enhances out-of-distribution generalization, proving to be robust against adversaries or perturbations. Our analysis demonstrates that G-DAUG{\textasciicircum}C produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="yang-etal-2020-generative">
    <titleInfo>
        <title>Generative Data Augmentation for Commonsense Reasoning</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Yiben</namePart>
        <namePart type="family">Yang</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Chaitanya</namePart>
        <namePart type="family">Malaviya</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Jared</namePart>
        <namePart type="family">Fernandez</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Swabha</namePart>
        <namePart type="family">Swayamdipta</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Ronan</namePart>
        <namePart type="family">Le Bras</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Ji-Ping</namePart>
        <namePart type="family">Wang</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Chandra</namePart>
        <namePart type="family">Bhagavatula</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Yejin</namePart>
        <namePart type="family">Choi</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Doug</namePart>
        <namePart type="family">Downey</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2020-11</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Findings of the Association for Computational Linguistics: EMNLP 2020</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Trevor</namePart>
            <namePart type="family">Cohn</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Yulan</namePart>
            <namePart type="family">He</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Yang</namePart>
            <namePart type="family">Liu</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Online</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Recent advances in commonsense reasoning depend on large-scale human-annotated training sets to achieve peak performance. However, manual curation of training sets is expensive and has been shown to introduce annotation artifacts that neural models can readily exploit and overfit to. We propose a novel generative data augmentation technique, G-DAUG^C, that aims to achieve more accurate and robust learning in a low-resource setting. Our approach generates synthetic examples using pretrained language models and selects the most informative and diverse set of examples for data augmentation. On experiments with multiple commonsense reasoning benchmarks, G-DAUG^C consistently outperforms existing data augmentation methods based on back-translation, establishing a new state-of-the-art on WinoGrande, CODAH, and CommonsenseQA, as well as enhances out-of-distribution generalization, proving to be robust against adversaries or perturbations. Our analysis demonstrates that G-DAUG^C produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.</abstract>
    <identifier type="citekey">yang-etal-2020-generative</identifier>
    <identifier type="doi">10.18653/v1/2020.findings-emnlp.90</identifier>
    <location>
        <url>https://aclanthology.org/2020.findings-emnlp.90/</url>
    </location>
    <part>
        <date>2020-11</date>
        <extent unit="page">
            <start>1008</start>
            <end>1025</end>
        </extent>
    </part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Generative Data Augmentation for Commonsense Reasoning
%A Yang, Yiben
%A Malaviya, Chaitanya
%A Fernandez, Jared
%A Swayamdipta, Swabha
%A Le Bras, Ronan
%A Wang, Ji-Ping
%A Bhagavatula, Chandra
%A Choi, Yejin
%A Downey, Doug
%Y Cohn, Trevor
%Y He, Yulan
%Y Liu, Yang
%S Findings of the Association for Computational Linguistics: EMNLP 2020
%D 2020
%8 November
%I Association for Computational Linguistics
%C Online
%F yang-etal-2020-generative
%X Recent advances in commonsense reasoning depend on large-scale human-annotated training sets to achieve peak performance. However, manual curation of training sets is expensive and has been shown to introduce annotation artifacts that neural models can readily exploit and overfit to. We propose a novel generative data augmentation technique, G-DAUG^C, that aims to achieve more accurate and robust learning in a low-resource setting. Our approach generates synthetic examples using pretrained language models and selects the most informative and diverse set of examples for data augmentation. On experiments with multiple commonsense reasoning benchmarks, G-DAUG^C consistently outperforms existing data augmentation methods based on back-translation, establishing a new state-of-the-art on WinoGrande, CODAH, and CommonsenseQA, as well as enhances out-of-distribution generalization, proving to be robust against adversaries or perturbations. Our analysis demonstrates that G-DAUG^C produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.
%R 10.18653/v1/2020.findings-emnlp.90
%U https://aclanthology.org/2020.findings-emnlp.90/
%U https://doi.org/10.18653/v1/2020.findings-emnlp.90
%P 1008-1025