Existing metrics for evaluating the quality of automatically generated questions, such as BLEU, ROUGE, BERTScore, and BLEURT, compare the reference and predicted questions, providing a high score when there is considerable lexical overlap or semantic similarity between the candidate and the reference questions. This approach has two major shortcomings. First, we need expensive human-provided reference questions. Second, it penalises valid questions that may not have high lexical or semantic similarity to the reference questions. In this paper, we propose a new metric, RQUGE, based on the answerability of the candidate question given the context. The metric consists of a question-answering module and a span scorer, using pre-trained models from the existing literature, so it can be used without any further training. We demonstrate that RQUGE has a higher correlation with human judgment without relying on the reference question. Additionally, RQUGE is shown to be more robust to several adversarial corruptions. Furthermore, we illustrate that we can significantly improve the performance of QA models on out-of-domain datasets by fine-tuning on synthetic data generated by a question generation model and reranked by RQUGE.
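To make the answerability idea concrete, the following is a minimal Python sketch of a reference-free scorer in the spirit of RQUGE: a pretrained QA model answers the candidate question from the context, and the predicted span is compared to the gold answer. The model name (deepset/roberta-base-squad2) and the token-overlap F1 used as the span scorer are illustrative assumptions, not the components used in the paper, which relies on a learned span scorer.

from transformers import pipeline

# Pretrained extractive QA model (assumed choice, not the one from the paper).
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")


def token_f1(pred: str, gold: str) -> float:
    """Simple token-overlap F1, standing in for a learned span scorer."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if not common:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)


def answerability_score(context: str, candidate_question: str, gold_answer: str) -> float:
    """Score a generated question by how well answering it recovers the gold answer."""
    predicted = qa(question=candidate_question, context=context)["answer"]
    return token_f1(predicted, gold_answer)


# Usage example: a well-formed question should recover the gold answer span.
context = "RQUGE was presented at Findings of ACL 2023 in Toronto."
print(answerability_score(context, "Where was RQUGE presented?", "Toronto"))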
Alireza Mohammadshahi, Thomas Scialom, Majid Yazdani, Pouya Yanki, Angela Fan, James Henderson, and Marzieh Saeidi. 2023. RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6845–6867, Toronto, Canada. Association for Computational Linguistics.
@inproceedings{mohammadshahi-etal-2023-rquge,
    title = "{RQUGE}: Reference-Free Metric for Evaluating Question Generation by Answering the Question",
    author = "Mohammadshahi, Alireza and Scialom, Thomas and Yazdani, Majid and Yanki, Pouya and Fan, Angela and Henderson, James and Saeidi, Marzieh",
    editor = "Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.428/",
    doi = "10.18653/v1/2023.findings-acl.428",
    pages = "6845--6867",
}