MLSUM: The Multilingual Summarization Corpus

Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano


Abstract
We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages: French, German, Spanish, Russian, and Turkish. Together with English news articles from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multilingual dataset.
Anthology ID:
2020.emnlp-main.647
Volume:
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Month:
November
Year:
2020
Address:
Online
Editors:
Bonnie Webber, Trevor Cohn, Yulan He, Yang Liu
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
8051–8067
URL:
https://aclanthology.org/2020.emnlp-main.647/
DOI:
10.18653/v1/2020.emnlp-main.647
Cite (ACL):
Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2020. MLSUM: The Multilingual Summarization Corpus. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8051–8067, Online. Association for Computational Linguistics.
Cite (Informal):
MLSUM: The Multilingual Summarization Corpus (Scialom et al., EMNLP 2020)
PDF:
https://aclanthology.org/2020.emnlp-main.647.pdf
Video:
https://slideslive.com/38938723
Data:
MLSUM, BigPatent, CNN/Daily Mail, LCSTS, MLQA, NEWSROOM, New York Times Annotated Corpus, SNLI

