We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages – namely, French, German, Spanish, Russian, Turkish. Together with English news articles from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset.
Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2020. MLSUM: The Multilingual Summarization Corpus. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8051–8067, Online. Association for Computational Linguistics.
@inproceedings{scialom-etal-2020-mlsum,
    title = "{MLSUM}: The Multilingual Summarization Corpus",
    author = "Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo",
    editor = "Webber, Bonnie and Cohn, Trevor and He, Yulan and Liu, Yang",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.647/",
    doi = "10.18653/v1/2020.emnlp-main.647",
    pages = "8051--8067",
    abstract = "We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages {--} namely, French, German, Spanish, Russian, Turkish. Together with English news articles from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset."
}
%0 Conference Proceedings
%T MLSUM: The Multilingual Summarization Corpus
%A Scialom, Thomas
%A Dray, Paul-Alexis
%A Lamprier, Sylvain
%A Piwowarski, Benjamin
%A Staiano, Jacopo
%Y Webber, Bonnie
%Y Cohn, Trevor
%Y He, Yulan
%Y Liu, Yang
%S Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
%D 2020
%8 November
%I Association for Computational Linguistics
%C Online
%F scialom-etal-2020-mlsum
%X We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages – namely, French, German, Spanish, Russian, Turkish. Together with English news articles from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset.
%R 10.18653/v1/2020.emnlp-main.647
%U https://aclanthology.org/2020.emnlp-main.647/
%U https://doi.org/10.18653/v1/2020.emnlp-main.647
%P 8051-8067