We present the Multilingual TEDx corpus, built to support speech recognition (ASR) and speech translation (ST) research across many non-English source languages. The corpus is a collection of audio recordings from TEDx talks in 8 source languages. We segment transcripts into sentences and align them to the source-language audio and target-language translations. The corpus is released along with open-sourced code enabling extension to new talks and languages as they become available. Our corpus creation methodology can be applied to more languages than previous work, and creates multi-way parallel evaluation sets. We provide baselines in multiple ASR and ST settings, including multilingual models to improve translation performance for low-resource language pairs.
@inproceedings{salesky21_interspeech,
  title = {The Multilingual TEDx Corpus for Speech Recognition and Translation},
  author = {Elizabeth Salesky and Matthew Wiesner and Jacob Bremerman and Roldano Cattoni and Matteo Negri and Marco Turchi and Douglas W. Oard and Matt Post},
  year = {2021},
  booktitle = {Interspeech 2021},
  pages = {3655--3659},
  doi = {10.21437/Interspeech.2021-11},
  issn = {2958-1796},
}
Cite as: Salesky, E., Wiesner, M., Bremerman, J., Cattoni, R., Negri, M., Turchi, M., Oard, D.W., Post, M. (2021) The Multilingual TEDx Corpus for Speech Recognition and Translation. Proc. Interspeech 2021, 3655-3659, doi: 10.21437/Interspeech.2021-11