Previous work on end-to-end translation from speech has primarily used frame-level features as speech representations, which creates longer, sparser sequences than text. We show that a naive method to create compressed phoneme-like speech representations is far more effective and efficient for translation than traditional frame-level speech features. Specifically, we generate phoneme labels for speech frames and average consecutive frames with the same label to create shorter, higher-level source sequences for translation. We see improvements of up to 5 BLEU on both our high and low resource language pairs, with a reduction in training time of 60%. Our improvements hold across multiple data sizes and two language pairs.
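The compression step described in the abstract (averaging consecutive frames that share a phoneme label) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `average_by_phoneme`, the list-of-lists feature layout, and the toy labels are all assumptions, and the labels are taken as given because the abstract does not say how they are produced.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def average_by_phoneme(
    frames: Sequence[Sequence[float]],
    labels: Sequence[str],
) -> Tuple[List[List[float]], List[str]]:
    """Collapse each run of identically labelled frames into its mean vector.

    `frames` holds one feature vector per speech frame and `labels` holds one
    (hypothetical) phoneme label per frame, in the same order.
    """
    if len(frames) != len(labels):
        raise ValueError("need exactly one label per frame")

    reduced: List[List[float]] = []
    kept_labels: List[str] = []
    start = 0
    # groupby yields one group per run of equal consecutive labels.
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        segment = frames[start:start + length]
        # Average the run dimension by dimension.
        reduced.append([sum(dim) / length for dim in zip(*segment)])
        kept_labels.append(label)
        start += length
    return reduced, kept_labels


# Toy input: five 2-dimensional frames with made-up phoneme labels.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
labels = ["AH", "AH", "B", "B", "B"]
reduced, kept = average_by_phoneme(frames, labels)
print(reduced)  # [[2.0, 3.0], [7.0, 8.0]]
print(kept)     # ['AH', 'B']
```

In this toy case five frames shrink to two vectors, one per run of equal labels, which is the sense in which the source sequence becomes shorter. A label that recurs after a different one starts a new run, so it produces its own vector.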
@inproceedings{salesky-etal-2019-exploring,
    title = "Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation",
    author = "Salesky, Elizabeth and Sperber, Matthias and Black, Alan W",
    editor = "Korhonen, Anna and Traum, David and M{\`a}rquez, Llu{\'i}s",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P19-1179/",
    doi = "10.18653/v1/P19-1179",
    pages = "1835--1841",
    abstract = "Previous work on end-to-end translation from speech has primarily used frame-level features as speech representations, which creates longer, sparser sequences than text. We show that a naive method to create compressed phoneme-like speech representations is far more effective and efficient for translation than traditional frame-level speech features. Specifically, we generate phoneme labels for speech frames and average consecutive frames with the same label to create shorter, higher-level source sequences for translation. We see improvements of up to 5 BLEU on both our high and low resource language pairs, with a reduction in training time of 60{\%}. Our improvements hold across multiple data sizes and two language pairs."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="salesky-etal-2019-exploring">
    <titleInfo>
      <title>Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Elizabeth</namePart>
      <namePart type="family">Salesky</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Matthias</namePart>
      <namePart type="family">Sperber</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Alan</namePart>
      <namePart type="given">W</namePart>
      <namePart type="family">Black</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2019-07</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Anna</namePart>
        <namePart type="family">Korhonen</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">David</namePart>
        <namePart type="family">Traum</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Lluís</namePart>
        <namePart type="family">Màrquez</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Florence, Italy</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Previous work on end-to-end translation from speech has primarily used frame-level features as speech representations, which creates longer, sparser sequences than text. We show that a naive method to create compressed phoneme-like speech representations is far more effective and efficient for translation than traditional frame-level speech features. Specifically, we generate phoneme labels for speech frames and average consecutive frames with the same label to create shorter, higher-level source sequences for translation. We see improvements of up to 5 BLEU on both our high and low resource language pairs, with a reduction in training time of 60%. Our improvements hold across multiple data sizes and two language pairs.</abstract>
    <identifier type="citekey">salesky-etal-2019-exploring</identifier>
    <identifier type="doi">10.18653/v1/P19-1179</identifier>
    <location>
      <url>https://aclanthology.org/P19-1179/</url>
    </location>
    <part>
      <date>2019-07</date>
      <extent unit="page">
        <start>1835</start>
        <end>1841</end>
      </extent>
    </part>
  </mods>
</modsCollection>
%0 Conference Proceedings
%T Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation
%A Salesky, Elizabeth
%A Sperber, Matthias
%A Black, Alan W.
%Y Korhonen, Anna
%Y Traum, David
%Y Màrquez, Lluís
%S Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
%D 2019
%8 July
%I Association for Computational Linguistics
%C Florence, Italy
%F salesky-etal-2019-exploring
%X Previous work on end-to-end translation from speech has primarily used frame-level features as speech representations, which creates longer, sparser sequences than text. We show that a naive method to create compressed phoneme-like speech representations is far more effective and efficient for translation than traditional frame-level speech features. Specifically, we generate phoneme labels for speech frames and average consecutive frames with the same label to create shorter, higher-level source sequences for translation. We see improvements of up to 5 BLEU on both our high and low resource language pairs, with a reduction in training time of 60%. Our improvements hold across multiple data sizes and two language pairs.
%R 10.18653/v1/P19-1179
%U https://aclanthology.org/P19-1179/
%U https://doi.org/10.18653/v1/P19-1179
%P 1835-1841