Automatic closed-captioning of video is a useful application of speech recognition technology but poses numerous challenges when applied to open-domain user-uploaded videos such as those on YouTube. In this work, we explore a strategy to improve decoding accuracy for video transcription by decoding each video with a language model (LM) adapted specifically to the topics that the video covers. Taxonomic topic classifiers are used to determine the topic content of videos and to build a large set of topic-specific LMs from web documents. We consider strategies for selecting and interpolating LMs in both supervised and unsupervised scenarios in a two-pass lattice rescoring framework. Experiments on a YouTube video corpus show a 10% relative reduction in WER over generic single-pass transcriptions as well as a statistically significant 2.5% reduction over rescoring with a very large non-adapted LM built from all the documents.
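The abstract does not spell out how topic-specific LMs are interpolated or applied in the second pass, so the following is only a minimal sketch under stated assumptions. It linearly interpolates toy unigram LMs using topic weights and rescores a short hypothesis list. N-best rescoring stands in here for the lattice rescoring the paper uses. All names (`interpolate`, `log_prob`, `rescore`), the toy probabilities, the topic weights, and the first-pass scores are hypothetical and are not taken from the paper.

```python
import math

def interpolate(lms, weights):
    """Linear interpolation: P(w) = sum_k lambda_k * P_k(w), lambdas normalized to 1."""
    total = sum(weights.values())
    vocab = set().union(*(lm.keys() for lm in lms.values()))
    return {
        w: sum(weights[t] / total * lms[t].get(w, 0.0) for t in weights)
        for w in vocab
    }

def log_prob(lm, words, floor=1e-6):
    """Unigram log-probability of a word sequence; unseen words get a small floor."""
    return sum(math.log(max(lm.get(w, 0.0), floor)) for w in words)

def rescore(hypotheses, lm, lm_weight=1.0):
    """Return the (words, first_pass_score) pair maximizing score + lm_weight * LM log-prob."""
    return max(hypotheses, key=lambda h: h[1] + lm_weight * log_prob(lm, h[0]))

# Hypothetical topic-specific unigram LMs (toy values, not from the paper).
lms = {
    "cooking": {"bake": 0.5, "the": 0.4, "cake": 0.1},
    "sports": {"the": 0.4, "break": 0.3, "game": 0.3},
}

# Hypothetical topic weights, e.g. scores from a topic classifier for one video.
weights = {"cooking": 0.8, "sports": 0.2}
adapted = interpolate(lms, weights)

# Hypothetical first-pass hypotheses with log-domain scores.
hypotheses = [
    (["break", "the", "cake"], -1.0),
    (["bake", "the", "cake"], -1.2),
]
best = rescore(hypotheses, adapted)
print(" ".join(best[0]))
```

In this toy setup the first pass slightly prefers the acoustically similar "break the cake", while the topic-adapted LM shifts the decision toward the in-topic hypothesis. A real system would use n-gram LMs and operate on lattices rather than a two-item list.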
@inproceedings{thadani12_interspeech,
  title     = {On-the-fly topic adaptation for YouTube video transcription},
  author    = {Kapil Thadani and Fadi Biadsy and Dan Bikel},
  year      = {2012},
  booktitle = {Interspeech 2012},
  pages     = {210--213},
  doi       = {10.21437/Interspeech.2012-69},
  issn      = {2958-1796},
}
Cite as: Thadani, K., Biadsy, F., Bikel, D. (2012) On-the-fly topic adaptation for YouTube video transcription. Proc. Interspeech 2012, 210-213, doi: 10.21437/Interspeech.2012-69