In this paper we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data. Our model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout, and residual connections. To improve training, we further introduce a new layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that the proposed deep architecture performs as well or better than more complex choices. Our deepest Jasper variant uses 54 convolutional layers. With this architecture, we achieve 2.95% WER using a beam-search decoder with an external neural language model and 3.86% WER with a greedy decoder on LibriSpeech test-clean. We also report competitive results on Wall Street Journal and the Hub5'00 conversational evaluation datasets.
@inproceedings{li19_interspeech,
  title     = {{Jasper: An End-to-End Convolutional Neural Acoustic Model}},
  author    = {Jason Li and Vitaly Lavrukhin and Boris Ginsburg and Ryan Leary and Oleksii Kuchaiev and Jonathan M. Cohen and Huyen Nguyen and Ravi Teja Gadde},
  year      = {2019},
  booktitle = {{Interspeech 2019}},
  pages     = {71--75},
  doi       = {10.21437/Interspeech.2019-1819},
  issn      = {2958-1796},
}

Cite as: Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T. (2019) Jasper: An End-to-End Convolutional Neural Acoustic Model. Proc. Interspeech 2019, 71-75, doi: 10.21437/Interspeech.2019-1819