Automatic speech recognition (ASR) performance has greatly improved with the introduction of convolutional neural networks (CNNs) or long short-term memory (LSTM) for acoustic modeling. Recently, a convolutional LSTM (CLSTM) has been proposed to directly use convolution operations within the LSTM blocks and combine the advantages of both CNN and LSTM structures into a single architecture. This paper presents the first attempt to use CLSTMs for acoustic modeling. In addition, we propose a new forward-backward architecture to exploit long-term left/right context efficiently. The proposed scheme combines forward and backward LSTMs at different time points of an utterance with the aim of modeling long-term frame-invariant information such as speaker characteristics, channel characteristics, etc. Furthermore, the proposed forward-backward architecture can be trained with truncated back-propagation-through-time, unlike conventional bidirectional LSTM (BLSTM) architectures. Therefore, we are able to train deeply stacked CLSTM acoustic models, which is practically challenging with conventional BLSTMs. Experimental results show that both CLSTM and forward-backward LSTM improve word error rates significantly compared to standard CNN and LSTM architectures.
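The core idea of a CLSTM block, as the abstract describes it, is to replace the matrix multiplications inside the LSTM gates with convolutions. The following is a minimal single-channel sketch of one such time step (not the paper's actual implementation; the function and parameter names, and the reduction to 1-D "same" convolutions without biases, are simplifying assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d(x, w):
    # "same"-padded 1-D convolution along the feature axis
    # (single channel, for brevity).
    return np.convolve(x, w, mode="same")

def clstm_step(x, h_prev, c_prev, params):
    """One convolutional-LSTM time step: the standard LSTM gate
    equations, but with convolutions in place of matrix products.
    `params` holds hypothetical per-gate filter kernels."""
    Wxi, Whi, Wxf, Whf, Wxc, Whc, Wxo, Who = params
    i = sigmoid(conv1d(x, Wxi) + conv1d(h_prev, Whi))  # input gate
    f = sigmoid(conv1d(x, Wxf) + conv1d(h_prev, Whf))  # forget gate
    g = np.tanh(conv1d(x, Wxc) + conv1d(h_prev, Whc))  # cell candidate
    o = sigmoid(conv1d(x, Wxo) + conv1d(h_prev, Who))  # output gate
    c = f * c_prev + i * g        # cell state update
    h = o * np.tanh(c)            # hidden state (local spatial structure kept)
    return h, c
```

Because the state update is still strictly recurrent in time, a stack of such steps can be trained with truncated back-propagation-through-time, which is the property the forward-backward architecture exploits.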
@inproceedings{karita17_interspeech,
  title     = {Forward-Backward Convolutional LSTM for Acoustic Modeling},
  author    = {Shigeki Karita and Atsunori Ogawa and Marc Delcroix and Tomohiro Nakatani},
  year      = {2017},
  booktitle = {Interspeech 2017},
  pages     = {1601--1605},
  doi       = {10.21437/Interspeech.2017-554},
  issn      = {2958-1796},
}
Cite as: Karita, S., Ogawa, A., Delcroix, M., Nakatani, T. (2017) Forward-Backward Convolutional LSTM for Acoustic Modeling. Proc. Interspeech 2017, 1601-1605, doi: 10.21437/Interspeech.2017-554