Michael Picheny, Zoltan Tuske, et al.
INTERSPEECH 2019
The paper presents our effort to improve state-of-the-art speech recognition results using attention-based neural network approaches. Our test focus was LibriSpeech, a well-known, publicly available, large speech corpus, but the methodologies are clearly applicable to other tasks. After systematic application of standard techniques (sophisticated data augmentation, various dropout schemes, scheduled sampling, and warm restart) and optimization of the search configuration, our model achieves 4.0% and 11.7% word error rate (WER) on the test-clean and test-other sets without any external language model. A powerful recurrent language model drops the error rate further to 2.7% and 8.2%. Thus, we not only report the lowest sequence-to-sequence numbers on this task to date, but our single system even challenges the best result known in the literature, namely a hybrid model with recurrent language model rescoring. A simple ROVER combination of several of our attention-based systems achieved 2.5% and 7.3% WER on the clean and other test sets.
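For reference, the WER figures quoted above are the word-level Levenshtein edit distance between hypothesis and reference transcripts, normalized by the reference length. A minimal sketch of that metric is below; the function name and example strings are illustrative, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # deleting all reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# 2 edits (1 substitution, 1 deletion) over 6 reference words -> ~0.333
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```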
George Saon, Tom Sercu, et al.
INTERSPEECH 2016
George Saon
SLT 2014
Thomas Bohnstingl, Ayush Garg, et al.
ICASSP 2022