Michael Picheny, Zoltan Tuske, et al.
INTERSPEECH 2019
The paper presents our effort to improve state-of-the-art speech recognition results using attention-based neural network approaches. Our test focus was LibriSpeech, a well-known, publicly available, large speech corpus, but the methodologies are clearly applicable to other tasks. After systematic application of standard techniques (sophisticated data augmentation, various dropout schemes, scheduled sampling, and warm restart) and optimization of the search configuration, our model achieves 4.0% and 11.7% word error rate (WER) on the test-clean and test-other sets without any external language model. A powerful recurrent language model drops the error rate further to 2.7% and 8.2%. Thus, we not only report the lowest sequence-to-sequence numbers on this task to date, but our single system even challenges the best result known in the literature, namely a hybrid model with recurrent language model rescoring. A simple ROVER combination of several of our attention-based systems achieved 2.5% and 7.3% WER on the clean and other test sets.
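For reference, the WER figures quoted above are the word-level Levenshtein edit distance between hypothesis and reference transcripts, normalized by the reference length. A minimal sketch of that metric is below; the function name and example strings are illustrative, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # deleting all reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# 2 edits (1 substitution, 1 deletion) over 6 reference words -> ~0.333
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```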
George Saon, Tom Sercu, et al.
INTERSPEECH 2016
George Saon
SLT 2014
Thomas Bohnstingl, Ayush Garg, et al.
ICASSP 2022