W.H. Adams, Giridharan Iyengar, et al.
EURASIP Journal on Applied Signal Processing
We investigate the use of visual, mouth-region information in improving automatic speech recognition (ASR) of the speech impaired. Given the video of an utterance by such a subject, we first extract appearance-based visual features from the mouth region of interest, and we use a feature fusion method to combine them with the subject's audio features into bimodal observations. Subsequently, we adapt the parameters of a speaker-independent, audio-visual hidden Markov model, trained on a large database of hearing subjects, to the audio-visual features extracted from the speech-impaired subject's videos. We consider a number of speaker adaptation techniques and study their performance in the case of a single speech-impaired subject uttering continuous read speech as well as connected digits. For both tasks, maximum-a-posteriori adaptation followed by maximum likelihood linear regression performs best, achieving relative word error rate reductions of 61% and 96%, respectively, over unadapted audio-visual ASR, and of 13% and 58% over audio-only speaker-adapted ASR. In addition, we compare audio-only and audio-visual speaker-adapted ASR of the single speech-impaired subject to ASR of subjects with normal speech, over a wide range of audio-channel signal-to-noise ratios. Interestingly, for the small-vocabulary connected-digits task, audio-visual ASR performance is almost identical across the two populations.
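For orientation, the two adaptation steps named in the abstract (MAP followed by MLLR) have standard textbook forms, sketched here in generic notation that is not taken from the paper itself. In feature fusion, the bimodal observation is commonly the concatenation of time-aligned audio and visual feature vectors, $o_t = [\,o_t^{a\top},\ o_t^{v\top}\,]^{\top}$, optionally followed by a dimensionality-reducing projection. MAP adaptation then interpolates each speaker-independent Gaussian mean $\mu_{jm}$ (state $j$, mixture component $m$) with the posterior-weighted mean $\bar{x}_{jm}$ of the adaptation data, using the occupancy count $N_{jm}$ and a prior weight $\tau$:

$\hat{\mu}_{jm} = \dfrac{N_{jm}\,\bar{x}_{jm} + \tau\,\mu_{jm}}{N_{jm} + \tau}.$

MLLR subsequently applies an affine transform, shared across a regression class of Gaussians and estimated by maximizing the likelihood of the adaptation data:

$\hat{\mu}_{jm} = A\,\mu_{jm} + b = W\,\xi_{jm}, \qquad \xi_{jm} = [\,1,\ \mu_{jm}^{\top}\,]^{\top},\quad W = [\,b\ \ A\,].$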
Djamel Mostefa, Nicolas Moreau, et al.
Language Resources and Evaluation
Iain Matthews, Gerasimos Potamianos, et al.
ICME 2001
Jintao Jiang, Gerasimos Potamianos, et al.
ICME 2005