David Cash, Dennis Hofheinz, et al.
Journal of Cryptology
Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR which combines the strengths of a pre-trained audio model (Whisper) and video model (AV-HuBERT). To enable better multi-modal integration and improve the noisy multilingual performance, we introduce decoder modality dropout where the model is trained both on paired audio-visual inputs and separate audio/visual inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset of 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.
David Cash, Dennis Hofheinz, et al.
Journal of Cryptology
J.P. Locquet, J. Perret, et al.
SPIE Optical Science, Engineering, and Instrumentation 1998
Satoshi Hada
IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences
A. Gupta, R. Gross, et al.
SPIE Advances in Semiconductors and Superconductors 1990