Oznur Alkan, Massimiliano Mattetti, et al.
INFORMS 2020
This paper considers schemes for determining which of a set of faces on screen, if any, is producing the speech in a video soundtrack. Whilst motivated by the TREC 2002 Video Retrieval Track monologue detection task, the schemes are also applicable to voice- and face-based biometric systems, to assessing lip-synchronization quality in movie editing and computer animation, and to speaker localization in video. Several approaches are discussed: two implementations of a generic mutual-information-based measure of the degree of synchrony between signals, which can be used with or without prior speech and face detection, and a stronger model-based scheme which follows speech and face detection with an assessment of face and lip movement plausibility. The schemes are compared on a corpus of 1016 test cases containing multiple faces and multiple speakers, a test set 200 times larger than the nearest comparable test set of which we are aware. The most successful and computationally cheapest scheme obtains an accuracy of 82% on the task of picking the "consistent" speaker from a set including three confusers. A final experiment demonstrates the potential utility of the scheme for speaker localization in video.
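The mutual-information-based synchrony measure mentioned in the abstract can be illustrated with a minimal sketch: extract an aligned per-frame audio feature (e.g., log energy) and a per-frame mouth-region feature for each candidate face (e.g., mean pixel change), estimate I(audio; video) from their joint histogram, and pick the face with the highest score. The feature choices, bin count, and function names below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def mutual_information(a, v, bins=16):
    """Estimate I(A; V) in bits from a joint histogram of two aligned
    per-frame feature sequences (illustrative, histogram-based estimator)."""
    joint, _, _ = np.histogram2d(a, v, bins=bins)
    pxy = joint / joint.sum()                      # joint distribution
    px = pxy.sum(axis=1, keepdims=True)            # marginal over audio bins
    py = pxy.sum(axis=0, keepdims=True)            # marginal over video bins
    nz = pxy > 0                                   # avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

def pick_speaker(audio_energy, mouth_motion_per_face, bins=16):
    """Return the index of the face whose mouth-region feature track has
    the highest mutual information with the audio energy track."""
    scores = [mutual_information(audio_energy, m, bins) for m in mouth_motion_per_face]
    return int(np.argmax(scores)), scores

# Toy usage: three candidate faces over 250 aligned frames; face 0 is
# synthetically correlated with the audio, the other two are confusers.
rng = np.random.default_rng(0)
audio = rng.normal(size=250)
faces = [audio + rng.normal(scale=1.0, size=250),
         rng.normal(size=250),
         rng.normal(size=250)]
best, scores = pick_speaker(audio, faces)
print(best, [round(s, 3) for s in scores])
```

Because this sketch needs no face or speech detector, it corresponds to the "without prior detection" use of the generic measure; the model-based scheme in the paper adds detection and a plausibility assessment on top of such features.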
Casey Dugan, Werner Geyer, et al.
CHI 2010
Rajesh Balchandran, Leonid Rachevsky, et al.
INTERSPEECH 2009
Elaine Hill
Human-Computer Interaction