C. Neti, Salim Roukos
ASRU 1997
We describe methods for automatic labeling of high-level semantic concepts in documentary-style videos. The emphasis of this paper is on audio processing and on fusing information from multiple modalities. The work described is an initial step toward a trainable system that acquires a collection of generic "intermediate" semantic concepts across modalities (such as audio, video, and text) and combines information from these modalities for automatic labeling of a "high-level" concept. Initial results suggest that multi-modal fusion achieves a 12.5% relative improvement over the best unimodal model.
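The abstract does not specify how the per-modality evidence is combined; the sketch below illustrates one common approach, a weighted late fusion of unimodal concept scores. All names (fuse_concept_scores, the modality weights, the 0.5 threshold) are hypothetical and chosen only for illustration, not taken from the paper.

```python
# Illustrative sketch only: assumes weighted late fusion of per-modality
# concept scores, one common way to combine unimodal classifier outputs.
from typing import Dict


def fuse_concept_scores(
    modality_scores: Dict[str, float],   # e.g. {"audio": 0.7, "video": 0.4, "text": 0.9}
    modality_weights: Dict[str, float],  # hypothetical per-modality reliability weights
) -> float:
    """Return a fused confidence score for a single high-level concept."""
    total_weight = sum(modality_weights[m] for m in modality_scores)
    fused = sum(modality_weights[m] * s for m, s in modality_scores.items())
    return fused / total_weight if total_weight > 0 else 0.0


# Example: label the concept as present if the fused score passes a threshold.
scores = {"audio": 0.7, "video": 0.4, "text": 0.9}
weights = {"audio": 1.0, "video": 0.5, "text": 1.5}
print(fuse_concept_scores(scores, weights) > 0.5)  # True (fused score 0.75)
```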