Gang Wang, Fei Wang, et al.
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics
In this paper we investigate discriminative training of models and feature space for a multi-stream hidden Markov model (HMM) based audio-visual speech recognizer (AVSR). Since the two streams are used together in decoding, we propose to train the parameters of the two streams jointly. This is in contrast to prior work, which has considered discriminative training of the parameters in each stream independently of the other. In experiments on a 20-speaker, one-hour, speaker-independent test set, we obtain a 22% relative gain in AVSR performance over A/V models whose parameters are trained separately, and a 50% relative gain over the baseline maximum-likelihood models. On a noisy test set (mismatched to training), we obtain a 21% relative gain over A/V models whose parameters are trained separately. This represents a 30% relative improvement over the maximum-likelihood baseline. Copyright © 2009 ISCA.
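The relative gains quoted in the abstract can be cross-checked with a little arithmetic. The sketch below assumes that "relative gain" means relative error reduction (for example, in word error rate), which the abstract does not state explicitly. The function name `implied_relative_gain` is hypothetical. Under that assumption, it derives the gain that separately trained A/V models would have over the maximum-likelihood baseline. That figure is implied by the reported numbers, not reported in the abstract itself.

```python
def implied_relative_gain(gain_joint_vs_separate, gain_joint_vs_baseline):
    """Relative error reduction of separately trained models over the baseline.

    Assumes each gain g means: error_joint = (1 - g) * error_reference.
    """
    # Joint-system error as a fraction of each reference system's error.
    joint_over_separate = 1.0 - gain_joint_vs_separate
    joint_over_baseline = 1.0 - gain_joint_vs_baseline
    # Dividing the two ratios cancels the joint-system error.
    separate_over_baseline = joint_over_baseline / joint_over_separate
    return 1.0 - separate_over_baseline


# Figures taken from the abstract: (gain vs. separate, gain vs. ML baseline).
clean = implied_relative_gain(0.22, 0.50)
noisy = implied_relative_gain(0.21, 0.30)
print(f"clean: {clean:.1%}, noisy: {noisy:.1%}")
```

If the assumption holds, separately trained discriminative models would account for roughly a 36% relative gain over the baseline on the clean test set and roughly 11% on the noisy one, with joint training supplying the remainder.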
Atsuyoshi Nakamura, Naoki Abe
Electronic Commerce Research
Kun Wang, Juwei Shi, et al.
PACT 2011
Benny Kimelfeld, Yehoshua Sagiv
ICDT 2013