Haiping Li, Fangxin Chen, et al.
ICASSP 2003
In this paper, we present a conditional pronunciation modeling method for the speaker detection task that does not rely on acoustic vectors. Aiming to exploit higher-level information carried by the speech signal, it models a speaker's specific pronunciation using time-aligned streams of phones and phonemes. The phoneme stream is generated from a lexicon of pronunciations of the words hypothesized by an automatic speech recognition system; the phone stream is generated by an open-loop phone recognizer. The two streams are aligned at the frame level, and the conditional probability of each phone given a phoneme is estimated from co-occurrence counts. A likelihood detector is then applied to these probabilities. Performance is measured using the NIST Extended Data paradigm and the Switchboard-I corpus. Using 8 training conversations for enrollment, a 2.1% equal error rate was achieved. Extensions and alternatives, as well as fusion experiments, are presented and discussed.
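The core of the method described above, estimating P(phone | phoneme) from frame-aligned streams via co-occurrence counts and then scoring a test conversation with a log-likelihood ratio against a background model, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names, the probability floor, and the toy symbols are assumptions.

```python
from collections import Counter, defaultdict
import math

def conditional_probs(phonemes, phones):
    """Estimate P(phone | phoneme) from two frame-aligned symbol streams
    using simple co-occurrence counts, as in the paper's description."""
    pair_counts = Counter(zip(phonemes, phones))   # (phoneme, phone) co-occurrences
    phoneme_counts = Counter(phonemes)             # marginal phoneme counts
    probs = defaultdict(dict)
    for (pm, ph), c in pair_counts.items():
        probs[pm][ph] = c / phoneme_counts[pm]
    return probs

def llr_score(phonemes, phones, speaker_model, background_model, floor=1e-6):
    """Average log-likelihood ratio of a test conversation's aligned frames
    under the speaker model vs. a background model; `floor` handles unseen
    (phoneme, phone) pairs and is an assumed smoothing choice."""
    score = 0.0
    for pm, ph in zip(phonemes, phones):
        p_spk = speaker_model.get(pm, {}).get(ph, floor)
        p_bkg = background_model.get(pm, {}).get(ph, floor)
        score += math.log(p_spk / p_bkg)
    return score / len(phonemes)
```

A speaker who often realizes phoneme "AH" as the reduced phone "ax" while the background population mostly produces "ah" would accumulate positive log-likelihood ratio on those frames; the detector thresholds this average score.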