Tetsuya Takiguchi, Masafumi Nishimura
ICASSP 2004
Accurate voice activity detection (VAD) is important for robust automatic speech recognition (ASR) systems. This paper proposes a statistical-model-based, noise-robust VAD algorithm that uses long-term temporal information and harmonic-structure-based features in speech. Long-term temporal information has recently become a focus of ASR research, but has not yet been deeply investigated for VAD. In this paper, we first consider temporal features in the cepstral domain calculated over the average phoneme duration. Harmonic structures, in contrast, are well-known carriers of acoustic information in human voices, but that information is difficult to exploit statistically. This paper further describes a new method for exploiting harmonic structure information with statistical models, providing additional noise robustness. The proposed method, which combines the long-term temporal and the static harmonic features, led to considerable improvements under low-SNR conditions, with an average error reduction of 77.7% compared with the ETSI AFE-VAD in our VAD testing. In addition, the word error rate was reduced by 29.1% in a test that included a full ASR system. © 2010 IEEE.
Toru Nakashika, Ryuki Tachibana, et al.
INTERSPEECH 2010
Gakuto Kurata, Abhinav Sethy, et al.
Speech Communication
Takashi Fukuda, Masayuki Suzuki, et al.
INTERSPEECH 2017