Frequency warping based on mapping formant parameters
Zhi-Wei Shuang, Raimo Bakis, et al.
ICSLP 2006
In statistical HMM-based text-to-speech systems (STTS), speech feature dynamics are modeled by first- and second-order feature frame differences, which typically do not satisfactorily represent the frame-to-frame feature dynamics present in natural speech. The reduced dynamics result in over-smoothing of speech features, which often makes synthesized speech sound muffled. In this correspondence, we propose a method to enhance a baseline STTS system by introducing a segment-wise model representation with a norm constraint. The segment-wise representation provides additional degrees of freedom in speech feature determination. We exploit these degrees of freedom to increase the speech feature vector norm so that it matches the norm constraint. As a result, statistically generated speech features are less over-smoothed, and the synthesized speech sounds more natural, as judged by listening tests. © 2006 IEEE.
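The two ideas named in the abstract, frame-difference dynamic features and matching a norm target, can be illustrated with a minimal NumPy sketch. This is not the authors' algorithm: the centered-difference window, the function names `deltas`, `dynamic_features`, and `rescale_to_norm`, and the choice to rescale the deviation from the segment mean are all assumptions made for illustration only.

```python
import numpy as np


def deltas(x):
    """First-order frame differences with a centered window.

    x has shape (frames, dims). Edge frames are replicated so the
    output keeps the same shape as the input (an assumed convention).
    """
    padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    return 0.5 * (padded[2:] - padded[:-2])


def dynamic_features(x):
    """Stack static, first-order, and second-order difference features."""
    d1 = deltas(x)
    d2 = deltas(d1)
    return np.hstack([x, d1, d2])


def rescale_to_norm(x, target_norm):
    """Scale a segment's deviation from its mean to a target norm.

    Hypothetical stand-in for a norm constraint: the segment mean is
    preserved and only the deviation around it is stretched.
    """
    mean = x.mean(axis=0, keepdims=True)
    dev = x - mean
    norm = np.linalg.norm(dev)
    if norm == 0.0:
        return x.copy()  # a constant segment cannot be rescaled
    return mean + dev * (target_norm / norm)


# Toy segment: 6 frames of 2-dimensional features.
segment = np.arange(12, dtype=float).reshape(6, 2)
features = dynamic_features(segment)
sharpened = rescale_to_norm(segment, target_norm=20.0)
```

In this sketch an over-smoothed segment would have a small deviation norm, and rescaling toward a larger target widens its trajectory without moving its mean.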