Dan Chazan, Ron Hoory, et al.
INTERSPEECH - Eurospeech 2005
In statistical HMM-based text-to-speech systems (STTS), speech feature dynamics are modeled by first- and second-order feature-frame differences, which typically do not satisfactorily represent the frame-to-frame feature dynamics present in natural speech. The reduced dynamics result in over-smoothing of the speech features, which often makes the synthesized speech sound muffled. In this correspondence, we propose a method to enhance a baseline STTS system by introducing a segment-wise model representation with a norm constraint. The segment-wise representation provides additional degrees of freedom in speech feature determination. We exploit these degrees of freedom to increase the speech feature vector norm so that it matches a norm constraint. As a result, statistically generated speech features are less over-smoothed, producing more natural sounding speech, as judged by listening tests. © 2006 IEEE.
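The core idea in the abstract above can be illustrated with a minimal sketch: rescale a segment's statistically generated feature vectors so their norm matches a target norm, counteracting over-smoothing. All names and values here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def apply_norm_constraint(features, target_norm):
    """Scale a segment's feature matrix (frames x dims) so its
    overall Euclidean norm matches target_norm.
    Hypothetical helper, not the paper's actual algorithm."""
    current_norm = np.linalg.norm(features)
    if current_norm == 0.0:
        return features
    return features * (target_norm / current_norm)

# Over-smoothed segment features with reduced dynamics (illustrative data)
segment = np.array([[0.50, 0.20],
                    [0.55, 0.22],
                    [0.52, 0.21]])

# Match a target norm, e.g. one that could be estimated from natural speech
enhanced = apply_norm_constraint(segment, target_norm=2.0)
print(np.linalg.norm(enhanced))  # → 2.0
```

Uniform scaling preserves the relative frame-to-frame shape of the segment while restoring overall feature magnitude; the paper's segment-wise representation provides the extra degrees of freedom that make such a constraint usable during statistical feature generation.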
Slava Shechtman
SSW 2007
Stas Tiomkin, David Malah, et al.
IEEE Transactions on Audio, Speech and Language Processing
Yosi Mass, Slava Shechtman, et al.
INTERSPEECH 2018