Uniform speech parameterization for multi-form segment synthesis
Alexander Sorin, Slava Shechtman, et al.
INTERSPEECH 2011
Controllable generation of emphasis in speech is desirable for expressive TTS systems used in various dialog applications. Existing emphasis generation models usually remain voice-specific, and the strength of emphasis cannot be readily controlled. In this work, we present a flexible emphatic prosody generation model based on Deep Recurrent Neural Networks (DRNN) for controllable word-level emphasis realization. The word emphasis DRNN model was trained on syllable-level piecewise linear prosodic trajectory parameters. A special data preprocessing technique was introduced to enable emphasis strength control, allowing the generation of emphatic prosody trajectories of varying strength. Additionally, we trained a DRNN model that generates sentence-level emphasis, i.e., produces whole sentences in a forceful, decisive manner. Both models preserve the quality and naturalness of the baseline TTS output.