Takashi Fukuda, Samuel Thomas
INTERSPEECH 2021
Although Sequence-to-Sequence (S2S) architectures have become state-of-the-art in speech synthesis, the best models benefit from access to moderate-to-large amounts of training data, posing a resource bottleneck when we are interested in generating speech in a variety of expressive styles. In this work we explore an S2S architecture variant that is capable of generating a variety of stylistic expressive variations observed in a limited amount of training data, and of transplanting that style to a neutral target speaker for whom no labeled expressive resources exist. The architecture is furthermore controllable, allowing the user to select an operating point that conveys a desired level of expressiveness. We evaluate this proposal against a classically supervised baseline via perceptual listening tests, and demonstrate that i) it is able to outperform the baseline in terms of its generalizability to neutral speakers, ii) it is strongly preferred in terms of its ability to convey expressiveness, and iii) it provides a reasonable trade-off between expressiveness and naturalness, allowing the user to tune it to the particular demands of a given application.
Asaf Rendel, Raul Fernandez, et al.
ICASSP 2016
Andrew Rosenberg, Raul Fernandez, et al.
ICASSP 2018
Slava Shechtman
ICASSP 2013