Synthesis of expressive speaking styles with limited training data in a multi-speaker, prosody-controllable sequence-to-sequence architecture

Slava Shechtman; Raul Fernandez; Alexander Sorin; David Haws

doi:10.21437/Interspeech.2021-1446

INTERSPEECH 2021

Conference paper

30 Aug 2021

Abstract

Although Sequence-to-Sequence (S2S) architectures have become state-of-the-art in speech synthesis, the best models benefit from access to moderate-to-large amounts of training data, which poses a resource bottleneck when the goal is to generate speech in a variety of expressive styles. In this work we explore an S2S architecture variant that is capable of generating a variety of expressive stylistic variations observed in a limited amount of training data, and of transplanting those styles to a neutral target speaker for whom no labeled expressive resources exist. The architecture is furthermore controllable, allowing the user to select an operating point that conveys a desired level of expressiveness. We evaluate this proposal against a classically supervised baseline via perceptual listening tests, and demonstrate that (i) it outperforms the baseline in terms of its generalizability to neutral speakers, (ii) it is strongly preferred in terms of its ability to convey expressiveness, and (iii) it provides a reasonable trade-off between expressiveness and naturalness, allowing the user to tune it to the particular demands of a given application.
