Kernel methods match deep neural networks on TIMIT
Po-Sen Huang, Haim Avron, et al.
ICASSP 2014
Predicting prosodic events, such as pitch accents and phrasal boundaries, often relies on machine-learning models that combine input features aggregated over a finite, and usually short, number of observations to model context. Dynamic models go a step further by explicitly incorporating a model of the state sequence, but even then many practical implementations are limited to a low-order finite-state machine. This Markovian assumption, however, does not properly capture the interaction between short- and long-term contextual factors that is known to affect the realization and placement of these prosodic events. Bidirectional Recurrent Neural Networks (BiRNNs) overcome this limitation by predicting the outputs as a function of a state variable that accumulates information over the entire input sequence, and by stacking several layers to form a deep architecture able to extract more structure from the input features. These models have already demonstrated state-of-the-art performance on some prosodic regression tasks. In this work we examine a new application of BiRNNs, the classification of categorical prosodic events, and demonstrate that they outperform baseline systems.
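For illustration only, and not the system described in the abstract: a minimal sketch of a stacked bidirectional recurrent tagger that assigns a categorical prosodic-event label to every position of an input feature sequence. It assumes PyTorch, uses LSTM cells as one possible BiRNN flavor, and all dimensions, the feature set, and the label inventory are hypothetical.

```python
# Minimal sketch, assuming PyTorch; not the paper's implementation.
# A stacked bidirectional LSTM reads one feature vector per unit (e.g., per
# syllable or word) and emits per-position class scores (e.g., pitch accent
# vs. none). Feature dimension, hidden size, depth, and number of classes
# are illustrative choices.
import torch
import torch.nn as nn

class BiRNNProsodyTagger(nn.Module):
    def __init__(self, feat_dim=32, hidden=64, layers=2, n_classes=2):
        super().__init__()
        # Each layer's hidden states accumulate left and right context over
        # the whole sequence, rather than a fixed, low-order Markov window.
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=layers,
                           bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_classes)  # forward + backward states

    def forward(self, x):           # x: (batch, seq_len, feat_dim)
        h, _ = self.rnn(x)          # h: (batch, seq_len, 2 * hidden)
        return self.out(h)          # per-position class scores

# Usage: score two sequences of 10 units with 32-dimensional features.
model = BiRNNProsodyTagger()
scores = model(torch.randn(2, 10, 32))   # shape: (2, 10, 2)
```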
Bhuvana Ramabhadran, Jing Huang, et al.
INTERSPEECH - Eurospeech 2003
Asaf Rendel, Raul Fernandez, et al.
ICASSP 2016
Tara N. Sainath, Avishy Carmi, et al.
ICASSP 2010