Towards More Likely Models for AI Planning
Turguy Caglar, Sirine Belhaj, et al.
IJCAI 2023
Large-scale auto-regressive language models pretrained on massive text corpora have demonstrated an impressive ability to perform new natural language tasks given only a few text examples, without any fine-tuning. Recent studies further show that this few-shot learning ability can be extended to the text-image setting by training an encoder that maps images into embeddings functioning like the text embeddings of the language model. Interested in transferring this few-shot learning ability to the audio-text setting, we propose a novel speech understanding framework, WAVPROMPT, in which we finetune a wav2vec model to generate a sequence of audio embeddings understood by the language model. We show that WAVPROMPT is a few-shot learner that performs speech understanding tasks better than a naïve text baseline. We conduct detailed ablation studies on different components and hyperparameters to empirically identify the best model configuration. In addition, we conduct a non-speech understanding experiment to show that WAVPROMPT can extract more information than just the transcriptions. The source code is available at https://github.com/Hertin/WavPrompt.
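The core idea in the abstract — mapping audio into embeddings that live in the same space as the frozen language model's text embeddings, then concatenating the two into one prompt sequence — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear projection stands in for the finetuned wav2vec encoder, and all dimensions, names, and the frame size are assumptions.

```python
import numpy as np

# Hedged sketch of the WAVPROMPT idea: an audio encoder (here a stand-in
# linear projection in place of a finetuned wav2vec model) maps a waveform
# to a sequence of embeddings with the same dimensionality as the frozen
# language model's text embeddings, so the two can be concatenated into a
# single prompt sequence. All shapes and names are illustrative assumptions.

EMB_DIM = 16     # embedding size of the (hypothetical) frozen LM
FRAME_LEN = 160  # samples per audio frame (assumed hop size)

rng = np.random.default_rng(0)
proj = rng.standard_normal((FRAME_LEN, EMB_DIM)) * 0.01  # stand-in encoder weights

def encode_audio(waveform: np.ndarray) -> np.ndarray:
    """Chunk the waveform into frames and project each frame to EMB_DIM,
    mimicking how wav2vec features would be mapped into the LM's space."""
    n_frames = len(waveform) // FRAME_LEN
    frames = waveform[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)
    return frames @ proj  # shape: (n_frames, EMB_DIM)

def build_prompt(audio_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Prepend audio embeddings to text-prompt embeddings; the frozen LM
    would then attend over the combined sequence."""
    return np.concatenate([audio_emb, text_emb], axis=0)

waveform = rng.standard_normal(1600)          # ~0.1 s of fake 16 kHz audio
text_emb = rng.standard_normal((5, EMB_DIM))  # 5 text tokens from the LM's embedder
prompt = build_prompt(encode_audio(waveform), text_emb)
print(prompt.shape)  # 10 audio frames + 5 text tokens, each of size EMB_DIM
```

Because the audio embeddings share the language model's embedding space, the frozen model can condition on them exactly as it would on extra text tokens, which is what enables the few-shot behavior described above.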
Eduardo Almeida Soares, Dmitry Zubarev, et al.
ICLR 2025
Srikanth Tamilselvam, Dinesh Khandelwal, et al.
ACML 2022
Eduardo Almeida Soares, Victor Shirasuna, et al.
ACS Fall 2024