Shivashankar Subramanian, Ioana Baldini, et al.
IAAI 2020
Data balancing is a known technique for improving the performance of classification tasks. In this work we define a novel balancing-via-generation framework termed BalaGen. BalaGen consists of a flexible balancing policy coupled with a text generation mechanism. Combined, these two techniques can be used to augment a dataset for more balanced distribution. We evaluate BalaGen on three publicly available semantic utterance classification (SUC) datasets. One of these is a new COVID-19 Q&A dataset published here for the first time. Our work demonstrates that optimal balancing policies can significantly improve classifier performance, while augmenting just part of the classes and under-sampling others. Furthermore, capitalizing on the advantages of balancing, we show its usefulness in all relevant BalaGen framework components. We validate the superiority of BalaGen on ten semantic utterance datasets taken from real-life goal-oriented dialogue systems. Based on our results we encourage using data balancing prior to training for text classification tasks.
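The abstract describes a policy that augments some classes and under-samples others. As a rough illustration only, the sketch below assumes the simplest such policy, a single fixed per-class target size; it is not the policy defined in the paper, and the names `balance_plan`, `under_sample`, and `target` are hypothetical. The text generation step is left out because the abstract does not specify the generator.

```python
from collections import Counter
import random


def balance_plan(labels, target):
    """Map each class to (examples to keep, examples to generate).

    Assumption: one fixed target size for every class. Classes above
    the target are under-sampled; classes below it get a generation
    quota equal to their deficit.
    """
    counts = Counter(labels)
    return {
        cls: (min(n, target), max(target - n, 0))
        for cls, n in counts.items()
    }


def under_sample(examples, labels, target, seed=0):
    """Randomly trim every class that exceeds the target size."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(examples, labels):
        by_class.setdefault(label, []).append(text)

    balanced = []
    for label, items in by_class.items():
        if len(items) > target:
            items = rng.sample(items, target)  # sample without replacement
        balanced.extend((text, label) for text in items)
    return balanced


if __name__ == "__main__":
    labels = ["a"] * 5 + ["b"] * 2 + ["c"]
    examples = [f"utterance {i}" for i in range(len(labels))]
    print(balance_plan(labels, target=3))
    print(len(under_sample(examples, labels, target=3)))
```

In this sketch the generation quota for each class would be handed to whatever text generator is in use; choosing the target per class, rather than globally, is one way such a policy could be made more flexible.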
Kevin Gu, Eva Tuecke, et al.
ICML 2024
Gabriele Picco, Lam Thanh Hoang, et al.
EMNLP 2021
Kshitij P. Fadnis, Nathaniel Mills, et al.
EMNLP 2020