Balancing via Generation for Multi-Class Text Classification Improvement

Naama Tepper; Esther Goldbraich; Naama Zwerdling; George Kour; Ateret Anaby-Tavor; Boaz Carmeli

EMNLP 2020

Conference paper

16 Nov 2020

Abstract

Data balancing is a known technique for improving the performance of classification tasks. In this work we define a novel balancing-via-generation framework termed BalaGen. BalaGen consists of a flexible balancing policy coupled with a text generation mechanism. Combined, these two techniques can be used to augment a dataset for more balanced distribution. We evaluate BalaGen on three publicly available semantic utterance classification (SUC) datasets. One of these is a new COVID-19 Q&A dataset published here for the first time. Our work demonstrates that optimal balancing policies can significantly improve classifier performance, while augmenting just part of the classes and under-sampling others. Furthermore, capitalizing on the advantages of balancing, we show its usefulness in all relevant BalaGen framework components. We validate the superiority of BalaGen on ten semantic utterance datasets taken from real-life goal-oriented dialogue systems. Based on our results we encourage using data balancing prior to training for text classification tasks.
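
The idea of augmenting some classes while under-sampling others can be illustrated with a minimal sketch. It is not the BalaGen implementation: the abstract does not specify the balancing policies or the generator, so the single per-class `target`, the `balance` function, and the `placeholder_generate` stand-in (which only repeats existing utterances) are all assumptions made for illustration.

```python
import random
from collections import Counter


def balance(samples, target, generate, seed=0):
    """Return a dataset with exactly `target` examples per class.

    samples:  list of (text, label) pairs.
    target:   hypothetical per-class size chosen by some balancing policy.
    generate: callable(label, texts, n) -> list of n new texts (assumed interface).
    """
    rng = random.Random(seed)

    # Group utterances by class label.
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)

    balanced = []
    for label, texts in by_label.items():
        if len(texts) > target:
            # Over-represented class: under-sample without replacement.
            kept = rng.sample(texts, target)
        else:
            # Under-represented class: top up with generated utterances.
            kept = texts + generate(label, texts, target - len(texts))
        balanced.extend((t, label) for t in kept)
    return balanced


def placeholder_generate(label, texts, n):
    # Stand-in for a real text generator: repeats existing texts with a tag.
    return [f"{texts[i % len(texts)]} [synthetic {i}]" for i in range(n)]


data = [("hi", "greet")] * 6 + [("refund?", "billing")] * 2
out = balance(data, 4, placeholder_generate)
print(Counter(label for _, label in out))
```

In this toy run the six "greet" utterances are cut to four and the two "billing" utterances are topped up to four. In practice the generator would be a trained text generation model, and the target could differ per class.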