Conference paper

Voice Activity-based Text Segmentation for ASR Text Denormalization

Abstract

We introduce a novel technique for text capitalization and punctuation recovery (CP) systems that learns from voice activity cues to enhance the readability of end-to-end (E2E) automatic speech recognition (ASR) output. E2E ASR systems commonly produce uncapitalized text with no punctuation marks. In such cases, CP systems are introduced as external modules to denormalize the ASR output; however, they suffer from performance degradation due to the mismatch between the text segmentation produced by ASR and that used to construct the CP system. ASR systems generally decode input speech segments determined by a voice activity detection (VAD) algorithm, while CP systems are often trained on grammatically well-segmented full-sentence texts. To reduce this gap, we construct the CP system using pseudo VAD-segmented texts produced by a text segmentation model designed around voice activity cues. Our method reduces false predictions by 4.5%-18.9% compared to the baseline while appropriately formatting the ASR texts.
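To make the segmentation mismatch concrete, the following is a minimal illustrative sketch (not the paper's implementation): it splits a word sequence into pseudo VAD-style segments by cutting wherever a hypothetical inter-word pause exceeds a threshold, mimicking how a VAD algorithm segments speech without regard to sentence boundaries. The function name, pause values, and threshold are assumptions for illustration only.

```python
def pseudo_vad_segments(words, pauses, threshold=0.3):
    """Split words into pseudo VAD-style segments.

    words: list of tokens from sentence-segmented text.
    pauses: hypothetical silence duration (seconds) after each word.
    threshold: pause length at which a VAD would cut a segment.
    Returns lowercase, punctuation-free segments, as raw ASR output
    would appear before capitalization/punctuation recovery.
    """
    segments, current = [], []
    for word, pause in zip(words, pauses):
        # Normalize the token the way uncased, unpunctuated ASR text looks.
        current.append(word.lower().strip(".,!?"))
        if pause > threshold:  # Long pause: a VAD would end the segment here.
            segments.append(" ".join(current))
            current = []
    if current:
        segments.append(" ".join(current))
    return segments
```

A CP model trained on such segments sees inputs whose boundaries fall mid-sentence or span sentence breaks, matching the conditions it faces at ASR inference time.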