Semantic tokenization of verbalized numbers in language modeling

Xiaoqiang Luo; Martin Franz

ICSLP 2000

Conference paper

16 Oct 2000


Abstract

In spoken dialog systems, number strings frequently carry crucial information such as DATE, TIME, and PRICE. Yet numbers are inherently difficult to recognize, partly because reliable statistics for training a language model are hard to obtain. In this paper, we take advantage of the fact that dialog systems perform some form of semantic parsing. We use this parsing information to distinguish between occurrences of number expressions in different semantic roles, for example between the word "one" in "one o'clock", "sunday june one", and "another one", in order to improve the performance of the language model and thus reduce the error rate. We process number expressions with the same spelling but different semantics as separate language model tokens. We have tested this approach in a speech recognition system used as part of a dialog system for the Air Travel domain. In a controlled experiment, the proposed technique yields a healthy 9.75% relative (overall) word error reduction on a test set of 689 sentences collected using a live telephony Air Travel system.
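The idea of treating same-spelling number words as separate language model tokens can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not describe the parser output or the token format, so the per-word labels, the `word_LABEL` naming, the `NUMBER_WORDS` set, and the `semantic_tokenize` helper are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical list of number words; a real system would use a fuller inventory.
NUMBER_WORDS = {"one", "two", "three", "four", "five"}


def semantic_tokenize(words, labels, number_words=NUMBER_WORDS):
    """Rewrite number words as role-specific tokens.

    `labels` is assumed to hold one semantic label per word (e.g. "TIME",
    "DATE") or None when the parser assigns no number role.
    """
    tokens = []
    for word, label in zip(words, labels):
        if word in number_words and label is not None:
            # Same spelling, different semantics -> different LM token.
            tokens.append(f"{word}_{label}")
        else:
            tokens.append(word)
    return tokens


# The three contexts for "one" mentioned in the abstract, with assumed labels.
corpus = [
    (["one", "o'clock"], ["TIME", None]),
    (["sunday", "june", "one"], [None, None, "DATE"]),
    (["another", "one"], [None, None]),
]

tokenized = [semantic_tokenize(words, labels) for words, labels in corpus]

# Counts are now collected separately for each semantic role of "one".
unigram_counts = Counter(tok for sentence in tokenized for tok in sentence)
```

With this tokenization, n-gram statistics for "one" as a time, as a date, and as a plain pronoun no longer share a single vocabulary entry, which is the effect the abstract describes.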
