Preksha Nema, Mitesh M. Khapra, et al.
ACL 2017
We propose an unsupervised approach for substring-based transliteration which incorporates two new sources of knowledge in the learning process: (i) context, by learning substring mappings as opposed to single-character mappings, and (ii) phonetic features, which capture cross-lingual character similarity via prior distributions. Our approach is a two-stage, iterative bootstrapping solution that vastly outperforms the state-of-the-art unsupervised transliteration method of Ravi and Knight (2009) and outperforms a rule-based baseline by up to 50% in top-1 accuracy on multiple language pairs. We show that substring-based models are superior to character-based models, and observe that their top-10 accuracy is comparable to the top-1 accuracy of supervised systems. Our method requires only a phonemic representation of the words. This is possible for many language-script combinations that have a high grapheme-to-phoneme correspondence, e.g., scripts of Indian languages derived from the Brahmi script. Hence, Indian languages were the focus of our experiments. For other languages, a grapheme-to-phoneme converter would be required.
Ramesh Nallapati, Bowen Zhou, et al.
CoNLL 2016
Abhijit Mishra, Diptesh Kanojia, et al.
ACL 2016
Sarath Chandar, Mitesh M. Khapra, et al.
Neural Computation