C. Mohan, Don Haderle, et al.
ACM Transactions on Database Systems (TODS)
Detection of string and column delimiters is a critical first step in the automated ingestion of files containing tabular data. In this paper we present an algorithm that uses a logistic-regression classifier to evaluate whether a particular choice of delimiters is correct. The candidate delimiter choice that receives the highest classifier score is selected as the one most likely to be correct. The algorithm makes the correct choice over 90% of the time on a test data set of files with a variety of different delimiters.
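The score-and-select idea described in the abstract can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the candidate delimiter sets, the two features (row-width consistency and whether the file splits into more than one column), and the hand-picked weights, which stand in for coefficients that a trained logistic-regression model would supply.

```python
import csv
import io
import math
from itertools import product

# Hypothetical candidate sets; the paper's actual candidates are not listed in the abstract.
COLUMN_DELIMS = [",", ";", "\t", "|"]
STRING_DELIMS = ['"', "'"]

# Illustrative, hand-picked weights standing in for trained model coefficients.
WEIGHTS = {"bias": -4.0, "consistency": 3.0, "multi_column": 3.0}


def features(text, col, quote):
    """Parse the text with one candidate pair and compute two assumed features."""
    reader = csv.reader(io.StringIO(text), delimiter=col, quotechar=quote)
    widths = [len(row) for row in reader if row]
    if not widths:
        return {"consistency": 0.0, "multi_column": 0.0}
    mode = max(set(widths), key=widths.count)  # most common row width
    return {
        # Fraction of rows that have the most common width.
        "consistency": widths.count(mode) / len(widths),
        # 1.0 if the candidate actually splits rows into several columns.
        "multi_column": 1.0 if mode > 1 else 0.0,
    }


def score(feats):
    """Logistic function applied to a weighted sum of the features."""
    z = WEIGHTS["bias"] + sum(WEIGHTS[name] * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))


def detect_delimiters(text):
    """Return the (column delimiter, string delimiter) pair with the highest score."""
    candidates = product(COLUMN_DELIMS, STRING_DELIMS)
    return max(candidates, key=lambda pair: score(features(text, *pair)))


sample = 'name;note\n"a;b";1\n"c";2\n'
best_column_delim, best_string_delim = detect_delimiters(sample)
```

In this sketch the semicolon and double-quote pair scores highest on `sample` because it is the only candidate that yields the same multi-column width on every row. A real classifier would learn its weights from labeled files and would likely use a richer feature set.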
Yan Yao, Jian Cao, Srikumar Venugopal, et al.
HPCC/SmartCity/DSS 2016
Joshua Hui, Sarah Knoop, et al.
IHI 2012
Shilpi Ahuja, Mary Roth, et al.
ICDMW 2016