Clio grows up: From research prototype to industrial tool
Laura M. Haas, Mauricio A. Hernández, et al.
SIGMOD 2005
70% of the time spent on data analytics is not actually spent on analytics but on data wrangling: the process of finding, interpreting, extracting, preparing, and recombining the data to be analyzed. For data collected as free-form text, the lack of standards, or the presence of competing standards, often results in a variety of formats for expressing the same type of data, making the data wrangling step tedious and error-prone. For example, US street addresses may be expressed with a house number, PO Box, rural or military route, and/or a direction, all of which can be abbreviated or spelled out in a variety of ways. In this paper, we present an algorithm that uses machine learning to efficiently and automatically identify categories of attributes, such as geo-spatial, that are present in a data file, and we discuss results on a variety of real data sets. Our implementation can be used to automatically prepare data for consumption by other tools and services, such as mapping and visualization tools, and is motivated by and in support of a customizable severe weather alerting service.
Eser Kandogan, Mary Roth, et al.
Big Data 2015
Mary Roth, Mauricio A. Hernández, et al.
IBM Systems Journal
Eser Kandogan, Mary Roth, et al.
ICDEW 2018