CodeGenWrangler: Data Wrangling task automation using Code-Generating Models

Akella Ashlesha; Abhijit Manatkar; Krishnasuri Narayanam; Sameep Mehta

NAACL 2025

Conference paper

29 Apr 2025


Abstract

Ensuring the quality of tabular datasets is essential for the efficiency of diverse downstream tabular tasks (such as summarization and fact-checking). Data-wrangling tasks address the challenges of structured data processing and thereby improve the quality of tabular data. Traditional statistical methods handle numeric data efficiently but often fail to capture the semantic context of textual data in tables. Deep learning approaches are resource-intensive and require task- and dataset-specific training. To address these shortcomings, we present an automated system that leverages LLMs to generate executable code for data-wrangling tasks such as missing value imputation, error detection, and error correction. Our system aims to identify inherent patterns in the data while leveraging external knowledge, effectively addressing both memory-independent and memory-dependent tasks.
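
As an illustration only, the following minimal sketch shows the kind of executable code a code-generating model might produce for missing value imputation when the fill value can be inferred from patterns already present in the table. It is not the authors' system or output: the function name impute_from_pattern, the column names, and the sample rows are all hypothetical, and the majority-vote rule is an assumption chosen for simplicity.

```python
from collections import Counter, defaultdict


def impute_from_pattern(rows, key_col, target_col):
    """Hypothetical example of generated wrangling code.

    Fills missing target_col values with the most common value observed
    for the same key_col value elsewhere in the table. Rows whose key has
    no observed target value are left unchanged.
    """
    # Count which target values co-occur with each key value.
    observed = defaultdict(Counter)
    for row in rows:
        if row.get(target_col) is not None:
            observed[row.get(key_col)][row[target_col]] += 1

    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input table is not mutated
        candidates = observed.get(new_row.get(key_col))
        if new_row.get(target_col) is None and candidates:
            new_row[target_col] = candidates.most_common(1)[0][0]
        filled.append(new_row)
    return filled


# Hypothetical sample table with two missing "country" cells.
rows = [
    {"city": "Paris", "country": "France"},
    {"city": "Paris", "country": None},
    {"city": "Pune", "country": "India"},
    {"city": "Oslo", "country": None},
]
result = impute_from_pattern(rows, "city", "country")
```

In this sketch the second row can be filled from the first because the table itself contains the pattern. The "Oslo" row stays empty, since no other row supplies a value; resolving it would require knowledge from outside the table.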
