Feifei Li, Jimeng Sun, et al.
ICDE 2007
Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, starting from pre-processing the raw data to estimating the final models. As data become more abundant, scalable and easy-to-use tools for distributed processing are also emerging. Among those, Map-Reduce has been widely embraced by both academia and industry. In database terms, Map-Reduce is a simple yet powerful execution engine, which can be complemented with other data storage and management components, as necessary. In this paper we describe our experiences and findings in applying Map-Reduce, from raw data to final models, on an important mining task. In particular, we focus on co-clustering, which has been studied in many applications such as text mining, collaborative filtering, bio-informatics, and graph mining. We propose the Distributed Co-clustering (DisCo) framework, which introduces practical approaches for distributed data pre-processing and co-clustering. We develop DisCo using Hadoop, an open-source Map-Reduce implementation. We show that DisCo can scale well and efficiently process and analyze extremely large datasets (up to several hundreds of gigabytes) on commodity hardware.
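To make the abstract's idea concrete, the sketch below shows one way a single row-assignment pass of checkerboard co-clustering can be phrased as a map/reduce step: the map phase assigns each row to its best row cluster and emits partial sums keyed by (row cluster, column cluster), and the reduce phase aggregates them into updated block means. This is only an illustrative, self-contained Python simulation under assumed names (map_row, reduce_group, K, L), not the authors' DisCo/Hadoop implementation.

```python
# Illustrative sketch (not the DisCo code): one co-clustering row-assignment
# pass expressed as map and reduce functions, simulated in plain Python.
from collections import defaultdict

K, L = 2, 2  # assumed numbers of row clusters and column clusters

def map_row(row_id, row, col_labels, group_means):
    """Map phase: assign a row to the best-fitting row cluster, then emit
    per-(row-cluster, column-cluster) partial sums and counts."""
    best_k, best_err = None, float("inf")
    for k in range(K):
        err = sum((v - group_means[k][col_labels[j]]) ** 2 for j, v in enumerate(row))
        if err < best_err:
            best_k, best_err = k, err
    for j, v in enumerate(row):
        yield (best_k, col_labels[j]), (v, 1)  # key -> (partial sum, count)

def reduce_group(key, values):
    """Reduce phase: aggregate partial sums into an updated block mean."""
    s = sum(v for v, _ in values)
    n = sum(c for _, c in values)
    return key, s / n

# Tiny driver simulating one iteration on a 4x4 matrix.
matrix = [[5, 5, 0, 0], [4, 5, 1, 0], [0, 1, 5, 4], [0, 0, 4, 5]]
col_labels = [0, 0, 1, 1]       # fixed column assignments for this pass
group_means = [[5, 0], [0, 5]]  # current block means, shape K x L

shuffled = defaultdict(list)    # stands in for the shuffle/sort between phases
for i, row in enumerate(matrix):
    for key, val in map_row(i, row, col_labels, group_means):
        shuffled[key].append(val)

new_means = dict(reduce_group(k, vs) for k, vs in shuffled.items())
print(new_means)  # e.g. {(0, 0): 4.75, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 4.5}
```

In a real Hadoop job the same logic would live in Mapper and Reducer classes, with the framework handling the shuffle that the defaultdict stands in for here.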
Jimeng Sun, Christos Faloutsos, et al.
KDD 2007
U. Kang, Spiros Papadimitriou, et al.
SDM 2011
Thomas George, Anshul Gupta, et al.
ICDM 2008