Ali Anwar, Yue Cheng, et al.
HotStorage 2016
Due to its speed and ease of use, Spark has become a popular tool among data scientists for analyzing data of various sizes. Counter-intuitively, data processing workloads at companies such as Google, Facebook, and Yahoo are dominated by short-running applications, because most applications consist largely of simple SQL-like queries (Dean, 2004; Zaharia et al., 2008). Unfortunately, the current version of Spark is not optimized for such workloads. In this paper, we propose a novel framework, called Meteor, which can dramatically improve the performance of short-running applications. We extend Spark with three additional operating modes: one-thread, one-container, and distributed. The one-thread mode executes all tasks on a single thread; the one-container mode runs the tasks in one container using multi-threading; the distributed mode spreads all tasks across the whole cluster. We also design a new application submission framework, which uses a fine-grained Spark performance model to decide which of the three modes is most efficient for each newly submitted application. Our extensive experiments on Amazon EC2 show that the one-thread mode is the optimal choice when the input size is small; otherwise, the distributed mode is better. Overall, Meteor is up to 2 times faster than the original Spark for short applications.
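The mode selection described in the abstract can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not Meteor's actual performance model: the `ClusterProfile` fields, the startup and throughput numbers, and the `estimate_seconds` and `choose_mode` functions are all hypothetical names and values introduced here for illustration. The idea it shows is that each mode trades a fixed startup cost against available parallelism, and the submitter picks the mode with the lowest estimated runtime.

```python
from dataclasses import dataclass

# Hypothetical mode names mirroring the three modes named in the abstract.
MODES = ("one-thread", "one-container", "distributed")


@dataclass(frozen=True)
class ClusterProfile:
    # All values are illustrative placeholders, not measurements from the paper.
    per_mb_seconds: float = 0.05    # compute time per MB on a single core
    thread_startup: float = 0.1     # cost of starting one in-process thread
    container_startup: float = 2.0  # cost of launching one container
    cluster_startup: float = 6.0    # cost of allocating containers cluster-wide
    cores_per_container: int = 4
    containers: int = 8


def estimate_seconds(mode: str, input_mb: float, profile: ClusterProfile) -> float:
    """Estimate runtime as fixed startup cost plus work divided by parallelism."""
    work = input_mb * profile.per_mb_seconds
    if mode == "one-thread":
        return profile.thread_startup + work
    if mode == "one-container":
        return profile.container_startup + work / profile.cores_per_container
    if mode == "distributed":
        total_cores = profile.cores_per_container * profile.containers
        return profile.cluster_startup + work / total_cores
    raise ValueError(f"unknown mode: {mode}")


def choose_mode(input_mb: float, profile: ClusterProfile = ClusterProfile()) -> str:
    """Return the mode with the lowest estimated runtime for this input size."""
    return min(MODES, key=lambda m: estimate_seconds(m, input_mb, profile))


if __name__ == "__main__":
    for size_mb in (10, 10_000):
        print(size_mb, "MB ->", choose_mode(size_mb))
```

With these placeholder numbers, a small input favors the one-thread mode because startup cost dominates, while a large input favors the distributed mode because parallelism outweighs the allocation overhead. The real crossover points depend on the measured cluster and on the model the paper describes.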
Hong Zhang, Hai Huang, et al.
IPDPS 2017
Jidong Xiao, Lei Lu, et al.
LISA 2015
Hai Huang, Raymond Jennings, et al.
LISA 2007