Pangea: Monolithic distributed storage for data analytics
Jia Zou, Arun Iyengar, et al.
VLDB 2017
Large-scale optimization has become an important application for data management systems, particularly in the context of statistical machine learning. In this paper, we consider how one might implement the join-and-co-group pattern in the context of a fully declarative data processing system. The join-and-co-group pattern is ubiquitous in iterative, large-scale optimization. In this pattern, a user-defined function g is parameterized with a data object x as well as the subset Θ_x of the statistical model that applies to that object, so that g(x | Θ_x) can be used to compute a partial update of the model. This is repeated for every x in the full data set X. All partial updates are then aggregated and used to perform a complete update of the model. The join-and-co-group pattern poses several implementation challenges, including the potential for a massive blow-up in the size of a fully parameterized model. Thus, unless the correct physical execution plan is chosen for implementing the pattern, an execution can easily take a very long time or even fail to complete. In this paper, we carefully consider the alternatives for implementing the join-and-co-group pattern on top of a declarative system, as well as how the best alternative can be selected automatically. Our focus is on the SimSQL database system, which is an SQL-based system with special facilities for large-scale, iterative optimization. Since it is an SQL-based system with a query optimizer, those choices can be made automatically.
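The pattern described in the abstract can be illustrated with a minimal, single-machine sketch. This is not SimSQL code and not the authors' implementation; the names `join_and_co_group`, `keys_for`, and `sgd_step`, the dictionary-based model, and the squared-loss update are all assumptions chosen only to make the three steps (join each x with Θ_x, apply g, aggregate the partial updates) concrete.

```python
from collections import defaultdict

def join_and_co_group(data, model, keys_for, g):
    """One iteration of the pattern (hypothetical helper, for illustration only).

    data     -- iterable of data objects x
    model    -- dict mapping parameter key -> value (the full model)
    keys_for -- function returning the model keys that apply to x
    g        -- user-defined function g(x, theta_x) -> dict of partial updates
    """
    totals = defaultdict(float)
    for x in data:
        # "Join": pair x with Theta_x, the slice of the model that applies to it.
        theta_x = {k: model[k] for k in keys_for(x)}
        # Apply g(x | Theta_x) and accumulate its partial update per key.
        for key, delta in g(x, theta_x).items():
            totals[key] += delta
    # "Co-group"/aggregate: fold all partial updates into a complete update.
    return {k: v + totals.get(k, 0.0) for k, v in model.items()}

# Assumed example: a sparse linear model trained with a squared-loss step.
LEARNING_RATE = 0.5

def keys_for(x):
    features, _target = x
    return features.keys()

def sgd_step(x, theta_x):
    features, target = x
    prediction = sum(theta_x[k] * v for k, v in features.items())
    error = target - prediction
    return {k: LEARNING_RATE * error * v for k, v in features.items()}

data = [({"a": 1.0, "b": 2.0}, 1.0), ({"b": 1.0}, 0.0)]
model = {"a": 0.0, "b": 0.0, "c": 0.0}
updated = join_and_co_group(data, model, keys_for, sgd_step)
```

In this sketch Θ_x is built one object at a time, so it never exists for all of X at once. A plan that instead materializes every (x, Θ_x) pair before applying g would store one copy of each parameter per object that uses it, which is the kind of blow-up the abstract warns about.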