-
Scaling-Up Reasoning and Advanced Analytics on BigData
Authors:
Tyson Condie,
Ariyam Das,
Matteo Interlandi,
Alexander Shkapsky,
Mohan Yang,
Carlo Zaniolo
Abstract:
BigDatalog is an extension of Datalog that achieves performance and scalability on both Apache Spark and multicore systems to the point that its graph analytics outperform those written in GraphX. Looking back, we see how this realizes the ambitious goal pursued by deductive database researchers beginning forty years ago: this is the goal of combining the rigor and power of logic in expressing que…
▽ More
BigDatalog is an extension of Datalog that achieves performance and scalability on both Apache Spark and multicore systems to the point that its graph analytics outperform those written in GraphX. Looking back, we see how this realizes the ambitious goal pursued by deductive database researchers beginning forty years ago: this is the goal of combining the rigor and power of logic in expressing queries and reasoning with the performance and scalability by which relational databases managed Big Data. This goal led to Datalog which is based on Horn Clauses like Prolog but employs implementation techniques, such as Semi-naive Fixpoint and Magic Sets, that extend the bottom-up computation model of relational systems, and thus obtain the performance and scalability that relational systems had achieved, as far back as the 80s, using data-parallelization on shared-nothing architectures. But this goal proved difficult to achieve because of major issues at (i) the language level and (ii) at the system level. The paper describes how (i) was addressed by simple rules under which the fixpoint semantics extends to programs using count, sum and extrema in recursion, and (ii) was tamed by parallel compilation techniques that achieve scalability on multicore systems and Apache Spark. This paper is under consideration for acceptance in Theory and Practice of Logic Programming (TPLP).
△ Less
Submitted 9 July, 2018;
originally announced July 2018.
-
Fixpoint Semantics and Optimization of Recursive Datalog Programs with Aggregates
Authors:
Carlo Zaniolo,
Mohan Yang,
Matteo Interlandi,
Ariyam Das,
Alexander Shkapsky,
Tyson Condie
Abstract:
A very desirable Datalog extension investigated by many researchers in the last thirty years consists in allowing the use of the basic SQL aggregates min, max, count and sum in recursive rules. In this paper, we propose a simple comprehensive solution that extends the declarative least-fixpoint semantics of Horn Clauses, along with the optimization techniques used in the bottom-up implementation a…
▽ More
A very desirable Datalog extension investigated by many researchers in the last thirty years consists in allowing the use of the basic SQL aggregates min, max, count and sum in recursive rules. In this paper, we propose a simple comprehensive solution that extends the declarative least-fixpoint semantics of Horn Clauses, along with the optimization techniques used in the bottom-up implementation approach adopted by many Datalog systems. We start by identifying a large class of programs of great practical interest in which the use of min or max in recursive rules does not compromise the declarative fixpoint semantics of the programs using those rules. Then, we revisit the monotonic versions of count and sum aggregates proposed in (Mazuran et al. 2013b) and named, respectively, mcount and msum. Since mcount, and also msum on positive numbers, are monotonic in the lattice of set-containment, they preserve the fixpoint semantics of Horn Clauses. However, in many applications of practical interest, their use can lead to inefficiencies, that can be eliminated by combining them with max, whereby mcount and msum become the standard count and sum. Therefore, the semantics and optimization techniques of Datalog are extended to recursive programs with min, max, count and sum, making possible the advanced applications of superior performance and scalability demonstrated by BigDatalog (Shkapsky et al. 2016) and Datalog-MC (Yang et al. 2017). This paper is under consideration for acceptance in TPLP.
△ Less
Submitted 21 July, 2017; v1 submitted 18 July, 2017;
originally announced July 2017.
-
Pregelix: Big(ger) Graph Analytics on A Dataflow Engine
Authors:
Yingyi Bu,
Vinayak Borkar,
Jianfeng Jia,
Michael J. Carey,
Tyson Condie
Abstract:
There is a growing need for distributed graph processing systems that are capable of gracefully scaling to very large graph datasets. Unfortunately, this challenge has not been easily met due to the intense memory pressure imposed by process-centric, message passing designs that many graph processing systems follow. Pregelix is a new open source distributed graph processing system that is based on…
▽ More
There is a growing need for distributed graph processing systems that are capable of gracefully scaling to very large graph datasets. Unfortunately, this challenge has not been easily met due to the intense memory pressure imposed by process-centric, message passing designs that many graph processing systems follow. Pregelix is a new open source distributed graph processing system that is based on an iterative dataflow design that is better tuned to handle both in-memory and out-of-core workloads. As such, Pregelix offers improved performance characteristics and scaling properties over current open source systems (e.g., we have seen up to 15x speedup compared to Apache Giraph and up to 35x speedup compared to distributed GraphLab), and makes more effective use of available machine resources to support Big(ger) Graph Analytics.
△ Less
Submitted 2 July, 2014;
originally announced July 2014.
-
Iterative MapReduce for Large Scale Machine Learning
Authors:
Joshua Rosen,
Neoklis Polyzotis,
Vinayak Borkar,
Yingyi Bu,
Michael J. Carey,
Markus Weimer,
Tyson Condie,
Raghu Ramakrishnan
Abstract:
Large datasets ("Big Data") are becoming ubiquitous because the potential value in deriving insights from data, across a wide range of business and scientific applications, is increasingly recognized. In particular, machine learning - one of the foundational disciplines for data analysis, summarization and inference - on Big Data has become routine at most organizations that operate large clouds,…
▽ More
Large datasets ("Big Data") are becoming ubiquitous because the potential value in deriving insights from data, across a wide range of business and scientific applications, is increasingly recognized. In particular, machine learning - one of the foundational disciplines for data analysis, summarization and inference - on Big Data has become routine at most organizations that operate large clouds, usually based on systems such as Hadoop that support the MapReduce programming paradigm. It is now widely recognized that while MapReduce is highly scalable, it suffers from a critical weakness for machine learning: it does not support iteration. Consequently, one has to program around this limitation, leading to fragile, inefficient code. Further, reliance on the programmer is inherently flawed in a multi-tenanted cloud environment, since the programmer does not have visibility into the state of the system when his or her program executes. Prior work has sought to address this problem by either develo** specialized systems aimed at stylized applications, or by augmenting MapReduce with ad hoc support for saving state across iterations (driven by an external loop). In this paper, we advocate support for loo** as a first-class construct, and propose an extension of the MapReduce programming paradigm called {\em Iterative MapReduce}. We then develop an optimizer for a class of Iterative MapReduce programs that cover most machine learning techniques, provide theoretical justifications for the key optimization steps, and empirically demonstrate that system-optimized programs for significant machine learning tasks are competitive with state-of-the-art specialized solutions.
△ Less
Submitted 13 March, 2013;
originally announced March 2013.
-
Scaling Datalog for Machine Learning on Big Data
Authors:
Yingyi Bu,
Vinayak Borkar,
Michael J. Carey,
Joshua Rosen,
Neoklis Polyzotis,
Tyson Condie,
Markus Weimer,
Raghu Ramakrishnan
Abstract:
In this paper, we present the case for a declarative foundation for data-intensive machine learning systems. Instead of creating a new system for each specific flavor of machine learning task, or hardcoding new optimizations, we argue for the use of recursive queries to program a variety of machine learning systems. By taking this approach, database query optimization techniques can be utilized to…
▽ More
In this paper, we present the case for a declarative foundation for data-intensive machine learning systems. Instead of creating a new system for each specific flavor of machine learning task, or hardcoding new optimizations, we argue for the use of recursive queries to program a variety of machine learning systems. By taking this approach, database query optimization techniques can be utilized to identify effective execution plans, and the resulting runtime plans can be executed on a single unified data-parallel query processing engine. As a proof of concept, we consider two programming models--Pregel and Iterative Map-Reduce-Update---from the machine learning domain, and show how they can be captured in Datalog, tuned for a specific task, and then compiled into an optimized physical plan. Experiments performed on a large computing cluster with real data demonstrate that this declarative approach can provide very good performance while offering both increased generality and programming ease.
△ Less
Submitted 2 March, 2012; v1 submitted 1 March, 2012;
originally announced March 2012.