-
Instance Optimal Join Size Estimation
Authors:
Mahmoud Abo-Khamis,
Sung** Im,
Benjamin Moseley,
Kirk Pruhs,
Alireza Samadian
Abstract:
We consider the problem of efficiently estimating the size of the inner join of a collection of preprocessed relational tables from the perspective of instance optimality analysis. The run time of instance optimal algorithms is comparable to the minimum time needed to verify the correctness of a solution. Previously instance optimal algorithms were only known when the size of the join was small (a…
▽ More
We consider the problem of efficiently estimating the size of the inner join of a collection of preprocessed relational tables from the perspective of instance optimality analysis. The run time of instance optimal algorithms is comparable to the minimum time needed to verify the correctness of a solution. Previously instance optimal algorithms were only known when the size of the join was small (as one component of their run time that was linear in the join size). We give an instance optimal algorithm for estimating the join size for all instances, including when the join size is large, by removing the dependency on the join size. As a byproduct, we show how to sample rows from the join uniformly at random in a comparable amount of time.
△ Less
Submitted 14 December, 2020;
originally announced December 2020.
-
A Relational Gradient Descent Algorithm For Support Vector Machine Training
Authors:
Mahmoud Abo-Khamis,
Sung** Im,
Benjamin Moseley,
Kirk Pruhs,
Alireza Samadian
Abstract:
We consider gradient descent like algorithms for Support Vector Machine (SVM) training when the data is in relational form. The gradient of the SVM objective can not be efficiently computed by known techniques as it suffers from the ``subtraction problem''. We first show that the subtraction problem can not be surmounted by showing that computing any constant approximation of the gradient of the S…
▽ More
We consider gradient descent like algorithms for Support Vector Machine (SVM) training when the data is in relational form. The gradient of the SVM objective can not be efficiently computed by known techniques as it suffers from the ``subtraction problem''. We first show that the subtraction problem can not be surmounted by showing that computing any constant approximation of the gradient of the SVM objective function is $\#P$-hard, even for acyclic joins. We, however, circumvent the subtraction problem by restricting our attention to stable instances, which intuitively are instances where a nearly optimal solution remains nearly optimal if the points are perturbed slightly. We give an efficient algorithm that computes a ``pseudo-gradient'' that guarantees convergence for stable instances at a rate comparable to that achieved by using the actual gradient. We believe that our results suggest that this sort of stability the analysis would likely yield useful insight in the context of designing algorithms on relational data for other learning problems in which the subtraction problem arises.
△ Less
Submitted 11 May, 2020;
originally announced May 2020.
-
Approximate Aggregate Queries Under Additive Inequalities
Authors:
Mahmoud Abo-Khamis,
Sung** Im,
Benjamin Moseley,
Kirk Pruhs,
Alireza Samadian
Abstract:
We consider the problem of evaluating certain types of functional aggregation queries on relational data subject to additive inequalities. Such aggregation queries, with a smallish number of additive inequalities, arise naturally/commonly in many applications, particularly in learning applications. We give a relatively complete categorization of the computational complexity of such problems. We fi…
▽ More
We consider the problem of evaluating certain types of functional aggregation queries on relational data subject to additive inequalities. Such aggregation queries, with a smallish number of additive inequalities, arise naturally/commonly in many applications, particularly in learning applications. We give a relatively complete categorization of the computational complexity of such problems. We first show that the problem is NP-hard, even in the case of one additive inequality. Thus we turn to approximating the query. Our main result is an efficient algorithm for approximating, with arbitrarily small relative error, many natural aggregation queries with one additive inequality. We give examples of natural queries that can be efficiently solved using this algorithm. In contrast, we show that the situation with two additive inequalities is quite different, by showing that it is NP-hard to evaluate simple aggregation queries, with two additive inequalities, with any bounded relative error.
△ Less
Submitted 30 April, 2020; v1 submitted 23 March, 2020;
originally announced March 2020.
-
Learning Models over Relational Data: A Brief Tutorial
Authors:
Maximilian Schleich,
Dan Olteanu,
Mahmoud Abo-Khamis,
Hung Q. Ngo,
XuanLong Nguyen
Abstract:
This tutorial overviews the state of the art in learning models over relational databases and makes the case for a first-principles approach that exploits recent developments in database research.
The input to learning classification and regression models is a training dataset defined by feature extraction queries over relational databases. The mainstream approach to learning over relational dat…
▽ More
This tutorial overviews the state of the art in learning models over relational databases and makes the case for a first-principles approach that exploits recent developments in database research.
The input to learning classification and regression models is a training dataset defined by feature extraction queries over relational databases. The mainstream approach to learning over relational data is to materialize the training dataset, export it out of the database, and then learn over it using a statistical package. This approach can be expensive as it requires the materialization of the training dataset. An alternative approach is to cast the machine learning problem as a database problem by transforming the data-intensive component of the learning task into a batch of aggregates over the feature extraction query and by computing this batch directly over the input database.
The tutorial highlights a variety of techniques developed by the database theory and systems communities to improve the performance of the learning task. They rely on structural properties of the relational data and of the feature extraction query, including algebraic (semi-ring), combinatorial (hypertree width), statistical (sampling), or geometric (distance) structure. They also rely on factorized computation, code specialization, query compilation, and parallelization.
△ Less
Submitted 15 November, 2019;
originally announced November 2019.