-
Bilevel Relations and Their Applications to Data Insights
Authors:
Xi Wu,
Xiangyao Yu,
Shaleen Deep,
Ahmed Mahmood,
Uyeong Jang,
Stratis Viglas,
Somesh Jha,
John Cieslewicz,
Jeffrey F. Naughton
Abstract:
Many data-insight analytic tasks in anomaly detection, metric attribution, and experimentation analysis can be modeled as searching in a large space of tables and finding important ones, where the notion of importance is defined in some adhoc manner. While various frameworks have been proposed (e.g., DIFF, VLDB 2019), a systematic and general treatment is lacking. This paper describes bilevel rela…
▽ More
Many data-insight analytic tasks in anomaly detection, metric attribution, and experimentation analysis can be modeled as searching in a large space of tables and finding important ones, where the notion of importance is defined in some adhoc manner. While various frameworks have been proposed (e.g., DIFF, VLDB 2019), a systematic and general treatment is lacking. This paper describes bilevel relations and operators. While a relation (i.e., table) models a set of tuples, a bilevel relation is a dictionary that explicitly models a set of tables, where each ``value'' table is identified by a ``key'' of a (region, features) pair, where region specifies key attributes of the table, and features specify columns of the table. Bilevel relational operators are BilevelRelation-to-BilevelRelation transformations and directly analyze a set of tables. Bilevel relations and operators provide higher level abstractions for creating and manipulating a set of tables, and are compatible with the classic relational algebra. Together, they allow us to construct bilevel queries, which can express succinctly a range of insight-analytical questions with ``search+eval'' character. We have implemented and deployed a query engine for bilevel queries as a service, which is a first of its kind. Bilevel queries pose a rich algorithm and system design space, such as query optimization and data format, in order to evaluate them efficiently. We describe our current designs and lessons, and report empirical evaluations. Bilevel queries have found many useful applications, and have attracted more than 30 internal teams to build data-insight applications with it.
△ Less
Submitted 8 November, 2023;
originally announced November 2023.
-
Holistic Cube Analysis: A Query Framework for Data Insights
Authors:
Xi Wu,
Shaleen Deep,
Joe Benassi,
Fengan Li,
Yaqi Zhang,
Uyeong Jang,
James Foster,
Stella Kim,
Yu**g Sun,
Long Nguyen,
Stratis Viglas,
Somesh Jha,
John Cieslewicz,
Jeffrey F. Naughton
Abstract:
Many data insight questions can be viewed as searching in a large space of tables and finding important ones, where the notion of importance is defined in some adhoc user defined manner. This paper presents Holistic Cube Analysis (HoCA), a framework that augments the capabilities of relational queries for such problems. HoCA first augments the relational data model and introduces a new data type A…
▽ More
Many data insight questions can be viewed as searching in a large space of tables and finding important ones, where the notion of importance is defined in some adhoc user defined manner. This paper presents Holistic Cube Analysis (HoCA), a framework that augments the capabilities of relational queries for such problems. HoCA first augments the relational data model and introduces a new data type AbstractCube, defined as a function which maps a region-features pair to a relational table (a region is a tuple which specifies values of a set of dimensions). AbstractCube provides a logical form of data, and HoCA operators are cube-to-cube transformations. We describe two basic but fundamental HoCA operators, cube crawling and cube join (with many possible extensions). Cube crawling explores a region space, and outputs a cube that maps regions to signal vectors. Cube join, in turn, is critical for composition, allowing one to join information from different cubes for deeper analysis. Cube crawling introduces two novel programming features, (programmable) Region Analysis Models (RAMs) and Multi-Model Crawling. Crucially, RAM has a notion of population features, which allows one to go beyond only analyzing local features at a region, and program region-population analysis that compares region and population features, capturing a large class of importance notions. HoCA has a rich algorithmic space, such as optimizing crawling and join performance, and physical design of cubes. We have implemented and deployed HoCA at Google. Our early HoCA offering has attracted more than 30 teams building applications with it, across a diverse spectrum of fields including system monitoring, experimentation analysis, and business intelligence. For many applications, HoCA empowers novel and powerful analyses, such as instances of recurrent crawling, which are challenging to achieve otherwise.
△ Less
Submitted 1 July, 2023; v1 submitted 31 January, 2023;
originally announced February 2023.
-
Tuple-oriented Compression for Large-scale Mini-batch Stochastic Gradient Descent
Authors:
Fengan Li,
Lingjiao Chen,
Yi**g Zeng,
Arun Kumar,
Jeffrey F. Naughton,
Jignesh M. Patel,
Xi Wu
Abstract:
Data compression is a popular technique for improving the efficiency of data processing workloads such as SQL queries and more recently, machine learning (ML) with classical batch gradient methods. But the efficacy of such ideas for mini-batch stochastic gradient descent (MGD), arguably the workhorse algorithm of modern ML, is an open question. MGD's unique data access pattern renders prior art, i…
▽ More
Data compression is a popular technique for improving the efficiency of data processing workloads such as SQL queries and more recently, machine learning (ML) with classical batch gradient methods. But the efficacy of such ideas for mini-batch stochastic gradient descent (MGD), arguably the workhorse algorithm of modern ML, is an open question. MGD's unique data access pattern renders prior art, including those designed for batch gradient methods, less effective. We fill this crucial research gap by proposing a new lossless compression scheme we call tuple-oriented compression (TOC) that is inspired by an unlikely source, the string/text compression scheme Lempel-Ziv-Welch, but tailored to MGD in a way that preserves tuple boundaries within mini-batches. We then present a suite of novel compressed matrix operation execution techniques tailored to the TOC compression scheme that operate directly over the compressed data representation and avoid decompression overheads. An extensive empirical evaluation with real-world datasets shows that TOC consistently achieves substantial compression ratios by up to 51x and reduces runtimes for MGD workloads by up to 10.2x in popular ML systems.
△ Less
Submitted 20 January, 2019; v1 submitted 22 February, 2017;
originally announced February 2017.
-
Bolt-on Differential Privacy for Scalable Stochastic Gradient Descent-based Analytics
Authors:
Xi Wu,
Fengan Li,
Arun Kumar,
Kamalika Chaudhuri,
Somesh Jha,
Jeffrey F. Naughton
Abstract:
While significant progress has been made separately on analytics systems for scalable stochastic gradient descent (SGD) and private SGD, none of the major scalable analytics frameworks have incorporated differentially private SGD. There are two inter-related issues for this disconnect between research and practice: (1) low model accuracy due to added noise to guarantee privacy, and (2) high develo…
▽ More
While significant progress has been made separately on analytics systems for scalable stochastic gradient descent (SGD) and private SGD, none of the major scalable analytics frameworks have incorporated differentially private SGD. There are two inter-related issues for this disconnect between research and practice: (1) low model accuracy due to added noise to guarantee privacy, and (2) high development and runtime overhead of the private algorithms. This paper takes a first step to remedy this disconnect and proposes a private SGD algorithm to address \emph{both} issues in an integrated manner. In contrast to the white-box approach adopted by previous work, we revisit and use the classical technique of {\em output perturbation} to devise a novel "bolt-on" approach to private SGD. While our approach trivially addresses (2), it makes (1) even more challenging. We address this challenge by providing a novel analysis of the $L_2$-sensitivity of SGD, which allows, under the same privacy guarantees, better convergence of SGD when only a constant number of passes can be made over the data. We integrate our algorithm, as well as other state-of-the-art differentially private SGD, into Bismarck, a popular scalable SGD-based analytics system on top of an RDBMS. Extensive experiments show that our algorithm can be easily integrated, incurs virtually no overhead, scales well, and most importantly, yields substantially better (up to 4X) test accuracy than the state-of-the-art algorithms on many real datasets.
△ Less
Submitted 23 March, 2017; v1 submitted 15 June, 2016;
originally announced June 2016.
-
Sampling-Based Query Re-Optimization
Authors:
Wentao Wu,
Jeffrey F. Naughton,
Harneet Singh
Abstract:
Despite of decades of work, query optimizers still make mistakes on "difficult" queries because of bad cardinality estimates, often due to the interaction of multiple predicates and correlations in the data. In this paper, we propose a low-cost post-processing step that can take a plan produced by the optimizer, detect when it is likely to have made such a mistake, and take steps to fix it. Specif…
▽ More
Despite of decades of work, query optimizers still make mistakes on "difficult" queries because of bad cardinality estimates, often due to the interaction of multiple predicates and correlations in the data. In this paper, we propose a low-cost post-processing step that can take a plan produced by the optimizer, detect when it is likely to have made such a mistake, and take steps to fix it. Specifically, our solution is a sampling-based iterative procedure that requires almost no changes to the original query optimizer or query evaluation mechanism of the system. We show that this indeed imposes low overhead and catches cases where three widely used optimizers (PostgreSQL and two commercial systems) make large errors.
△ Less
Submitted 21 January, 2016;
originally announced January 2016.
-
Revisiting Differentially Private Regression: Lessons From Learning Theory and their Consequences
Authors:
Xi Wu,
Matthew Fredrikson,
Wentao Wu,
Somesh Jha,
Jeffrey F. Naughton
Abstract:
Private regression has received attention from both database and security communities. Recent work by Fredrikson et al. (USENIX Security 2014) analyzed the functional mechanism (Zhang et al. VLDB 2012) for training linear regression models over medical data. Unfortunately, they found that model accuracy is already unacceptable with differential privacy when $\varepsilon = 5$. We address this issue…
▽ More
Private regression has received attention from both database and security communities. Recent work by Fredrikson et al. (USENIX Security 2014) analyzed the functional mechanism (Zhang et al. VLDB 2012) for training linear regression models over medical data. Unfortunately, they found that model accuracy is already unacceptable with differential privacy when $\varepsilon = 5$. We address this issue, presenting an explicit connection between differential privacy and stable learning theory through which a substantially better privacy/utility tradeoff can be obtained. Perhaps more importantly, our theory reveals that the most basic mechanism in differential privacy, output perturbation, can be used to obtain a better tradeoff for all convex-Lipschitz-bounded learning tasks. Since output perturbation is simple to implement, it means that our approach is potentially widely applicable in practice. We go on to apply it on the same medical data as used by Fredrikson et al. Encouragingly, we achieve accurate models even for $\varepsilon = 0.1$. In the last part of this paper, we study the impact of our improved differentially private mechanisms on model inversion attacks, a privacy attack introduced by Fredrikson et al. We observe that the improved tradeoff makes the resulting differentially private model more susceptible to inversion attacks. We analyze this phenomenon formally.
△ Less
Submitted 20 December, 2015;
originally announced December 2015.
-
Uncertainty Aware Query Execution Time Prediction
Authors:
Wentao Wu,
Xi Wu,
Hakan Hacıgümüş,
Jeffrey F. Naughton
Abstract:
Predicting query execution time is a fundamental issue underlying many database management tasks. Existing predictors rely on information such as cardinality estimates and system performance constants that are difficult to know exactly. As a result, accurate prediction still remains elusive for many queries. However, existing predictors provide a single, point estimate of the true execution time,…
▽ More
Predicting query execution time is a fundamental issue underlying many database management tasks. Existing predictors rely on information such as cardinality estimates and system performance constants that are difficult to know exactly. As a result, accurate prediction still remains elusive for many queries. However, existing predictors provide a single, point estimate of the true execution time, but fail to characterize the uncertainty in the prediction. In this paper, we take a first step towards providing uncertainty information along with query execution time predictions. We use the query optimizer's cost model to represent the query execution time as a function of the selectivities of operators in the query plan as well as the constants that describe the cost of CPU and I/O operations in the system. By treating these quantities as random variables rather than constants, we show that with low overhead we can infer the distribution of likely prediction errors. We further show that the estimated prediction errors by our proposed techniques are strongly correlated with the actual prediction errors.
△ Less
Submitted 27 August, 2014;
originally announced August 2014.
-
On Load Shedding in Complex Event Processing
Authors:
Yeye He,
Siddharth Barman,
Jeffrey F. Naughton
Abstract:
Complex Event Processing (CEP) is a stream processing model that focuses on detecting event patterns in continuous event streams. While the CEP model has gained popularity in the research communities and commercial technologies, the problem of gracefully degrading performance under heavy load in the presence of resource constraints, or load shedding, has been largely overlooked. CEP is similar to…
▽ More
Complex Event Processing (CEP) is a stream processing model that focuses on detecting event patterns in continuous event streams. While the CEP model has gained popularity in the research communities and commercial technologies, the problem of gracefully degrading performance under heavy load in the presence of resource constraints, or load shedding, has been largely overlooked. CEP is similar to "classical" stream data management, but addresses a substantially different class of queries. This unfortunately renders the load shedding algorithms developed for stream data processing inapplicable. In this paper we study CEP load shedding under various resource constraints. We formalize broad classes of CEP load-shedding scenarios as different optimization problems. We demonstrate an array of complexity results that reveal the hardness of these problems and construct shedding algorithms with performance guarantees. Our results shed some light on the difficulty of develo** load-shedding algorithms that maximize utility.
△ Less
Submitted 16 December, 2013;
originally announced December 2013.