
Towards a Workload for Evolutionary Analytics

Authors: Jeff LeFevre, Jagan Sankaranarayanan, Hakan Hacigumus, Junichi Tatemura, Neoklis Polyzotis

Abstract: Emerging data analysis involves the ingestion and exploration of new data sets, application of complex functions, and frequent query revisions based on observing prior query answers. We call this new type of analysis evolutionary analytics and identify its properties. This type of analysis is not well represented by current benchmark workloads. In this paper, we present a workload and identify several metrics to test system support for evolutionary analytics. Along with our metrics, we present methodologies for running the workload that capture this analytical scenario.

Submitted 27 June, 2013; v1 submitted 5 April, 2013; originally announced April 2013.

Comments: 10 pages

Journal ref: DanaC: Workshop on Data analytics in the Cloud, June 2013, New York, NY

arXiv:1303.6609

Exploiting Opportunistic Physical Design in Large-scale Data Analytics

Authors: Jeff LeFevre, Jagan Sankaranarayanan, Hakan Hacigumus, Junichi Tatemura, Neoklis Polyzotis, Michael J. Carey

Abstract: Large-scale systems, such as MapReduce and Hadoop, perform aggressive materialization of intermediate job results in order to support fault tolerance. When jobs correspond to exploratory queries submitted by data analysts, these materializations yield a large set of materialized views that typically capture common computation among successive queries from the same analyst, or even across queries of different analysts who test similar hypotheses. We propose to treat these views as an opportunistic physical design and use them for the purpose of query optimization. We develop a novel query-rewrite algorithm that addresses the two main challenges in this context: how to search the large space of rewrites, and how to reason about views that contain UDFs (a common feature in large-scale data analytics). The algorithm, which provably finds the minimum-cost rewrite, is inspired by nearest-neighbor searches in non-metric spaces. We present an extensive experimental study on real-world datasets with a prototype data-analytics system based on Hive. The results demonstrate that our approach can result in dramatic performance improvements on complex data-analysis queries, reducing total execution time by an average of 61% and up to two orders of magnitude.

Submitted 10 December, 2013; v1 submitted 26 March, 2013; originally announced March 2013.

Comments: 15 pages

arXiv:1201.0226

Towards Cost-Effective Storage Provisioning for DBMSs

Authors: Ning Zhang, Junichi Tatemura, Jignesh M. Patel, Hakan Hacıgümüş

Abstract: Data center operators face a bewildering set of choices when considering how to provision resources on machines with complex I/O subsystems. Modern I/O subsystems often have a rich mix of fast, high performing, but expensive SSDs sitting alongside with cheaper but relatively slower (for random accesses) traditional hard disk drives. The data center operators need to determine how to provision the I/O resources for specific workloads so as to abide by existing Service Level Agreements (SLAs), while minimizing the total operating cost (TOC) of running the workload, where the TOC includes the amortized hardware costs and the run time energy costs. The focus of this paper is on introducing this new problem of TOC-based storage allocation, cast in a framework that is compatible with traditional DBMS query optimization and query processing architecture. We also present a heuristic-based solution to this problem, called DOT. We have implemented DOT in PostgreSQL, and experiments using TPC-H and TPC-C demonstrate significant TOC reduction by DOT in various settings.

Submitted 31 December, 2011; originally announced January 2012.

Comments: VLDB2012

Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 4, pp. 274-285 (2011)

Showing 1–3 of 3 results for author: Tatemura, J