-
IsoPredict: Dynamic Predictive Analysis for Detecting Unserializable Behaviors in Weakly Isolated Data Store Applications
Authors:
Chujun Geng,
Spyros Blanas,
Michael D. Bond,
Yang Wang
Abstract:
This paper presents the first dynamic predictive analysis for data store applications under weak isolation levels, called Isopredict. Given an observed serializable execution of a data store application, Isopredict generates and solves SMT constraints to find an unserializable execution that is a feasible execution of the application. Isopredict introduces novel techniques that handle divergent ap…
▽ More
This paper presents the first dynamic predictive analysis for data store applications under weak isolation levels, called Isopredict. Given an observed serializable execution of a data store application, Isopredict generates and solves SMT constraints to find an unserializable execution that is a feasible execution of the application. Isopredict introduces novel techniques that handle divergent application behavior; solve mutually recursive sets of constraints; and balance coverage, precision, and performance. An evaluation on four transactional data store benchmarks shows that Isopredict often predicts unserializable behaviors, 99% of which are feasible.
△ Less
Submitted 6 April, 2024;
originally announced April 2024.
-
SQRQuerier: A Visual Querying Framework for Cross-national Survey Data Recycling
Authors:
Yamei Tu,
Olga Li,
Junpeng Wang,
Han-Wei Shen,
Przemek Powalko,
Irina Tomescu-Dubrow,
Kazimierz M. Slomczynski,
Spyros Blanas,
J. Craig Jenkins
Abstract:
Public opinion surveys constitute a powerful tool to study peoples' attitudes and behaviors in comparative perspectives. However, even worldwide surveys provide only partial geographic and time coverage, which hinders comprehensive knowledge production. To broaden the scope of comparison, social scientists turn to ex-post harmonization of variables from datasets that cover similar topics but in di…
▽ More
Public opinion surveys constitute a powerful tool to study peoples' attitudes and behaviors in comparative perspectives. However, even worldwide surveys provide only partial geographic and time coverage, which hinders comprehensive knowledge production. To broaden the scope of comparison, social scientists turn to ex-post harmonization of variables from datasets that cover similar topics but in different populations and/or years. The resulting new datasets can be analyzed as a single source, which can be flexibly accessed through many data portals. However, such portals offer little guidance to explore the data in-depth or query data with user-customized needs. As a result, it is still challenging for social scientists to efficiently identify related data for their studies and evaluate their theoretical models based on the sliced data. To overcome them, in the Survey Data Recycling (SDR) international cooperation research project, we propose SDRQuerier and apply it to the harmonized SDR database, which features over two million respondents interviewed in a total of 1,721 national surveys that are part of 22 well-known international projects. We design the SDRQuerier to solve three practical challenges that social scientists routinely face. First, a BERT-based model provides customized data queries through research questions or keywords. Second, we propose a new visual design to showcase the availability of the harmonized data at different levels, thus hel** users decide if empirical data exist to address a given research question. Lastly, SDRQuerier discloses the underlying relational patterns among substantive and methodological variables in the database, to help social scientists rigorously evaluate or even improve their regression models. Through case studies with multiple social scientists in solving their daily challenges, we demonstrated the novelty, effectiveness of SDRQuerier.
△ Less
Submitted 25 January, 2022;
originally announced January 2022.
-
Algorithms for a Topology-aware Massively Parallel Computation Model
Authors:
Xiao Hu,
Paraschos Koutris,
Spyros Blanas
Abstract:
Most of the prior work in massively parallel data processing assumes homogeneity, i.e., every computing unit has the same computational capability, and can communicate with every other unit with the same latency and bandwidth. However, this strong assumption of a uniform topology rarely holds in practical settings, where computing units are connected through complex networks. To address this issue…
▽ More
Most of the prior work in massively parallel data processing assumes homogeneity, i.e., every computing unit has the same computational capability, and can communicate with every other unit with the same latency and bandwidth. However, this strong assumption of a uniform topology rarely holds in practical settings, where computing units are connected through complex networks. To address this issue, Blanas et al. recently proposed a topology-aware massively parallel computation model that integrates the network structure and heterogeneity in the modeling cost. The network is modeled as a directed graph, where each edge is associated with a cost function that depends on the data transferred between the two endpoints. The computation proceeds in synchronous rounds, and the cost of each round is measured as the maximum cost over all the edges in the network.
In this work, we take the first step into investigating three fundamental data processing tasks in this topology-aware parallel model: set intersection, cartesian product, and sorting. We focus on network topologies that are tree topologies, and present both lower bounds, as well as (asymptotically) matching upper bounds. The optimality of our algorithms is with respect to the initial data distribution among the network nodes, instead of assuming worst-case distribution as in previous results. Apart from the theoretical optimality of our results, our protocols are simple, use a constant number of rounds, and we believe can be implemented in practical settings as well.
△ Less
Submitted 23 September, 2020;
originally announced September 2020.
-
Chasing Similarity: Distribution-aware Aggregation Scheduling (Extended Version)
Authors:
Feilong Liu,
Ario Salmasi,
Spyros Blanas,
Anastasios Sidiropoulos
Abstract:
Parallel aggregation is a ubiquitous operation in data analytics that is expressed as GROUP BY in SQL, reduce in Hadoop, or segment in TensorFlow. Parallel aggregation starts with an optional local pre-aggregation step and then repartitions the intermediate result across the network. While local pre-aggregation works well for low-cardinality aggregations, the network communication cost remains sig…
▽ More
Parallel aggregation is a ubiquitous operation in data analytics that is expressed as GROUP BY in SQL, reduce in Hadoop, or segment in TensorFlow. Parallel aggregation starts with an optional local pre-aggregation step and then repartitions the intermediate result across the network. While local pre-aggregation works well for low-cardinality aggregations, the network communication cost remains significant for high-cardinality aggregations even after local pre-aggregation. The problem is that the repartition-based algorithm for high-cardinality aggregation does not fully utilize the network.
In this work, we first formulate a mathematical model that captures the performance of parallel aggregation. We prove that finding optimal aggregation plans from a known data distribution is NP-hard, assuming the Small Set Expansion conjecture. We propose GRASP, a GReedy Aggregation Scheduling Protocol that decomposes parallel aggregation into phases. GRASP is distribution-aware as it aggregates the most similar partitions in each phase to reduce the transmitted data size in subsequent phases. In addition, GRASP takes the available network bandwidth into account when scheduling aggregations in each phase to maximize network utilization. The experimental evaluation on real data shows that GRASP outperforms repartition-based aggregation by 3.5x and LOOM by 2.0x.
△ Less
Submitted 29 November, 2018; v1 submitted 30 September, 2018;
originally announced October 2018.
-
To Ship or Not to (Function) Ship (Extended version)
Authors:
Feilong Liu,
Niranjan Kamat,
Spyros Blanas,
Arnab Nandi
Abstract:
Sampling is often used to reduce query latency for interactive big data analytics. The established parallel data processing paradigm relies on function ship**, where a coordinator dispatches queries to worker nodes and then collects the results. The commoditization of high-performance networking makes data ship** possible, where the coordinator directly reads data in the workers' memory using…
▽ More
Sampling is often used to reduce query latency for interactive big data analytics. The established parallel data processing paradigm relies on function ship**, where a coordinator dispatches queries to worker nodes and then collects the results. The commoditization of high-performance networking makes data ship** possible, where the coordinator directly reads data in the workers' memory using RDMA while workers process other queries. In this work, we explore when to use function ship** or data ship** for interactive query processing with sampling. Whether function ship** or data ship** should be preferred depends on the amount of data transferred, the current CPU utilization, the sampling method and the number of queries executed over the data set. The results show that data ship** is up to 6.5x faster when performing clustered sampling with heavily-utilized workers.
△ Less
Submitted 29 July, 2018;
originally announced July 2018.
-
Approximate Distributed Joins in Apache Spark
Authors:
Do Le Quoc,
Istemi Ekin Akkus,
Pramod Bhatotia,
Spyros Blanas,
Ruichuan Chen,
Christof Fetzer,
Thorsten Strufe
Abstract:
The join operation is a fundamental building block of parallel data processing. Unfortunately, it is very resource-intensive to compute an equi-join across massive datasets. The approximate computing paradigm allows users to trade accuracy and latency for expensive data processing operations. The equi-join operator is thus a natural candidate for optimization using approximation techniques. Althou…
▽ More
The join operation is a fundamental building block of parallel data processing. Unfortunately, it is very resource-intensive to compute an equi-join across massive datasets. The approximate computing paradigm allows users to trade accuracy and latency for expensive data processing operations. The equi-join operator is thus a natural candidate for optimization using approximation techniques. Although sampling-based approaches are widely used for approximation, sampling over joins is a compelling but challenging task regarding the output quality. Naive approaches, which perform joins over dataset samples, would not preserve statistical properties of the join output.
To realize this potential, we interweave Bloom filter sketching and stratified sampling with the join computation in a new operator, ApproxJoin, that preserves the statistical properties of the join output. ApproxJoin leverages a Bloom filter to avoid shuffling non-joinable data items around the network and then applies stratified sampling to obtain a representative sample of the join output.
Our analysis shows that ApproxJoin scales well and significantly reduces data movement, without sacrificing tight error bounds on the accuracy of the final results. We implemented ApproxJoin in Apache Spark and evaluated ApproxJoin using microbenchmarks and real-world case studies. The evaluation shows that ApproxJoin achieves a speedup of 6-9x over unmodified Spark-based joins with the same sampling rate. Furthermore, the speedup is accompanied by a significant reduction in the shuffled data volume, which is 5-82x less than unmodified Spark-based joins.
△ Less
Submitted 15 May, 2018;
originally announced May 2018.
-
ArrayBridge: Interweaving declarative array processing with high-performance computing
Authors:
Haoyuan Xing,
Sofoklis Floratos,
Spyros Blanas,
Suren Byna,
Prabhat,
Kesheng Wu,
Paul Brown
Abstract:
Scientists are increasingly turning to datacenter-scale computers to produce and analyze massive arrays. Despite decades of database research that extols the virtues of declarative query processing, scientists still write, debug and parallelize imperative HPC kernels even for the most mundane queries. This impedance mismatch has been partly attributed to the cumbersome data loading process; in res…
▽ More
Scientists are increasingly turning to datacenter-scale computers to produce and analyze massive arrays. Despite decades of database research that extols the virtues of declarative query processing, scientists still write, debug and parallelize imperative HPC kernels even for the most mundane queries. This impedance mismatch has been partly attributed to the cumbersome data loading process; in response, the database community has proposed in situ mechanisms to access data in scientific file formats. Scientists, however, desire more than a passive access method that reads arrays from files.
This paper describes ArrayBridge, a bi-directional array view mechanism for scientific file formats, that aims to make declarative array manipulations interoperable with imperative file-centric analyses. Our prototype implementation of ArrayBridge uses HDF5 as the underlying array storage library and seamlessly integrates into the SciDB open-source array database system. In addition to fast querying over external array objects, ArrayBridge produces arrays in the HDF5 file format just as easily as it can read from it. ArrayBridge also supports time travel queries from imperative kernels through the unmodified HDF5 API, and automatically deduplicates between array versions for space efficiency. Our extensive performance evaluation in NERSC, a large-scale scientific computing facility, shows that ArrayBridge exhibits statistically indistinguishable performance and I/O scalability to the native SciDB storage engine.
△ Less
Submitted 27 February, 2017;
originally announced February 2017.
-
Forecasting the cost of processing multi-join queries via hashing for main-memory databases (Extended version)
Authors:
Feilong Liu,
Spyros Blanas
Abstract:
Database management systems (DBMSs) carefully optimize complex multi-join queries to avoid expensive disk I/O. As servers today feature tens or hundreds of gigabytes of RAM, a significant fraction of many analytic databases becomes memory-resident. Even after careful tuning for an in-memory environment, a linear disk I/O model such as the one implemented in PostgreSQL may make query response time…
▽ More
Database management systems (DBMSs) carefully optimize complex multi-join queries to avoid expensive disk I/O. As servers today feature tens or hundreds of gigabytes of RAM, a significant fraction of many analytic databases becomes memory-resident. Even after careful tuning for an in-memory environment, a linear disk I/O model such as the one implemented in PostgreSQL may make query response time predictions that are up to 2X slower than the optimal multi-join query plan over memory-resident data. This paper introduces a memory I/O cost model to identify good evaluation strategies for complex query plans with multiple hash-based equi-joins over memory-resident data. The proposed cost model is carefully validated for accuracy using three different systems, including an Amazon EC2 instance, to control for hardware-specific differences. Prior work in parallel query evaluation has advocated right-deep and bushy trees for multi-join queries due to their greater parallelization and pipelining potential. A surprising finding is that the conventional wisdom from shared-nothing disk-based systems does not directly apply to the modern shared-everything memory hierarchy. As corroborated by our model, the performance gap between the optimal left-deep and right-deep query plan can grow to about 10X as the number of joins in the query increases.
△ Less
Submitted 21 July, 2015; v1 submitted 10 July, 2015;
originally announced July 2015.
-
Towards Exascale Scientific Metadata Management
Authors:
Spyros Blanas,
Surendra Byna
Abstract:
Advances in technology and computing hardware are enabling scientists from all areas of science to produce massive amounts of data using large-scale simulations or observational facilities. In this era of data deluge, effective coordination between the data production and the analysis phases hinges on the availability of metadata that describe the scientific datasets. Existing workflow engines hav…
▽ More
Advances in technology and computing hardware are enabling scientists from all areas of science to produce massive amounts of data using large-scale simulations or observational facilities. In this era of data deluge, effective coordination between the data production and the analysis phases hinges on the availability of metadata that describe the scientific datasets. Existing workflow engines have been capturing a limited form of metadata to provide provenance information about the identity and lineage of the data. However, much of the data produced by simulations, experiments, and analyses still need to be annotated manually in an ad hoc manner by domain scientists. Systematic and transparent acquisition of rich metadata becomes a crucial prerequisite to sustain and accelerate the pace of scientific innovation. Yet, ubiquitous and domain-agnostic metadata management infrastructure that can meet the demands of extreme-scale science is notable by its absence.
To address this gap in scientific data management research and practice, we present our vision for an integrated approach that (1) automatically captures and manipulates information-rich metadata while the data is being produced or analyzed and (2) stores metadata within each dataset to permeate metadata-oblivious processes and to query metadata through established and standardized data access interfaces. We motivate the need for the proposed integrated approach using applications from plasma physics, climate modeling and neuroscience, and then discuss research challenges and possible solutions.
△ Less
Submitted 29 March, 2015;
originally announced March 2015.
-
High-Performance Concurrency Control Mechanisms for Main-Memory Databases
Authors:
Per-Åke Larson,
Spyros Blanas,
Cristian Diaconu,
Craig Freedman,
Jignesh M. Patel,
Mike Zwilling
Abstract:
A database system optimized for in-memory storage can support much higher transaction rates than current systems. However, standard concurrency control methods used today do not scale to the high transaction rates achievable by such systems. In this paper we introduce two efficient concurrency control methods specifically designed for main-memory databases. Both use multiversioning to isolate read…
▽ More
A database system optimized for in-memory storage can support much higher transaction rates than current systems. However, standard concurrency control methods used today do not scale to the high transaction rates achievable by such systems. In this paper we introduce two efficient concurrency control methods specifically designed for main-memory databases. Both use multiversioning to isolate read-only transactions from updates but differ in how atomicity is ensured: one is optimistic and one is pessimistic. To avoid expensive context switching, transactions never block during normal processing but they may have to wait before commit to ensure correct serialization ordering. We also implemented a main-memory optimized version of single-version locking. Experimental results show that while single-version locking works well when transactions are short and contention is low performance degrades under more demanding conditions. The multiversion schemes have higher overhead but are much less sensitive to hotspots and the presence of long-running transactions.
△ Less
Submitted 31 December, 2011;
originally announced January 2012.