-
Investigation of Adaptive Hotspot-Aware Indexes for Oscillating Write-Heavy and Read-Heavy Workloads -- An Experimental Study
Authors:
Lu Xing,
Walid G. Aref
Abstract:
HTAP systems are designed to handle transactional and analytical workloads. Besides a mixed workload at any given time, the workload can also change over time. A popular kind of continuously changing workload is one that oscillates between being write-heavy and being read-heavy. These oscillating workloads can be observed in many applications. Indexes, e.g., the B+-tree and the LSM-tree, cannot perform equally well all the time. Conventional adaptive indexing does not solve this issue either, as it focuses on adapting in one direction. This paper investigates how to support oscillating workloads with adaptive indexes that adapt the underlying index structures in both directions. With the observation that real-world datasets are skewed, we focus on optimizing the indexes within the hotspot regions. We encapsulate the adaptation techniques into the Adaptive Hotspot-Aware Tree (AHA-tree) adaptive index. We compare the indexes and discuss the insights of each adaptation technique. Our investigation highlights the trade-offs of AHA-tree as well as the pros and cons of each design choice. AHA-tree can behave competitively as compared to an LSM-tree for write-heavy transactional workloads. Upon switching to a read-heavy analytical workload, and after some transient adaptation period, AHA-tree can behave as a B+-tree and can match the B+-tree's read performance.
Submitted 13 June, 2024;
originally announced June 2024.
-
The AHA-Tree: An Adaptive Index for HTAP Workloads
Authors:
Lu Xing,
Walid G. Aref
Abstract:
In this demo, we realize data indexes that can morph from being write-optimized at times to being read-optimized at other times, nonstop and with zero downtime during the workload transition. These data indexes are useful for HTAP systems (Hybrid Transactional and Analytical Processing Systems), where transactional workloads are write-heavy while analytical workloads are read-heavy. Traditional indexes, e.g., B+-tree and LSM-tree, although optimized for one kind of workload, cannot perform equally well under all workloads. Migrating from the write-optimized LSM-tree to a read-optimized B+-tree is costly and mandates some system downtime to reorganize data. We design adaptive indexes that can dynamically morph from a pure LSM-tree to a pure buffered B-tree back and forth, and have interesting states in between. There are two challenges: allowing concurrent operations and avoiding system downtime. This demo benchmarks the proposed AHA-Tree index under dynamic workloads and shows how the index evolves from one state to another without blocking.
Submitted 12 June, 2024;
originally announced June 2024.
-
Multi-Entry Generalized Search Trees for Indexing Trajectories
Authors:
Maxime Schoemans,
Walid G. Aref,
Esteban Zimányi,
Mahmoud Sakr
Abstract:
The idea of generalized indices is one of the success stories of database systems research. It has found its way to implementation in common database systems. GiST (Generalized Search Tree) and SP-GiST (Space-Partitioned Generalized Search Tree) are two widely-used generalized indices that are typically used for multidimensional data. Currently, the generalized indices GiST and SP-GiST represent one database object using one index entry, e.g., a bounding box for each spatio-temporal object. However, when dealing with complex objects, e.g., moving object trajectories, a single entry per object is inadequate for creating efficient indices. Previous research has highlighted that splitting trajectories into multiple bounding boxes prior to indexing can enhance query performance as it leads to a higher index filter. In this paper, we introduce MGiST and MSP-GiST, the multi-entry generalized search tree counterparts of GiST and SP-GiST, respectively, that are designed to enable the partitioning of objects into multiple entries during insertion. The methods for decomposing a complex object into multiple sub-objects differ from one data type to another, and may depend on some domain-specific parameters. Thus, MGiST and MSP-GiST are designed to allow for pluggable modules that aid in optimizing the split of an object into multiple sub-objects. We demonstrate the usefulness of MGiST and MSP-GiST using a trajectory indexing scenario, where we realize several trajectory indexes using MGiST and MSP-GiST and instantiate these search trees with trajectory-specific splitting algorithms. We create and test the performance of several multi-entry versions of widely-used spatial index structures, e.g., R-Tree, Quad-Tree, and KD-Tree. We conduct evaluations using both synthetic and real-world data, and observe up to an order of magnitude enhancement in performance of point, range, and KNN queries.
Submitted 7 June, 2024;
originally announced June 2024.
-
GTX: A Transactional Graph Data System For HTAP Workloads
Authors:
Libin Zhou,
Walid Aref
Abstract:
Processing, managing, and analyzing dynamic graphs are the cornerstone in multiple application domains including fraud detection, recommendation system, graph neural network training, etc. This demo presents GTX, a latch-free write-optimized transactional graph data system that supports high throughput read-write transactions while maintaining competitive graph analytics. GTX has a unique latch-free graph storage and a transaction and concurrency control protocol for dynamic power-law graphs. GTX leverages atomic operations to eliminate latches, proposes a delta-based multi-version storage, and designs a hybrid transaction commit protocol to reduce interference between concurrent operations. To further improve its throughput, we design a delta-chains index to support efficient edge lookups. GTX manages concurrency control at delta-chain level, and provides adaptive concurrency according to the workload. Real-world graph access and updates exhibit temporal localities and hotspots. Unlike other transactional graph systems that experience significant performance degradation, GTX is the only system that can adapt to temporal localities and hotspots in graph updates and maintain million-transactions-per-second throughput. GTX is prototyped as a graph library and is evaluated using a graph library evaluation tool using real and synthetic datasets.
Submitted 2 May, 2024;
originally announced May 2024.
-
GTX: A Write-Optimized Latch-free Graph Data System with Transactional Support
Authors:
Libin Zhou,
Yeasir Rayhan,
Lu Xing,
Walid G. Aref
Abstract:
This paper introduces GTX, a standalone main-memory write-optimized graph system that specializes in structural and graph property updates while maintaining concurrent reads and graph analytics with snapshot isolation-level transactional concurrency. Recent graph libraries target efficient concurrent read and write support while guaranteeing transactional consistency. However, their performance suffers for updates with strong temporal locality over the same vertexes and edges due to vertex-centric lock contention. GTX introduces a new delta-chain-centric concurrency-control protocol that eliminates traditional mutually exclusive latches. GTX resolves the conflicts caused by vertex-level locking, and adapts to real-life workloads while maintaining sequential access to the graph's adjacency-list storage. This combination of features has been demonstrated to provide good performance in graph analytical queries. GTX's transactions support fast group commit, novel write-write conflict prevention, and lazy garbage collection. Based on extensive experimental and comparative studies, in addition to maintaining competitive concurrent read and analytical performance, GTX demonstrates high throughput over state-of-the-art techniques when handling concurrent transaction+analytics workloads. For write-heavy transactional workloads, GTX performs up to 11x better than the best-performing state-of-the-art systems in transaction throughput. At the same time, GTX does not sacrifice the performance of read-heavy analytical workloads, and has competitive performance similar to state-of-the-art systems.
Submitted 2 May, 2024;
originally announced May 2024.
-
A Survey of Learned Indexes for the Multi-dimensional Space
Authors:
Abdullah Al-Mamun,
Hao Wu,
Qiyang He,
Jianguo Wang,
Walid G. Aref
Abstract:
A recent research trend involves treating database index structures as Machine Learning (ML) models. In this domain, single or multiple ML models are trained to learn the mapping from keys to positions inside a data set. This class of indexes is known as "Learned Indexes." Learned indexes have demonstrated improved search performance and reduced space requirements for one-dimensional data. The concept of one-dimensional learned indexes has naturally been extended to multi-dimensional (e.g., spatial) data, leading to the development of "Learned Multi-dimensional Indexes". This survey focuses on learned multi-dimensional index structures. Specifically, it reviews the current state of this research area, explains the core concepts behind each proposed method, and classifies these methods based on several well-defined criteria. We present a taxonomy that classifies and categorizes each learned multi-dimensional index, and survey the existing literature on learned multi-dimensional indexes according to this taxonomy. Additionally, we present a timeline to illustrate the evolution of research on learned indexes. Finally, we highlight several open challenges and future research directions in this emerging and highly active field.
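As an illustration of the "mapping from keys to positions" idea mentioned in the abstract, the following is a minimal one-dimensional sketch, not a method taken from the survey. It assumes a single least-squares line as the model and a recorded worst-case error to bound a local binary search; the class name `LinearLearnedIndex` and its methods are hypothetical.

```python
import bisect


class LinearLearnedIndex:
    """Toy 1-D learned index (illustrative assumption): a fitted line
    predicts a key's position in a sorted array, and the recorded maximum
    prediction error bounds the window that is then binary-searched."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        if n == 0:
            raise ValueError("need at least one key")
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var_k = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(self.keys))
        # Least-squares fit of position as a linear function of key.
        self.slope = cov / var_k if var_k else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Worst-case error over the stored keys bounds the search window.
        self.max_err = max(
            abs(self._predict(k) - p) for p, k in enumerate(self.keys)
        )

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(n, max(0, guess - self.max_err))
        hi = max(lo, min(n, guess + self.max_err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None
```

For example, `LinearLearnedIndex([10, 20, 30, 45, 70, 100]).lookup(45)` searches only the window around the model's prediction instead of the whole array. Multi-dimensional learned indexes, the subject of the survey, need additional machinery (e.g., a way to order or partition the space) that this sketch does not attempt.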
Submitted 11 March, 2024;
originally announced March 2024.
-
The Ubiquitous Skiplist: A Survey of What Cannot be Skipped About the Skiplist and its Applications in Big Data Systems
Authors:
Venkata Sai Pavan Kumar Vadrevu,
Lu Xing,
Walid G. Aref
Abstract:
Skiplists have become prevalent in systems. The main advantages of skiplists are their simplicity and ease of implementation, and the ability to support operations in the same asymptotic complexities as their tree-based counterparts. In this survey, we explore skiplists and their many variants. We highlight many scenarios of how skiplists are useful and fit well in these usage scenarios. We study several extensions to skiplists to make them fit for more applications, e.g., their use in the multi-dimensional space, network overlaying algorithms, as well as serving as indexes in database systems. Besides, we also discuss systems that adopt the idea of skiplists and apply the probabilistic skip pattern into their designs.
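To make the "probabilistic skip pattern" concrete, here is a minimal in-memory skiplist sketch with insert, point lookup, and range scan. It follows the classic textbook design rather than any specific variant covered by the survey; the promotion probability `p`, the `MAX_LEVEL` cap, and all names are illustrative assumptions.

```python
import random


class SkipList:
    """Minimal skiplist sketch: each node is promoted to the next level
    with probability p, giving expected logarithmic search cost."""

    MAX_LEVEL = 16  # assumed cap on tower height

    class _Node:
        __slots__ = ("key", "forward")

        def __init__(self, key, level):
            self.key = key
            self.forward = [None] * level  # one next-pointer per level

    def __init__(self, p=0.5, seed=None):
        self._rng = random.Random(seed)
        self._p = p
        self._level = 1  # number of levels currently in use
        self._head = SkipList._Node(None, SkipList.MAX_LEVEL)

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and self._rng.random() < self._p:
            level += 1
        return level

    def _find_predecessors(self, key):
        """Return, per level, the last node whose key is smaller than key."""
        update = [self._head] * self.MAX_LEVEL
        node = self._head
        for i in range(self._level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        return update

    def insert(self, key):
        """Insert key; return False if it was already present."""
        update = self._find_predecessors(key)
        nxt = update[0].forward[0]
        if nxt is not None and nxt.key == key:
            return False
        level = self._random_level()
        self._level = max(self._level, level)
        new = SkipList._Node(key, level)
        for i in range(level):
            # Splice the new node into the list at every level it occupies.
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new
        return True

    def contains(self, key):
        node = self._find_predecessors(key)[0].forward[0]
        return node is not None and node.key == key

    def range_query(self, lo, hi):
        """Yield all keys k with lo <= k <= hi in ascending order."""
        node = self._find_predecessors(lo)[0].forward[0]
        while node is not None and node.key <= hi:
            yield node.key
            node = node.forward[0]
```

Note that no rebalancing step appears anywhere: the random tower heights alone keep the expected search path short, which is the simplicity advantage the abstract refers to.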
Submitted 22 May, 2024; v1 submitted 7 March, 2024;
originally announced March 2024.
-
SIMD-ified R-tree Query Processing and Optimization
Authors:
Yeasir Rayhan,
Walid G. Aref
Abstract:
The introduction of Single Instruction Multiple Data (SIMD) instructions in mainstream CPUs has enabled modern database engines to leverage data parallelism by performing more computation with a single instruction, resulting in a reduced number of instructions required to execute a query as well as the elimination of conditional branches. Though SIMD in the context of traditional database engines has been studied extensively, it has been overlooked in the context of spatial databases. In this paper, we investigate how spatial database engines can benefit from SIMD vectorization in the context of an R-tree spatial index. We present vectorized versions of the spatial range select, and spatial join operations over a vectorized R-tree index. For each of the operations, we investigate two storage layouts for an R-tree node to leverage SIMD instructions. We design vectorized algorithms for each of the spatial operations given each of the two data layouts. We show that the introduction of SIMD can improve the latency of the spatial query operators up to 9x. We introduce several optimizations over the vectorized implementation of these query operators, and study their effectiveness in query performance and various hardware performance counters under different scenarios.
Submitted 28 September, 2023;
originally announced September 2023.
-
Towards Mobility Data Science (Vision Paper)
Authors:
Mohamed Mokbel,
Mahmoud Sakr,
Li Xiong,
Andreas Züfle,
Jussara Almeida,
Taylor Anderson,
Walid Aref,
Gennady Andrienko,
Natalia Andrienko,
Yang Cao,
Sanjay Chawla,
Reynold Cheng,
Panos Chrysanthis,
Xiqi Fei,
Gabriel Ghinita,
Anita Graser,
Dimitrios Gunopulos,
Christian Jensen,
Joon-Seok Kim,
Kyoung-Sook Kim,
Peer Kröger,
John Krumm,
Johannes Lauer,
Amr Magdy,
Mario Nascimento
, et al. (23 additional authors not shown)
Abstract:
Mobility data captures the locations of moving objects such as humans, animals, and cars. With the availability of GPS-equipped mobile devices and other inexpensive location-tracking technologies, mobility data is collected ubiquitously. In recent years, the use of mobility data has demonstrated significant impact in various domains including traffic management, urban planning, and health sciences. In this paper, we present the emerging domain of mobility data science. Towards a unified approach to mobility data science, we envision a pipeline having the following components: mobility data collection, cleaning, analysis, management, and privacy. For each of these components, we explain how mobility data science differs from general data science, we survey the current state of the art and describe open challenges for the research community in the coming years.
Submitted 7 March, 2024; v1 submitted 21 June, 2023;
originally announced July 2023.
-
An Update-intensive LSM-based R-tree Index
Authors:
Jaewoo Shin,
Jianguo Wang,
Walid G. Aref
Abstract:
Many applications require update-intensive workloads on spatial objects, e.g., social-network services and shared-riding services that track moving objects. By buffering insert and delete operations in memory, the Log Structured Merge Tree (LSM) has been used widely in various systems because of its ability to handle write-heavy workloads. While the focus on LSM has been on key-value stores and their optimizations, there is a need to study how to efficiently support LSM-based secondary indexes (e.g., location-based indexes) as modern, heterogeneous data necessitates the use of secondary indexes. In this paper, we investigate the augmentation of a main-memory-based memo structure into an LSM secondary index structure to handle update-intensive workloads efficiently. We conduct this study in the context of an R-tree-based secondary index. In particular, we introduce the LSM RUM-tree that demonstrates the use of an Update Memo in an LSM-based R-tree to enhance the performance of the R-tree's insert, delete, update, and search operations. The LSM RUM-tree introduces new strategies to control the size of the Update Memo to make sure it always fits in memory for high performance. The Update Memo is a light-weight in-memory structure that is suitable for handling update-intensive workloads without introducing significant overhead. Experimental results using real spatial data demonstrate that the LSM RUM-tree achieves up to 9.6x speedup on update operations and up to 2400x speedup on query processing over existing LSM R-tree implementations.
Submitted 1 May, 2023;
originally announced May 2023.
-
Tutorial: The Ubiquitous Skiplist, its Variants, and Applications in Modern Big Data Systems
Authors:
Venkata Sai Pavan Kumar Vadrevu,
Lu Xing,
Walid G. Aref
Abstract:
The Skiplist, or skip list, originally designed as an in-memory data structure, has attracted a lot of attention in recent years as a main-memory component in many NoSQL, cloud-based, and big data systems. Unlike the B-tree, the skiplist does not need complex rebalancing mechanisms, but it still shows expected logarithmic performance. It supports a variety of operations, including insert, point read, and range queries. To make the skiplist more versatile, many optimizations have been applied to its node structure, construction algorithm, list structure, concurrent access, to name a few. Many variants of the skiplist have been proposed and experimented with, in many big-data system scenarios.
In addition to being a main-memory component, the skiplist also serves as a core index in systems to address problems including write amplification, write stalls, sorting, range query processing, etc. In this tutorial, we present a comprehensive overview of the skiplist, its variants, optimizations, and various use cases of how big data and NoSQL systems make use of skiplists. Throughout this tutorial, we demonstrate the advantages of using a skiplist or skiplist-like structures in modern data systems.
Submitted 19 April, 2023;
originally announced April 2023.
-
The Case for Distributed Shared-Memory Databases with RDMA-Enabled Memory Disaggregation
Authors:
Ruihong Wang,
Jianguo Wang,
Stratos Idreos,
M. Tamer Özsu,
Walid G. Aref
Abstract:
Memory disaggregation (MD) allows for scalable and elastic data center design by separating compute (CPU) from memory. With MD, compute and memory are no longer coupled into the same server box. Instead, they are connected to each other via ultra-fast networking such as RDMA. MD can bring many advantages, e.g., higher memory utilization, better independent scaling (of compute and memory), and lower cost of ownership. This paper makes the case that MD can fuel the next wave of innovation on database systems. We observe that MD revives the great debate of "shared what" in the database community. We envision that distributed shared-memory databases (DSM-DB, for short) - that have not received much attention before - can be promising in the future with MD. We present a list of challenges and opportunities that can inspire next steps in system design making the case for DSM-DB.
Submitted 6 July, 2022;
originally announced July 2022.
-
The "AI+R"-tree: An Instance-optimized R-tree
Authors:
Abdullah-Al-Mamun,
Ch. Md. Rakin Haider,
Jianguo Wang,
Walid G. Aref
Abstract:
The emerging class of instance-optimized systems has shown potential to achieve high performance by specializing to specific data and query workloads. In particular, Machine Learning (ML) techniques have been applied successfully to build various instance-optimized components (e.g., learned indexes). This paper investigates leveraging ML techniques to enhance the performance of spatial indexes, particularly the R-tree, for given data and query workloads. As the areas covered by the R-tree index nodes overlap in space, upon searching for a specific point in space, multiple paths from root to leaf may potentially be explored. In the worst case, the entire R-tree could be searched. In this paper, we define and use the overlap ratio to quantify the degree of extraneous leaf node accesses required by a range query. The goal is to enhance the query performance of a traditional R-tree for high-overlap range queries as they tend to incur long running times. We introduce a new AI-tree that transforms the search operation of an R-tree into a multi-label classification task to exclude the extraneous leaf node accesses. Then, we augment a traditional R-tree with the AI-tree to form a hybrid "AI+R"-tree. The "AI+R"-tree can automatically differentiate between the high- and low-overlap queries using a learned model. Thus, the "AI+R"-tree processes high-overlap queries using the AI-tree, and the low-overlap queries using the R-tree. Experiments on real datasets demonstrate that the "AI+R"-tree can enhance the query performance over a traditional R-tree by up to 500%.
Submitted 1 July, 2022;
originally announced July 2022.
-
ILX: Intelligent "Location+X" Data Systems (Vision Paper)
Authors:
Walid G. Aref,
Ahmed M. Aly,
Anas Daghistani,
Yeasir Rayhan,
Jianguo Wang,
Libin Zhou
Abstract:
Due to the ubiquity of mobile phones and location-detection devices, location data is being generated in very large volumes. Queries and operations that are performed on location data warrant the use of database systems. Despite that, location data is being supported in data systems as an afterthought. Typically, relational or NoSQL data systems that are mostly designed with non-location data in mind get extended with spatial or spatiotemporal indexes, some query operators, and higher level syntactic sugar in order to support location data. The ubiquity of location data and location data services call for systems that are solely designed and optimized for the efficient support of location data. This paper envisions designing intelligent location+X data systems, ILX for short, where location is treated as a first-class citizen type. ILX is tailored with location data as the main data type (location-first). Because location data is typically augmented with other data types X, e.g., graphs, text data, click streams, annotations, etc., ILX needs to be extensible to support other data types X along with location. This paper envisions the main features that ILX should support, and highlights research challenges in realizing and supporting ILX.
Submitted 1 August, 2022; v1 submitted 19 June, 2022;
originally announced June 2022.
-
An Experimental Evaluation and Investigation of Waves of Misery in R-trees
Authors:
Lu Xing,
Eric Lee,
Tong An,
Bo-Cheng Chu,
Ahmed Mahmood,
Ahmed M. Aly,
Jianguo Wang,
Walid G. Aref
Abstract:
Waves of misery is a phenomenon where spikes of many node splits occur over short periods of time in tree indexes. Waves of misery negatively affect the performance of tree indexes in insertion-heavy workloads. Waves of misery were first observed in the context of the B-tree, where these waves cause unpredictable index performance. In particular, the performance of search and index-update operations deteriorates when a wave of misery takes place, but is more predictable between the waves. This paper investigates the presence or lack of waves of misery in several R-tree variants, and studies the extent to which these waves impact the performance of each variant. Interestingly, although having poorer query performance, the Linear and Quadratic R-trees are found to be more resilient to waves of misery than both the Hilbert and R*-trees. This paper presents several techniques to reduce the performance impact of the waves of misery for the Hilbert and R*-trees. One way to eliminate waves of misery is to force node splits to take place at regular times before nodes become full to achieve deterministic performance. The other way is, upon splitting a node, not to split it evenly but rather at different node utilization factors. This allows leaf nodes not to fill at the same pace. We study the impact of two new techniques to mitigate waves of misery after the tree index has been constructed, namely Regular Elective Splits (RES, for short) and Unequal Random Splits (URS, for short). Our experimental investigation highlights the trade-offs in performance of the introduced techniques and the pros and cons of each technique.
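The idea of splitting a node at different utilization factors can be sketched as follows. This is a one-dimensional simplification under stated assumptions, not the paper's URS algorithm: the node is a sorted list of entries, the split fraction is drawn uniformly from a range, and the bounds `min_fill` and `max_fill` as well as the function name are hypothetical.

```python
import random


def unequal_random_split(entries, min_fill=0.3, max_fill=0.7, rng=random):
    """Split an overflowing node at a randomly chosen fraction instead of
    the midpoint, so that sibling nodes end up with different free space
    and do not all fill up (and split again) at the same time."""
    if len(entries) < 2:
        raise ValueError("need at least two entries to split")
    # Assumed policy: draw the split fraction uniformly from the fill range.
    fraction = rng.uniform(min_fill, max_fill)
    cut = int(round(fraction * len(entries)))
    # Guarantee that both resulting nodes are non-empty.
    cut = min(max(cut, 1), len(entries) - 1)
    return entries[:cut], entries[cut:]
```

With an even split, nodes created in the same burst have identical occupancy and tend to overflow together later; randomizing the cut point spreads those future splits over time. A real R-tree split would additionally choose which entries go to which side based on spatial criteria, which this sketch omits.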
Submitted 24 December, 2021;
originally announced December 2021.
-
Scalable Relational Query Processing on Big Matrix Data
Authors:
Yongyang Yu,
Mingjie Tang,
Walid G. Aref
Abstract:
The use of large-scale machine learning methods is becoming ubiquitous in many applications ranging from business intelligence to self-driving cars. These methods require a complex computation pipeline consisting of various types of operations, e.g., relational operations for pre-processing or post-processing the dataset, and matrix operations for core model computations. Many existing systems focus on efficiently processing matrix-only operations, and assume that the inputs to the relational operators are already pre-computed and are materialized as intermediate matrices. However, the input to a relational operator may be complex in machine learning pipelines, and may involve various combinations of matrix operators. Hence, it is critical to realize scalable and efficient relational query processors that directly operate on big matrix data. This paper presents new efficient and scalable relational query processing techniques on big matrix data for in-memory distributed clusters. The proposed techniques leverage algebraic transformation rules to rewrite query execution plans into ones with lower computation costs. A distributed query plan optimizer exploits the sparsity-inducing property of merge functions as well as Bloom join strategies for efficiently evaluating various flavors of the join operation. Furthermore, optimized partitioning schemes for the input matrices are developed to facilitate the performance of join operations based on a cost model that minimizes the communication overhead. The proposed relational query processing techniques are prototyped in Apache Spark. Experiments on both real and synthetic data demonstrate that the proposed techniques achieve up to two orders of magnitude performance improvement over state-of-the-art systems on a wide range of applications.
Submitted 9 November, 2021; v1 submitted 4 October, 2021;
originally announced October 2021.
-
The Future is Big Graphs! A Community View on Graph Processing Systems
Authors:
Sherif Sakr,
Angela Bonifati,
Hannes Voigt,
Alexandru Iosup,
Khaled Ammar,
Renzo Angles,
Walid Aref,
Marcelo Arenas,
Maciej Besta,
Peter A. Boncz,
Khuzaima Daudjee,
Emanuele Della Valle,
Stefania Dumbrava,
Olaf Hartig,
Bernhard Haslhofer,
Tim Hegeman,
Jan Hidders,
Katja Hose,
Adriana Iamnitchi,
Vasiliki Kalavri,
Hugo Kapp,
Wim Martens,
M. Tamer Özsu,
Eric Peukert,
Stefan Plantikow,
et al. (16 additional authors not shown)
Abstract:
Graphs are by nature unifying abstractions that can leverage interconnectedness to represent, explore, predict, and explain real- and digital-world phenomena. Although real users and consumers of graph instances and graph workloads understand these abstractions, future problems will require new abstractions and systems. What needs to happen in the next decade for big graph processing to continue to succeed?
Submitted 11 December, 2020;
originally announced December 2020.
-
STULL: Unbiased Online Sampling for Visual Exploration of Large Spatiotemporal Data
Authors:
Guizhen Wang,
Jingjing Guo,
Mingjie Tang,
José Florencio de Queiroz Neto,
Calvin Yau,
Anas Daghistani,
Morteza Karimzadeh,
Walid G. Aref,
David S. Ebert
Abstract:
Online sampling-supported visual analytics is increasingly important, as it allows users to explore large datasets with acceptable approximate answers at interactive rates. However, existing online spatiotemporal sampling techniques are often biased, as most researchers have primarily focused on reducing computational latency. Biased sampling approaches select data with unequal probabilities and produce results that do not match the exact data distribution, leading end users to incorrect interpretations. In this paper, we propose a novel approach to perform unbiased online sampling of large spatiotemporal data. The proposed approach gives the same selection probability to every point that satisfies the specifications of a user's multidimensional query. To achieve unbiased sampling for accurate representative interactive visualizations, we design a novel data index and an associated sample retrieval plan. Our proposed sampling approach is suitable for a wide variety of visual analytics tasks, e.g., tasks that run aggregate queries of spatiotemporal data. Extensive experiments confirm the superiority of our approach over a state-of-the-art spatial online sampling technique, demonstrating that within the same computational time, data samples generated by our approach are at least 50% more accurate in representing the actual spatial distribution of the data and enable approximate visualizations to present closer visual appearances to the exact ones.
Submitted 29 August, 2020;
originally announced August 2020.
-
SWARM: Adaptive Load Balancing in Distributed Streaming Systems for Big Spatial Data
Authors:
Anas Daghistani,
Walid G. Aref,
Arif Ghafoor,
Ahmed R. Mahmood
Abstract:
The proliferation of GPS-enabled devices has led to the development of numerous location-based services. These services need to process massive amounts of spatial data in real-time. The current scale of spatial data cannot be handled using centralized systems. This has led to the development of distributed spatial streaming systems. Existing systems use static spatial partitioning to distribute the workload. In contrast, real-time streamed spatial data follows non-uniform spatial distributions that change continuously over time. Distributed spatial streaming systems need to react to the changes in the distribution of spatial data and queries. This paper introduces SWARM, a light-weight adaptivity protocol that continuously monitors the data and query workloads across the distributed processes of the spatial data streaming system, and redistributes and rebalances the workloads as soon as performance bottlenecks are detected. SWARM is able to handle multiple query-execution and data-persistence models. A distributed streaming system can directly use SWARM to adaptively rebalance the system's workload among its machines with minimal changes to the original code of the underlying spatial application. Extensive experimental evaluation using real and synthetic datasets illustrates that, on average, SWARM achieves 200% improvement over a static grid partitioning that is determined based on observing a limited history of the data and query workloads. Moreover, SWARM reduces execution latency by 4x on average compared with the other technique.
Submitted 26 February, 2020;
originally announced February 2020.
-
LocationSpark: In-memory Distributed Spatial Query Processing and Optimization
Authors:
Mingjie Tang,
Yongyang Yu,
Walid G. Aref,
Ahmed R. Mahmood,
Qutaibah M. Malluhi,
Mourad Ouzzani
Abstract:
Due to the ubiquity of spatial data applications and the large amounts of spatial data that these applications generate and process, there is a pressing need for scalable spatial query processing. In this paper, we present new techniques for spatial query processing and optimization in an in-memory and distributed setup to address scalability. More specifically, we introduce new techniques for handling query skew, which is common in practice, and optimize communication costs accordingly. We propose a distributed query scheduler that uses a new cost model to optimize the cost of spatial query processing. The scheduler generates query execution plans that minimize the effect of query skew. The query scheduler employs new spatial indexing techniques based on bitmap filters to forward queries to the appropriate local nodes. Each local computation node is responsible for optimizing and selecting its best local query execution plan based on the indexes and the nature of the spatial queries in that node. All the proposed spatial query processing and optimization techniques are prototyped inside Spark, a distributed memory-based computation system. The experimental study is based on real datasets and demonstrates that distributed spatial query processing can be enhanced by up to an order of magnitude over existing in-memory and distributed spatial systems.
Submitted 16 July, 2019; v1 submitted 8 July, 2019;
originally announced July 2019.
-
Design and Evaluation of A Data Partitioning-Based Intrusion Management Architecture for Database Systems
Authors:
Muhamad Felemban,
Yahya Javeed,
Jason Kobes,
Thamir Qadah,
Arif Ghafoor,
Walid Aref
Abstract:
Data-intensive applications exhibit increasing reliance on Database Management Systems (DBMSs, for short). With the growing cyber-security threats to government and commercial infrastructures, the need to develop highly resilient cyber systems is becoming increasingly important. Cyber-attacks on DBMSs include intrusion attacks that may result in severe degradation in performance. Several efforts have been directed towards designing an integrated management system to detect, respond to, and recover from malicious attacks. In this paper, we propose a data Partitioning-based Intrusion Management System (PIMS, for short) that can endure intense malicious intrusion attacks on a DBMS. The novelty in PIMS is the ability to contain the damage within data partitions, termed Intrusion Boundaries (IBs, for short). The IB Demarcation Problem (IBDP, for short) is formulated as a mixed-integer nonlinear programming problem. We prove that IBDP is NP-hard. Accordingly, two heuristic solutions for IBDP are introduced. The proposed architecture for PIMS includes novel IB-centric response and recovery mechanisms, which execute compensating transactions. PIMS is prototyped within PostgreSQL, an open-source DBMS. Finally, empirical and experimental performance evaluations of PIMS are conducted to demonstrate that intelligent partitioning of data tuples improves the overall availability of the DBMS under intrusion attacks.
Submitted 5 October, 2018; v1 submitted 4 October, 2018;
originally announced October 2018.
-
Pattern-Driven Data Cleaning
Authors:
El Kindi Rezig,
Mourad Ouzzani,
Walid G. Aref,
Ahmed K. Elmagarmid,
Ahmed R. Mahmood
Abstract:
Data is inherently dirty and there has been a sustained effort to come up with different approaches to clean it. A large class of data repair algorithms relies on data-quality rules and integrity constraints to detect and repair errors in the data. A well-studied class of integrity constraints is Functional Dependencies (FDs, for short) that specify dependencies among attributes in a relation. In this paper, we address three major challenges in data repairing: (1) Accuracy: Most existing techniques strive to produce repairs that minimize changes to the data. However, this process may produce incorrect combinations of attribute values (or patterns). In this work, we formalize the interaction of FD-induced patterns and select repairs that result in preserving frequent patterns found in the original data. This has the potential to yield a better repair quality both in terms of precision and recall. (2) Interpretability of repairs: Current data repair algorithms produce repairs in the form of data updates that are not necessarily understandable. This makes it hard to debug repair decisions and trace the chain of steps that produced them. To this end, we define a new formalism to declaratively express repairs that are easy for users to reason about. (3) Scalability: We propose a linear-time algorithm to compute repairs that outperforms state-of-the-art FD repairing algorithms by orders of magnitude in repair time. Our experiments using both real-world and synthetic data demonstrate that our new repair approach consistently outperforms existing techniques both in terms of repair quality and scalability.
Submitted 26 December, 2017;
originally announced December 2017.
-
Human-Centric Data Cleaning [Vision]
Authors:
El Kindi Rezig,
Mourad Ouzzani,
Ahmed K. Elmagarmid,
Walid G. Aref
Abstract:
Data Cleaning refers to the process of detecting and fixing errors in the data. Human involvement is instrumental at several stages of this process, e.g., to identify and repair errors, to validate computed repairs, etc. There is currently a plethora of data cleaning algorithms addressing a wide range of data errors (e.g., detecting duplicates, violations of integrity constraints, missing values, etc.). Many of these algorithms involve a human in the loop; however, this involvement is usually coupled to the underlying cleaning algorithms. There is currently no end-to-end data cleaning framework that systematically involves humans in the cleaning pipeline regardless of the underlying cleaning algorithms. In this paper, we highlight key challenges that need to be addressed to realize such a framework. We present a design vision and discuss scenarios that motivate the need for such a framework to judiciously assist humans in the cleaning process. Finally, we present directions to implement such a framework.
Submitted 30 December, 2017; v1 submitted 24 December, 2017;
originally announced December 2017.
-
SBG-Sketch: A Self-Balanced Sketch for Labeled-Graph Stream Summarization
Authors:
Mohamed S. Hassan,
Bruno Ribeiro,
Walid G. Aref
Abstract:
Applications in various domains rely on processing graph streams, e.g., communication logs of a cloud-troubleshooting system, road-network traffic updates, and interactions on a social network. A labeled-graph stream refers to a sequence of streamed edges that form a labeled graph. Label-aware applications need to filter the graph stream before performing a graph operation. Due to the large volume and high velocity of these streams, it is often more practical to incrementally build a lossy-compressed version of the graph, and use this lossy version to approximately evaluate graph queries. Challenges arise when the queries are unknown in advance but are associated with filtering predicates based on edge labels. Surprisingly common, and especially challenging, are labeled-graph streams that have highly skewed label distributions that might also vary over time. This paper introduces Self-Balanced Graph Sketch (SBG-Sketch, for short), a graphical sketch for summarizing and querying labeled-graph streams that can cope with all these challenges. SBG-Sketch maintains synopses for both the edge attributes (e.g., edge weight) and the topology of the streamed graph. SBG-Sketch allows efficient processing of graph-traversal queries, e.g., reachability queries. Experimental results over a variety of real graph streams show that SBG-Sketch reduces the estimation errors of state-of-the-art methods by up to 99%.
Submitted 20 September, 2017;
originally announced September 2017.
-
Empowering In-Memory Relational Database Engines with Native Graph Processing
Authors:
Mohamed S. Hassan,
Tatiana Kuznetsova,
Hyun Chai Jeong,
Walid G. Aref,
Mohammad Sadoghi
Abstract:
The plethora of graphs and relational data gives rise to many interesting graph-relational queries in various domains, e.g., finding related proteins satisfying relational predicates in a biological network. The maturity of RDBMSs motivated academia and industry to invest efforts in leveraging RDBMSs for graph processing, where efficiency is proven for vital graph queries. However, none of these efforts process graphs natively inside the RDBMS, which is particularly challenging due to the impedance mismatch between the relational and the graph models. In this paper, we propose to treat graphs as first-class citizens inside the relational engine so that operations on graphs are executed natively inside the RDBMS. We realize our approach inside VoltDB, an open-source in-memory relational database, and name this realization GRFusion. The SQL and the query engine of GRFusion are empowered to declaratively define graphs and execute cross-data-model query plans formed by graph and relational operators, resulting in up to four orders of magnitude of query-time speedup w.r.t. state-of-the-art approaches.
Submitted 12 October, 2017; v1 submitted 19 September, 2017;
originally announced September 2017.
-
Adaptive Processing of Spatial-Keyword Data Over a Distributed Streaming Cluster
Authors:
Ahmed R. Mahmood,
Anas Daghistani,
Ahmed M. Aly,
Walid G. Aref,
Mingjie Tang,
Saleh Basalamah,
Sunil Prabhakar
Abstract:
The widespread use of GPS-enabled smartphones along with the popularity of micro-blogging and social networking applications, e.g., Twitter and Facebook, has resulted in the generation of huge streams of geo-tagged textual data. Many applications require real-time processing of these streams. For example, location-based e-coupon and ad-targeting systems enable advertisers to register millions of ads to millions of users. The number of users is typically very high and they are continuously moving, and the ads change frequently as well. Hence sending the right ad to the matching users is very challenging. Existing streaming systems are either centralized or are not spatial-keyword aware, and cannot efficiently support the processing of rapidly arriving spatial-keyword data streams. This paper presents Tornado, a distributed spatial-keyword stream processing system. Tornado features routing units to fairly distribute the workload, and furthermore, co-locate the data objects and the corresponding queries at the same processing units. The routing units use the Augmented-Grid, a novel structure that is equipped with an efficient search algorithm for distributing the data objects and queries. Tornado uses evaluators to process the data objects against the queries. The routing units minimize the redundant communication by not sending data updates for processing when these updates do not match any query. By applying dynamically evaluated cost formulae that continuously represent the processing overhead at each evaluator, Tornado is adaptive to changes in the workload. Extensive experimental evaluation using spatio-textual range queries over real Twitter data indicates that Tornado outperforms the non-spatio-textually aware approaches by up to two orders of magnitude in terms of the overall system throughput.
Submitted 8 September, 2017;
originally announced September 2017.
-
FAST: Frequency-Aware Spatio-Textual Indexing for In-Memory Continuous Filter Query Processing
Authors:
Ahmed R. Mahmood,
Ahmed M. Aly,
Walid G. Aref
Abstract:
Many applications need to process massive streams of spatio-textual data in real-time against continuous spatio-textual queries. For example, in location-aware ad targeting publish/subscribe systems, it is required to disseminate millions of ads and promotions to millions of users based on the locations and textual profiles of users. In this paper, we study indexing of continuous spatio-textual queries. There exist several related spatio-textual indexes that typically integrate a spatial index with a textual index. However, these indexes usually have a high demand for main-memory and assume that the entire vocabulary of keywords is known in advance. Also, these indexes do not successfully capture the variations in the frequencies of keywords across different spatial regions and treat frequent and infrequent keywords in the same way. Moreover, existing indexes do not adapt to the changes in workload over space and time. For example, some keywords may be trending at certain times in certain locations and this may change as time passes. This affects the indexing and searching performance of existing indexes significantly. In this paper, we introduce FAST, a Frequency-Aware Spatio-Textual index for continuous spatio-textual queries. FAST is a main-memory index that requires up to one third of the memory needed by the state-of-the-art index. FAST does not assume prior knowledge of the entire vocabulary of indexed objects. FAST adaptively accounts for the difference in the frequencies of keywords within their corresponding spatial regions to automatically choose the best indexing approach that optimizes the insertion and search times. Extensive experimental evaluation using real and synthetic datasets demonstrates that FAST is up to 3x faster in search time and 5x faster in insertion time than the state-of-the-art indexes.
Submitted 4 October, 2017; v1 submitted 8 September, 2017;
originally announced September 2017.
-
A Survey of Shortest-Path Algorithms
Authors:
Amgad Madkour,
Walid G. Aref,
Faizan Ur Rehman,
Mohamed Abdur Rahman,
Saleh Basalamah
Abstract:
A shortest-path algorithm finds a path of minimal cost between two vertices in a graph. A plethora of shortest-path algorithms that span multiple disciplines has been studied in the literature. This paper presents a survey of shortest-path algorithms based on a taxonomy that is introduced in the paper. One dimension of this taxonomy is the various flavors of the shortest-path problem. There is no one general algorithm that is capable of solving all variants of the shortest-path problem due to the space and time complexities associated with each algorithm. Other important dimensions of the taxonomy include whether the shortest-path algorithm operates over a static or a dynamic graph, whether the shortest-path algorithm produces exact or approximate answers, and whether the objective of the shortest-path algorithm is to achieve time-dependence or is to only be goal directed. This survey studies and classifies shortest-path algorithms according to the proposed taxonomy. The survey also presents the challenges and proposed solutions associated with each category in the taxonomy.
Submitted 4 May, 2017;
originally announced May 2017.
-
On Order-independent Semantics of the Similarity Group-By Relational Database Operator
Authors:
Mingjie Tang,
Ruby Y. Tahboub,
Walid G. Aref,
Qutaibah M. Malluhi,
Mourad Ouzzani
Abstract:
Similarity group-by (SGB, for short) has been proposed as a relational database operator to match the needs of emerging database applications. Many SGB operators that extend SQL have been proposed in the literature, e.g., similarity operators in the one-dimensional space. These operators have various semantics. Depending on how these operators are implemented, some of the implementations may lead to different groupings of the data. Hence, if SQL code is ported from one database system to another, it is not guaranteed that the code will produce the same results. In this paper, we investigate the various semantics for the relational similarity group-by operators in the multi-dimensional space. We define the class of order-independent SGB operators that produce the same results regardless of the order in which the input data is presented to them. Using the notion of interval graphs borrowed from graph theory, we prove that, for certain SGB operators, there exist order-independent implementations. For each of these operators, we provide a sample algorithm that is order-independent. Also, we prove that for other SGB operators, there does not exist an order-independent implementation for them, and hence these SGB operators are ill-defined and should not be adopted in extensions to SQL to realize similarity group-by. In this paper, we introduce an SGB operator, namely SGB-All, for grouping multi-dimensional data using similarity. SGB-All forms groups such that a data item, say O, belongs to a group, say G, if and only if O is within a user-defined threshold from all other data items in G. In other words, each group in SGB-All forms a clique of nearby data items in the multi-dimensional space. We prove that SGB-All is order-independent, i.e., there is at least one algorithm for each option that is independent of the presentation order of the input data.
Submitted 13 December, 2014;
originally announced December 2014.
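The SGB-All membership condition in the abstract above (every item in a group lies within a threshold of every other item in that group) can be checked mechanically. The sketch below is illustrative only: the function names, the Euclidean metric, and the naive greedy grouping are assumptions made for this example and are not the paper's algorithms. The greedy routine is included purely to show how a simple implementation can produce different groupings for different input orders, which is the problem the paper studies.

```python
from itertools import combinations
from math import dist


def is_sgb_all_group(points, eps):
    """True if every pair of points is within eps of each other (a clique)."""
    return all(dist(p, q) <= eps for p, q in combinations(points, 2))


def greedy_groups(points, eps):
    """Naive, order-dependent grouping: join the first group that fits.

    Hypothetical helper for illustration; NOT the paper's algorithm.
    """
    groups = []
    for p in points:
        for group in groups:
            if all(dist(p, q) <= eps for q in group):
                group.append(p)
                break
        else:
            groups.append([p])  # no existing group accepts p
    return groups


# Three collinear points, each 1.0 apart; the outer two are 2.0 apart.
forward = greedy_groups([(0.0,), (1.0,), (2.0,)], eps=1.0)
backward = greedy_groups([(2.0,), (1.0,), (0.0,)], eps=1.0)
```

Here `forward` places the middle point with `(0.0,)` while `backward` places it with `(2.0,)`. Both results satisfy the clique condition, yet they differ, so the clique condition alone does not pin down a unique grouping for a naive procedure.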
-
Spatial Queries with Two kNN Predicates
Authors:
Ahmed M. Aly,
Walid G. Aref,
Mourad Ouzzani
Abstract:
The widespread use of location-aware devices has led to countless location-based services in which a user query can be arbitrarily complex, i.e., one that embeds multiple spatial selection and join predicates. Amongst these predicates, the k-Nearest-Neighbor (kNN) predicate stands as one of the most important and widely used predicates. Unlike related research, this paper goes beyond the optimization of queries with single kNN predicates, and shows how queries with two kNN predicates can be optimized. In particular, the paper addresses the optimization of queries with: (i) two kNN-select predicates, (ii) two kNN-join predicates, and (iii) one kNN-join predicate and one kNN-select predicate. For each query type, conceptually correct query evaluation plans (QEPs) and new algorithms that optimize the query execution time are presented. Experimental results demonstrate that the proposed algorithms outperform the conceptually correct QEPs by orders of magnitude.
Submitted 31 July, 2012;
originally announced August 2012.
-
bdbms -- A Database Management System for Biological Data
Authors:
Mohamed Y. Eltabakh,
Mourad Ouzzani,
Walid G. Aref
Abstract:
Biologists are increasingly using databases for storing and managing their data. Biological databases typically consist of a mixture of raw data, metadata, sequences, annotations, and related data obtained from various sources. Current database technology lacks several functionalities that are needed by biological databases. In this paper, we introduce bdbms, an extensible prototype database management system for supporting biological data. bdbms extends the functionalities of current DBMSs to include: (1) Annotation and provenance management including storage, indexing, manipulation, and querying of annotation and provenance as first class objects in bdbms, (2) Local dependency tracking to track the dependencies and derivations among data items, (3) Update authorization to support data curation via content-based authorization, in contrast to identity-based authorization, and (4) New access methods and their supporting operators that support pattern matching on various types of compressed biological data types. This paper presents the design of bdbms along with the techniques proposed to support these functionalities including an extension to SQL. We also outline some open issues in building bdbms.
Submitted 22 December, 2006;
originally announced December 2006.