-
Finding a Second Wind: Speeding Up Graph Traversal Queries in RDBMSs Using Column-Oriented Processing
Authors:
Mikhail Firsov,
Michael Polyntsov,
Kirill Smirnov,
George Chernishev
Abstract:
Recursive queries and recursive derived tables constitute an important part of the SQL standard. Their efficient processing is important for many real-life applications that rely on graph or hierarchy traversal. Position-enabled column-stores offer a novel opportunity to improve run times for this type of query. Such systems allow the engine to explicitly use data positions (row ids) inside its core and thus enable novel efficient implementations of query plan operators.
In this paper, we present an approach that significantly speeds up recursive query processing inside RDBMSes. Its core idea is to employ a particular aspect of column-store technology (late materialization) which enables the query engine to manipulate data positions during query execution. Based on it, we propose two sets of Volcano-style operators intended to process different query cases.
In order to validate our ideas, we have implemented the proposed approach in PosDB, an RDBMS column-store with SQL support. We experimentally demonstrate the viability of our approach by providing a comparison with PostgreSQL. Experiments show that for breadth-first search: 1) our position-based approach yields up to 6x better results than PostgreSQL, 2) our tuple-based one yields only a 3x improvement when using a special rewriting technique, but it can work in a larger number of cases, and 3) neither approach can be emulated efficiently in row-stores.
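As a rough illustration of what manipulating data positions during a breadth-first traversal can look like, here is a minimal Python sketch. It is not PosDB's operator set: the column layout (`src`, `dst`, `label` as parallel lists), the function names, and the in-memory index are all assumptions made for the example. The traversal collects only row positions, and values are fetched once at the end (late materialization).

```python
from collections import defaultdict


def bfs_positions(src, dst, start, max_depth):
    """Return positions (row ids) of edge rows reached by BFS from `start`.

    `src` and `dst` are hypothetical columns of an edge table, stored as
    parallel lists. Only integer positions are collected during traversal.
    """
    # Hypothetical index: source vertex -> positions of its outgoing edges.
    index = defaultdict(list)
    for pos, vertex in enumerate(src):
        index[vertex].append(pos)

    visited = {start}
    frontier = [start]
    positions = []
    for _ in range(max_depth):
        next_frontier = []
        for vertex in frontier:
            for pos in index.get(vertex, ()):
                positions.append(pos)
                target = dst[pos]
                if target not in visited:
                    visited.add(target)
                    next_frontier.append(target)
        if not next_frontier:
            break
        frontier = next_frontier
    return positions


def materialize(positions, *columns):
    """Late materialization: read column values only for the final positions."""
    return [tuple(col[p] for col in columns) for p in positions]


src = [1, 1, 2, 3]
dst = [2, 3, 4, 4]
label = ["a", "b", "c", "d"]
one_hop = bfs_positions(src, dst, start=1, max_depth=1)
rows = materialize(one_hop, dst, label)
```

The point of the sketch is only that the recursive part touches two narrow columns and a list of integers, while wider payload columns such as `label` are read once, after the traversal has finished.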
Submitted 16 August, 2023;
originally announced August 2023.
-
Solving Data Quality Problems with Desbordante: a Demo
Authors:
George Chernishev,
Michael Polyntsov,
Anton Chizhov,
Kirill Stupakov,
Ilya Shchuckin,
Alexander Smirnov,
Maxim Strutovsky,
Alexey Shlyonskikh,
Mikhail Firsov,
Stepan Manannikov,
Nikita Bobrov,
Daniil Goncharov,
Ilia Barutkin,
Vladislav Shalnev,
Kirill Muraviev,
Anna Rakhmukova,
Dmitriy Shcheka,
Anton Chernikov,
Mikhail Vyrodov,
Yaroslav Kurbatov,
Maxim Fofanov,
Sergei Belokonnyi,
Pavel Anosov,
Arthur Saliou,
Eduard Gaisin
, et al. (1 additional authors not shown)
Abstract:
Data profiling is an essential process in modern data-driven industries. One of its critical components is the discovery and validation of complex statistics, including functional dependencies, data constraints, association rules, and others.
However, most existing data profiling systems that focus on complex statistics do not provide proper integration with the tools used by contemporary data scientists. This creates a significant barrier to the adoption of these tools in the industry. Moreover, existing systems were not created with industrial-grade workloads in mind. Finally, they do not aim to provide descriptive explanations, i.e. why a given pattern is not found. It is a significant issue as it is essential to understand the underlying reasons for a specific pattern's absence to make informed decisions based on the data.
Because of that, these patterns are effectively left hanging in thin air: their application scope is rather limited, and they are rarely used by the broader public. At the same time, as we are going to demonstrate in this presentation, complex statistics can be efficiently used to solve many classic data quality problems.
Desbordante is an open-source data profiler that aims to close this gap. It is built with emphasis on industrial application: it is efficient, scalable, resilient to crashes, and provides explanations. Furthermore, it provides seamless Python integration by offloading various costly operations, not only mining, to the C++ core.
In this demonstration, we show several scenarios that allow end users to solve different data quality problems. Namely, we showcase typo detection, data deduplication, and data anomaly detection scenarios.
Submitted 28 July, 2023; v1 submitted 27 July, 2023;
originally announced July 2023.
-
Simple method for detecting sleep episodes in rats ECoG using machine learning
Authors:
Konstantin Sergeev,
Anastasiya Runnova,
Maxim Zhuravlev,
Evgenia Sitnikova,
Elizaveta Rutskova,
Kirill Smirnov,
Andrei Slepnev,
Nadezhda Semenova
Abstract:
In this paper we propose a new method for the automatic recognition of the state of behavioral sleep (BS) and waking state (WS) in freely moving rats using their electrocorticographic (ECoG) data. Three-channel ECoG signals were recorded from frontal left, frontal right and occipital right cortical areas. We employed a simple artificial neural network (ANN), in which the mean values and standard deviations of ECoG signals from two or three channels were used as inputs for the ANN. Results of wavelet-based recognition of BS/WS in the same data were used to train the ANN and evaluate correctness of our classifier. We tested different combinations of ECoG channels for detecting BS/WS.
Our results showed that the accuracy of ANN classification did not depend on ECoG-channel. For any ECoG-channel, networks were trained on one rat and applied to another rat with an accuracy of at least 80%. It is important that we used a very simple network topology to achieve a relatively high accuracy of classification. Our classifier was based on a simple linear combination of input signals with some weights, and these weights could be replaced by the averaged weights of all trained ANNs without decreases in classification accuracy. In all, we introduce a new sleep recognition method that does not require additional network training. It is enough to know the coefficients and the equations suggested in this paper. The proposed method showed very fast performance and simple computations, therefore it could be used in real time experiments. It might be of high demand in preclinical studies in rodents that require vigilance control or monitoring of sleep-wake patterns.
Submitted 2 February, 2023;
originally announced February 2023.
-
Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint)
Authors:
George Chernishev,
Michael Polyntsov,
Anton Chizhov,
Kirill Stupakov,
Ilya Shchuckin,
Alexander Smirnov,
Maxim Strutovsky,
Alexey Shlyonskikh,
Mikhail Firsov,
Stepan Manannikov,
Nikita Bobrov,
Daniil Goncharov,
Ilia Barutkin,
Vladislav Shalnev,
Kirill Muraviev,
Anna Rakhmukova,
Dmitriy Shcheka,
Anton Chernikov,
Dmitrii Mandelshtam,
Mikhail Vyrodov,
Arthur Saliou,
Eduard Gaisin,
Kirill Smirnov
Abstract:
Pioneering data profiling systems such as Metanome and OpenClean brought public attention to science-intensive data profiling. This type of profiling aims to extract complex patterns (primitives) such as functional dependencies, data constraints, association rules, and others. However, these tools are research prototypes rather than production-ready systems.
The following work presents Desbordante - a high-performance science-intensive data profiler with open source code. Unlike similar systems, it is built with emphasis on industrial application in a multi-user environment. It is efficient, resilient to crashes, and scalable. Its efficiency is ensured by implementing discovery algorithms in C++, resilience is achieved by extensive use of containerization, and scalability is based on replication of containers.
Desbordante aims to open industrial-grade primitive discovery to a broader public, focusing on domain experts who are not IT professionals. Aside from the discovery of various primitives, Desbordante offers primitive validation, which not only reports whether a given instance of primitive holds or not, but also points out what prevents it from holding via the use of special screens. Next, Desbordante supports pipelines - ready-to-use functionality implemented using the discovered primitives, for example, typo detection. We provide built-in pipelines, and the users can construct their own via provided Python bindings. Unlike other profilers, Desbordante works not only with tabular data, but with graph and transactional data as well.
In this paper, we present Desbordante, the vision behind it and its use-cases. To provide a more in-depth perspective, we discuss its current state, architecture, and design decisions it is built on. Additionally, we outline our future plans.
Submitted 14 January, 2023;
originally announced January 2023.
-
Implementing the Comparison-Based External Sort
Authors:
Michael Polyntsov,
Valentin Grigorev,
Kirill Smirnov,
George Chernishev
Abstract:
In the age of big data, sorting is an indispensable operation for DBMSes and similar systems. Having data sorted can help produce query plans with significantly lower run times. It also can provide other benefits like having non-blocking operators which will produce data steadily (without bursts), or operators with reduced memory footprint.
Sorting may be required at any step of query processing, be it on source data or on intermediate results. At the same time, the data to be sorted may not fit into main memory. In this case, an external sort operator, which writes intermediate results to disk, should be used.
In this paper we consider an external sort operator of the comparison-based sort type. We discuss its implementation and describe related design decisions. Our aim is to study the performance impact of the data structure used in the merge step. For this, we have experimentally evaluated three data structures implemented inside a DBMS.
Results have shown that it is worthwhile to make an effort to implement an efficient data structure for run merging, even on modern commodity computers which are usually disk-bound. Moreover, we demonstrated that using a loser tree is a more efficient approach than both the naive approach and the heap-based one.
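For readers unfamiliar with the loser tree mentioned above, the following Python sketch shows one common way to merge k sorted runs with it. It is a textbook-style illustration, not the implementation evaluated in the paper: the function name `loser_tree_merge` is hypothetical, and in-memory lists stand in for sorted runs that an external sort would read from disk.

```python
def loser_tree_merge(runs):
    """Merge already-sorted runs with a loser tree (tournament tree)."""
    k = len(runs)
    if k == 0:
        return []
    exhausted = object()  # marks a run with no remaining elements
    iters = [iter(run) for run in runs]
    heads = [next(it, exhausted) for it in iters]

    def beats(a, b):
        """True if the head of run a should be output before that of run b."""
        if heads[a] is exhausted:
            return False
        if heads[b] is exhausted:
            return True
        return (heads[a], a) < (heads[b], b)  # run index breaks ties

    # tree[1..k-1] store the loser of each match; tree[0] stores the winner.
    # Leaves are implicit: tree node k + i corresponds to run i.
    tree = [0] * k

    def build(node):
        if node >= k:
            return node - k
        left, right = build(2 * node), build(2 * node + 1)
        winner, loser = (left, right) if beats(left, right) else (right, left)
        tree[node] = loser
        return winner

    tree[0] = build(1)

    merged = []
    while heads[tree[0]] is not exhausted:
        winner = tree[0]
        merged.append(heads[winner])
        heads[winner] = next(iters[winner], exhausted)
        # Replay only the matches on the path from this leaf to the root.
        node = (winner + k) // 2
        while node >= 1:
            if beats(tree[node], winner):
                tree[node], winner = winner, tree[node]
            node //= 2
        tree[0] = winner
    return merged


merged = loser_tree_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]])
```

After each output element, the refill is compared once per level against the stored loser on its leaf-to-root path, whereas a binary heap's sift-down typically needs two comparisons per level. This difference is the usual argument for loser trees in run merging.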
Submitted 26 July, 2022;
originally announced July 2022.
-
Revisiting Data Compression in Column-Stores
Authors:
Alexander Slesarev,
Evgeniy Klyuchikov,
Kirill Smirnov,
George Chernishev
Abstract:
Data compression is widely used in contemporary column-oriented DBMSes to lower space usage and to speed up query processing. Pioneering systems have introduced compression to tackle the disk bandwidth bottleneck by trading CPU processing power for it. The main issue with this approach is the trade-off between the compression ratio and the decompression CPU cost. Existing results state that light-weight compression with small decompression costs outperforms heavy-weight compression schemes in column-stores. However, since the time these results were obtained, CPU, RAM, and disk performance have advanced considerably. Moreover, novel compression algorithms have emerged.
In this paper, we revisit the problem of compression in disk-based column-stores. More precisely, we study the I/O-RAM compression scheme which implies that there are two types of pages of different size: disk pages (compressed) and in-memory pages (uncompressed). In this scheme, the buffer manager is responsible for decompressing pages as soon as they arrive from disk. This scheme is rather popular as it is easy to implement: several modern column and row-stores use it.
We pose and address the following research questions: 1) Are heavy-weight compression schemes still inappropriate for disk-based column-stores? 2) Are new light-weight compression algorithms better than the old ones? 3) Is there a need for SIMD-employing decompression algorithms in the case of a disk-based system? We study these questions experimentally using a columnar query engine and the Star Schema Benchmark.
Submitted 19 May, 2021;
originally announced May 2021.
-
Extending Databases to Support Data Manipulation with Functional Dependencies: a Vision Paper
Authors:
Nikita Bobrov,
Kirill Smirnov,
George Chernishev
Abstract:
In the current paper, we propose to fuse together stored data (tables) and their functional dependencies (FDs) inside a DBMS. We aim to make FDs first-class citizens: objects which can be queried and used to query data. Our idea is to allow analysts to explore both data and functional dependencies using the database interface. For example, an analyst may be interested in such tasks as: "find all rows which prevent a given functional dependency from holding", "for a given table, find all functional dependencies that involve a given attribute", "project all attributes that functionally determine a specified attribute".
For this purpose, we propose: (1) an SQL-based query language for querying a collection of functional dependencies, (2) an extension of the SQL SELECT clause for supporting FD-based predicates, including approximate ones, and (3) a special data structure intended for containing mined FDs and acting as a mediator between user queries and underlying data. We describe the proposed extensions, demonstrate their use-cases, and finally, discuss implementation details and their impact on query processing.
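To make the first example task concrete ("find all rows which prevent a given functional dependency from holding"), here is a small Python sketch. It does not use the SQL extensions proposed in the paper: the function name `fd_violations`, the dict-per-row table representation, and the sample data are assumptions chosen for illustration. It takes one possible reading of the task, reporting every row in a group whose left-hand-side values map to more than one right-hand-side value.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return indices of rows that prevent the FD lhs -> rhs from holding.

    `rows` is a list of dicts, `lhs` a list of attribute names and `rhs`
    a single attribute name (all hypothetical conventions for this sketch).
    """
    # Group row indices by their values on the left-hand-side attributes.
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[attr] for attr in lhs)].append(i)

    violating = []
    for indices in groups.values():
        # The FD is violated within a group if the rhs values disagree.
        if len({rows[i][rhs] for i in indices}) > 1:
            violating.extend(indices)
    return sorted(violating)


table = [
    {"zip": "10001", "city": "NYC"},
    {"zip": "10001", "city": "New York"},
    {"zip": "94105", "city": "SF"},
]
bad_rows = fd_violations(table, ["zip"], "city")
```

In this sample, the first two rows share a `zip` value but disagree on `city`, so both are reported, while the third row is consistent with the dependency.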
Submitted 16 May, 2020;
originally announced May 2020.