Search | arXiv e-print repository

arXiv:2405.11988 [pdf, other]

doi 10.1145/3662010.3663447

DuckDB-SGX2: The Good, The Bad and The Ugly within Confidential Analytical Query Processing

Authors: Ilaria Battiston, Lotte Felius, Sam Ansmink, Laurens Kuiper, Peter Boncz

Abstract: We provide an evaluation of an analytical workload in a confidential computing environment, combining DuckDB with two technologies: modular columnar encryption in Parquet files (data at rest) and the newest version of the Intel SGX Trusted Execution Environment (TEE), providing a hardware enclave where data in flight can be (more) securely decrypted and processed. One finding is that the "performance tax" for such confidential analytical processing is acceptable compared to not using these technologies. We eventually manage to run TPC-H SF30 with under 2x overhead compared to non-encrypted, non-enclave execution; we show that, specifically, columnar compression and encryption are a good combination. Our second finding consists of dos and don'ts to tune DuckDB to work effectively in this environment. There are various performance hazards: potentially 5x higher cache miss costs due to memory encryption inside the enclave, NUMA penalties, and highly elevated cost of swapping pages in and out of the enclave -- which is also triggered indirectly by using a non-SGX-aware malloc library.

Submitted 30 May, 2024; v1 submitted 20 May, 2024; originally announced May 2024.

arXiv:2404.16486 [pdf, other]

doi 10.1145/3626246.3654743

OpenIVM: a SQL-to-SQL Compiler for Incremental Computations

Authors: Ilaria Battiston, Kriti Kathuria, Peter Boncz

Abstract: This demonstration presents a new Open Source SQL-to-SQL compiler for Incremental View Maintenance (IVM). While previous systems, such as DBToaster, implemented computational functionality for IVM in a separate system, the core principle of OpenIVM is to make use of existing SQL query processing engines and perform all IVM computations via SQL. This approach enables the integration of IVM in these systems without code duplication. Also, it eases its use in cross-system IVM, i.e. to orchestrate an HTAP system in which one (OLTP) DBMS provides insertions/updates/deletes (deltas), which are propagated using SQL into another (OLAP) DBMS, hosting materialized views. Our system compiles view definitions into SQL to eventually propagate deltas into the table that materializes the view, following the principles of DBSP. Under the hood, OpenIVM uses the DuckDB library to compile (parse, transform, optimize) the materialized view maintenance logic. We demonstrate OpenIVM in action (i) as the core of a DuckDB extension module that adds IVM functionality to it and (ii) powering cross-system IVM for HTAP, with PostgreSQL handling updates on base tables and DuckDB hosting materialized views on these.

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2312.12923 [pdf, other]

Improving Data Minimization through Decentralized Data Architectures

Authors: Ilaria Battiston, Peter Boncz

Abstract: In this research project, we investigate an alternative to the standard cloud-centralized data architecture. Specifically, we aim to leave part of the application data under the control of the individual data owners in decentralized personal data stores. Our primary goal is to increase data minimization, i.e., enabling more sensitive personal data to be under the control of its owners while providing a straightforward and efficient framework to design architectures that allow applications to run and data to be analyzed. To serve this purpose, the centralized part of the schema contains aggregating views over this decentralized data. We propose to design a declarative language that extends SQL, for architects to specify different kinds of tables and views at the schema level, along with sensitive columns and the minimum granularity level of their aggregations. Local updates need to be reflected in the centralized views while ensuring privacy throughout intermediate calculations; for this we pursue the integration of distributed materialized view maintenance and multi-party computation (MPC) techniques. We finally aim to implement this system, where the personal data stores could either live in mobile devices or encrypted cloud storage, in order to evaluate its performance properties.

Submitted 20 December, 2023; originally announced December 2023.

Comments: 4 pages, 1 figure

Journal ref: CEUR Workshop Proceedings 2023

arXiv:2307.04820 [pdf, other]

The LDBC Social Network Benchmark Interactive workload v2: A transactional graph query benchmark with deep delete operations

Authors: David Püroja, Jack Waudby, Peter Boncz, Gábor Szárnyas

Abstract: The LDBC Social Network Benchmark's Interactive workload captures an OLTP scenario operating on a correlated social network graph. It consists of complex graph queries executed concurrently with a stream of update operations. Since its initial release in 2015, the Interactive workload has become the de facto industry standard for benchmarking transactional graph data management systems. As graph systems have matured and the community's understanding of graph processing features has evolved, we initiated the renewal of this benchmark. This paper describes the draft Interactive v2 workload with several new features: delete operations, a cheapest path-finding query, support for larger data sets, and a novel temporal parameter curation algorithm that ensures stable runtimes for path queries.

Submitted 17 August, 2023; v1 submitted 10 July, 2023; originally announced July 2023.

ACM Class: H.2.4

arXiv:2307.04350 [pdf, other]

The Linked Data Benchmark Council (LDBC): Driving competition and collaboration in the graph data management space

Authors: Gábor Szárnyas, Brad Bebee, Altan Birler, Alin Deutsch, George Fletcher, Henry A. Gabb, Denise Gosnell, Alastair Green, Zhihui Guo, Keith W. Hare, Jan Hidders, Alexandru Iosup, Atanas Kiryakov, Tomas Kovatchev, Xinsheng Li, Leonid Libkin, Heng Lin, Xiaojian Luo, Arnau Prat-Pérez, David Püroja, Shipeng Qi, Oskar van Rest, Benjamin A. Steer, Dávid Szakállas, Bing Tong, et al. (8 additional authors not shown)

Abstract: Graph data management is instrumental for several use cases such as recommendation, root cause analysis, financial fraud detection, and enterprise knowledge representation. Efficiently supporting these use cases yields a number of unique requirements, including the need for a concise query language and graph-aware query optimization techniques. The goal of the Linked Data Benchmark Council (LDBC) is to design a set of standard benchmarks that capture representative categories of graph data management problems, making the performance of systems comparable and facilitating competition among vendors. LDBC also conducts research on graph schemas and graph query languages. This paper introduces the LDBC organization and its work over the last decade.

Submitted 17 August, 2023; v1 submitted 10 July, 2023; originally announced July 2023.

ACM Class: H.2.4

arXiv:2112.06280 [pdf, other]

In-Memory Indexed Caching for Distributed Data Processing

Authors: Alexandru Uta, Bogdan Ghit, Ankur Dave, Jan Rellermeyer, Peter Boncz

Abstract: Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de-facto distributed data processing framework, Apache Spark, is poorly suited for modern cloud-based data-science workloads due to its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library which can be integrated with minimum effort in existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution when compared to a non-indexed dataframe, incurring modest memory overhead.

Submitted 8 February, 2022; v1 submitted 12 December, 2021; originally announced December 2021.

Comments: Accepted for publication at IEEE IPDPS 2022

arXiv:2105.15111 [pdf, ps, other]

An Epidemiological Model for contact tracing with the Dutch CoronaMelder App

Authors: Peter Boncz

Abstract: We present an epidemiological model for the effectiveness of CoronaMelder, the Dutch digital contact tracing app developed on top of the Google/Apple Exposure Notification framework. We compare the effectiveness of CoronaMelder with manual contact tracing on a number of metrics. CoronaMelder turns out to have a small but noticeable positive influence in slowing down the COVID-19 pandemic, an effect that will become more pronounced in an opened-up society where adoption of CoronaMelder is increased.

Submitted 10 June, 2021; v1 submitted 18 May, 2021; originally announced May 2021.

Comments: This is a first and preliminary draft. Future updates are expected

arXiv:2012.06171 [pdf, other]

doi 10.1145/3434642

The Future is Big Graphs! A Community View on Graph Processing Systems

Authors: Sherif Sakr, Angela Bonifati, Hannes Voigt, Alexandru Iosup, Khaled Ammar, Renzo Angles, Walid Aref, Marcelo Arenas, Maciej Besta, Peter A. Boncz, Khuzaima Daudjee, Emanuele Della Valle, Stefania Dumbrava, Olaf Hartig, Bernhard Haslhofer, Tim Hegeman, Jan Hidders, Katja Hose, Adriana Iamnitchi, Vasiliki Kalavri, Hugo Kapp, Wim Martens, M. Tamer Özsu, Eric Peukert, Stefan Plantikow, et al. (16 additional authors not shown)

Abstract: Graphs are by nature unifying abstractions that can leverage interconnectedness to represent, explore, predict, and explain real- and digital-world phenomena. Although real users and consumers of graph instances and graph workloads understand these abstractions, future problems will require new abstractions and systems. What needs to happen in the next decade for big graph processing to continue to succeed?

Submitted 11 December, 2020; originally announced December 2020.

Comments: 12 pages, 3 figures, collaboration between the large-scale systems and data management communities, work started at the Dagstuhl Seminar 19491 on Big Graph Processing Systems, to be published in the Communications of the ACM

ACM Class: C.3; E.0; H.2; J.0

arXiv:2011.15028 [pdf, other]

The LDBC Graphalytics Benchmark

Authors: Alexandru Iosup, Ahmed Musaafir, Alexandru Uta, Arnau Prat Pérez, Gábor Szárnyas, Hassan Chafi, Ilie Gabriel Tănase, Lifeng Nai, Michael Anderson, Mihai Capotă, Narayanan Sundaram, Peter Boncz, Siegfried Depner, Stijn Heldens, Thomas Manhardt, Tim Hegeman, Wing Lung Ngai, Yinglong Xia

Abstract: In this document, we describe LDBC Graphalytics, an industrial-grade benchmark for graph analysis platforms. The main goal of Graphalytics is to enable the fair and objective comparison of graph analysis platforms. Due to the diversity of bottlenecks and performance issues such platforms need to address, Graphalytics consists of a set of selected deterministic algorithms for full-graph analysis, standard graph datasets, synthetic dataset generators, and reference output for validation purposes. Its test harness produces deep metrics that quantify multiple kinds of systems scalability, weak and strong, and robustness, such as failures and performance variability. The benchmark also balances comprehensiveness with runtime necessary to obtain the deep metrics. The benchmark comes with open-source software for generating performance data, for validating algorithm results, for monitoring and sharing performance data, and for obtaining the final benchmark result as a standard performance report.

Submitted 6 April, 2023; v1 submitted 30 November, 2020; originally announced November 2020.

ACM Class: C.4; H.2.4

arXiv:2001.02299 [pdf, other]

The LDBC Social Network Benchmark

Authors: Renzo Angles, János Benjamin Antal, Alex Averbuch, Altan Birler, Peter Boncz, Márton Búr, Orri Erling, Andrey Gubichev, Vlad Haprian, Moritz Kaufmann, Josep Lluís Larriba Pey, Norbert Martínez, József Marton, Marcus Paradies, Minh-Duc Pham, Arnau Prat-Pérez, David Püroja, Mirko Spasić, Benjamin A. Steer, Dávid Szakállas, Gábor Szárnyas, Jack Waudby, Mingxi Wu, Yuchen Zhang

Abstract: The Linked Data Benchmark Council's Social Network Benchmark (LDBC SNB) is an effort intended to test various functionalities of systems used for graph-like data management. For this, LDBC SNB uses the recognizable scenario of operating a social network, characterized by its graph-shaped data. LDBC SNB consists of two workloads that focus on different functionalities: the Interactive workload (interactive transactional queries) and the Business Intelligence workload (analytical queries). This document contains the definition of both workloads. This includes a detailed explanation of the data used in the LDBC SNB, a detailed description for all queries, and instructions on how to generate the data and run the benchmark with the provided software.

Submitted 14 January, 2024; v1 submitted 7 January, 2020; originally announced January 2020.

Comments: For the repository containing the source code of this technical report, see https://github.com/ldbc/ldbc_snb_docs

ACM Class: H.2.4

arXiv:1907.00083 [pdf, other]

Extracting Novel Facts from Tables for Knowledge Graph Completion (Extended version)

Authors: Benno Kruit, Peter Boncz, Jacopo Urbani

Abstract: We propose a new end-to-end method for extending a Knowledge Graph (KG) from tables. Existing techniques tend to interpret tables by focusing on information that is already in the KG, and therefore tend to extract many redundant facts. Our method aims to find more novel facts. We introduce a new technique for table interpretation based on a scalable graphical model using entity similarities. Our method further disambiguates cell values using KG embeddings as an additional ranking method. Other distinctive features are the lack of assumptions about the underlying KG and the enabling of a fine-grained tuning of the precision/recall trade-off of extracted facts. Our experiments show that our approach has a higher recall during the interpretation process than the state-of-the-art, and is more resistant against the bias observed in extracting mostly redundant facts since it produces more novel extractions.

Submitted 15 July, 2019; v1 submitted 28 June, 2019; originally announced July 2019.

arXiv:1904.08223 [pdf, other]

Estimating Cardinalities with Deep Sketches

Authors: Andreas Kipf, Dimitri Vorona, Jonas Müller, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, Thomas Neumann, Alfons Kemper

Abstract: We introduce Deep Sketches, which are compact models of databases that allow us to estimate the result sizes of SQL queries. Deep Sketches are powered by a new deep learning approach to cardinality estimation that can capture correlations between columns, even across tables. Our demonstration allows users to define such sketches on the TPC-H and IMDb datasets, monitor the training process, and run ad-hoc queries against trained sketches. We also estimate query cardinalities with HyPer and PostgreSQL to visualize the gains over traditional cardinality estimators.

Submitted 17 April, 2019; originally announced April 2019.

Comments: To appear in SIGMOD'19

arXiv:1809.00677 [pdf, other]

Learned Cardinalities: Estimating Correlated Joins with Deep Learning

Authors: Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, Alfons Kemper

Abstract: We describe a new deep learning approach to cardinality estimation. MSCN is a multi-set convolutional network, tailored to representing relational query plans, that employs set semantics to capture query features and true cardinalities. MSCN builds on sampling-based estimation, addressing its weaknesses when no sampled tuples qualify a predicate, and in capturing join-crossing correlations. Our evaluation of MSCN using a real-world dataset shows that deep learning significantly enhances the quality of cardinality estimation, which is the core problem in query optimization.

Submitted 18 December, 2018; v1 submitted 3 September, 2018; originally announced September 2018.

Comments: CIDR 2019. https://github.com/andreaskipf/learnedcardinalities

arXiv:1802.09488 [pdf, other]

Adaptive Geospatial Joins for Modern Hardware

Authors: Andreas Kipf, Harald Lang, Varun Pandey, Raul Alexandru Persa, Peter Boncz, Thomas Neumann, Alfons Kemper

Abstract: Geospatial joins are a core building block of connected mobility applications. Especially challenging are joins between streaming points and static polygons. Since points are not known beforehand, they cannot be indexed. Nevertheless, points need to be mapped to polygons with low latencies to enable real-time feedback. We present an adaptive geospatial join that uses true hit filtering to avoid expensive geometric computations in most cases. Our technique uses a quadtree-based hierarchical grid to approximate polygons and stores these approximations in a specialized radix tree. We emphasize an approximate version of our algorithm that guarantees a user-defined precision. The exact version of our algorithm can adapt to the expected point distribution by refining the index. We optimized our implementation for modern hardware architectures with wide SIMD vector processing units, including Intel's brand new Knights Landing. Overall, our approach can perform up to two orders of magnitude faster than existing techniques.

Submitted 26 February, 2018; originally announced February 2018.

arXiv:1712.01550 [pdf, other]

G-CORE: A Core for Future Graph Query Languages

Authors: Renzo Angles, Marcelo Arenas, Pablo Barceló, Peter Boncz, George H. L. Fletcher, Claudio Gutierrez, Tobias Lindaaker, Marcus Paradies, Stefan Plantikow, Juan Sequeda, Oskar van Rest, Hannes Voigt

Abstract: We report on a community effort between industry and academia to shape the future of graph query languages. We argue that existing graph database management systems should consider supporting a query language with two key characteristics. First, it should be composable, meaning that graphs are the input and the output of queries. Second, the graph query language should treat paths as first-class citizens. Our result is G-CORE, a powerful graph query language design that fulfills these goals, and strikes a careful balance between path query expressivity and evaluation complexity.

Submitted 6 December, 2017; v1 submitted 5 December, 2017; originally announced December 2017.

arXiv:1208.4170 [pdf, other]

From Cooperative Scans to Predictive Buffer Management

Authors: Michał Świtakowski, Peter Boncz, Marcin Żukowski

Abstract: In analytical applications, database systems often need to sustain workloads with multiple concurrent scans hitting the same table. The Cooperative Scans (CScans) framework, which introduces an Active Buffer Manager (ABM) component into the database architecture, has been the most effective and elaborate response to this problem, and was initially developed in the X100 research prototype. We now report on the experience of integrating Cooperative Scans into its industrial-strength successor, the Vectorwise database product. During this implementation we invented a simpler optimization of concurrent scan buffer management, called Predictive Buffer Management (PBM). PBM is based on the observation that in a workload with long-running scans, the buffer manager has quite a bit of information on the workload in the immediate future, such that an approximation of the ideal OPT algorithm becomes feasible. In the evaluation on both synthetic benchmarks and a TPC-H throughput run, we compare the benefits of naive buffer management (LRU) versus CScans, PBM and OPT, showing that PBM achieves benefits close to Cooperative Scans while incurring much lower architectural impact.

Submitted 20 August, 2012; originally announced August 2012.

Comments: VLDB2012

Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 12, pp. 1759-1770 (2012)

Showing 1–16 of 16 results for author: Boncz, P