Blend: A Unified Data Discovery System
Authors:
Mahdi Esmailoghli,
Christoph Schnell,
Renée J. Miller,
Ziawasch Abedjan
Abstract:
Data discovery is an iterative and incremental process that requires executing multiple data discovery queries to identify the desired tables in large and diverse data lakes. Current methods concentrate on single discovery tasks such as join, correlation, or union discovery. In practice, however, a series of these approaches and their corresponding index structures are necessary to enable the user to discover the desired tables. This paper presents BLEND, a comprehensive data discovery system that empowers users to develop ad-hoc discovery tasks without having to develop new algorithms or build a new index structure. To achieve this goal, we introduce a general index structure capable of addressing multiple discovery queries. We develop a set of lower-level operators that serve as the fundamental building blocks for more complex and sophisticated user tasks. These operators are highly efficient and enable end-to-end efficiency. To speed up the execution of the discovery pipeline, we rewrite the search queries into optimized SQL statements that push the data operators down to the database. We demonstrate that our holistic system achieves effectiveness and runtime efficiency comparable to those of the individual state-of-the-art approaches specifically designed for a single task.
Submitted 4 October, 2023;
originally announced October 2023.
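The idea in the BLEND abstract of answering discovery queries from one general index through SQL pushed down to the database can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `cells` table layout, the function names `build_index` and `joinable_columns`, and the use of SQLite are hypothetical and are not taken from the paper.

```python
import sqlite3


def build_index(tables):
    """Load every cell of every table into one relational index.

    `tables` maps a table name to a list of row tuples. The schema
    (value, table_id, column_id, row_id) is a hypothetical layout.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cells ("
        "value TEXT, table_id TEXT, column_id INTEGER, row_id INTEGER)"
    )
    for table_id, rows in tables.items():
        for row_id, row in enumerate(rows):
            conn.executemany(
                "INSERT INTO cells VALUES (?, ?, ?, ?)",
                [(str(v), table_id, col_id, row_id)
                 for col_id, v in enumerate(row)],
            )
    conn.execute("CREATE INDEX idx_cells_value ON cells(value)")
    return conn


def joinable_columns(conn, query_values, k=5):
    """Rank (table, column) pairs by distinct-value overlap with the query.

    The grouping, counting, and top-k selection all run inside the database.
    """
    values = sorted({str(v) for v in query_values})
    if not values:
        return []
    placeholders = ",".join("?" * len(values))
    sql = (
        "SELECT table_id, column_id, COUNT(DISTINCT value) AS overlap "
        f"FROM cells WHERE value IN ({placeholders}) "
        "GROUP BY table_id, column_id "
        "ORDER BY overlap DESC, table_id, column_id "
        "LIMIT ?"
    )
    return conn.execute(sql, values + [k]).fetchall()


lake = {
    "cities": [("berlin", "germany"), ("paris", "france")],
    "people": [("alice", "berlin"), ("bob", "rome")],
}
index = build_index(lake)
ranked = joinable_columns(index, ["berlin", "paris"])
```

In this sketch the same `cells` table could serve other discovery queries by changing only the SQL statement, which is the property the abstract attributes to a general index.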
MATE: Multi-Attribute Table Extraction
Authors:
Mahdi Esmailoghli,
Jorge-Arnulfo Quiané-Ruiz,
Ziawasch Abedjan
Abstract:
A core operation in data discovery is to find joinable tables for a given table. Real-world tables include both unary and n-ary join keys. However, existing table discovery systems are optimized for unary joins and are ineffective and slow in the presence of n-ary keys. In this paper, we introduce MATE, a table discovery system that leverages a novel hash-based index to enable n-ary join discovery through a space-efficient super key. We design a filtering layer that uses a novel hash function, XASH. This hash function encodes the syntactic features of all column values and aggregates them into a super key, which allows the system to efficiently prune tables with non-joinable rows. Our join discovery system prunes up to 1000x more false positives and achieves over 60x faster table discovery than the state of the art.
Submitted 25 April, 2022; v1 submitted 1 October, 2021;
originally announced October 2021.
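The super-key filtering described in the MATE abstract can be sketched as a bitwise containment test. This is a simplified illustration under stated assumptions, not the paper's method: `value_mask` is a placeholder that sets one bit from a SHA-256 digest and is not XASH, and the 64-bit key width and all function names are hypothetical.

```python
import hashlib

KEY_BITS = 64  # assumed key width, chosen only for this sketch


def value_mask(value):
    """Placeholder per-value hash (NOT XASH): set one bit picked from a digest."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return 1 << (int.from_bytes(digest[:8], "big") % KEY_BITS)


def super_key(row):
    """Aggregate the masks of all values in a row with bitwise OR."""
    key = 0
    for value in row:
        key |= value_mask(value)
    return key


def may_contain(row_key, query_values):
    """Return False only if the row certainly lacks some query value.

    True means "possibly joinable": false positives can occur,
    false negatives cannot, so surviving rows still need an exact check.
    """
    query_key = super_key(query_values)
    return (row_key & query_key) == query_key


def candidate_rows(rows, query_values):
    """Keep only the rows that pass the super-key filter."""
    return [row for row in rows if may_contain(super_key(row), query_values)]


rows = [("berlin", "germany", "2021"), ("paris", "france", "2020")]
survivors = candidate_rows(rows, ["berlin", "germany"])
```

The design choice illustrated here is that a row's key is computed once at indexing time, so an n-ary lookup needs one AND and one comparison per row before any value-by-value comparison.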