Gen-T: Table Reclamation in Data Lakes

Authors: Grace Fan, Roee Shraga, Renée J. Miller

Abstract: We introduce the problem of Table Reclamation. Given a Source Table and a large table repository, reclamation finds a set of tables that, when integrated, reproduce the source table as closely as possible. Unlike query discovery problems like Query-by-Example or by-Target, Table Reclamation focuses on reclaiming the data in the Source Table as fully as possible using real tables that may be incomplete or inconsistent. To do this, we define a new measure of table similarity, called error-aware instance similarity, to measure how close a reclaimed table is to a Source Table, a measure grounded in instance similarity used in data exchange. Our search covers not only SELECT-PROJECT-JOIN queries, but integration queries with unions, outerjoins, and the unary operators subsumption and complementation that have been shown to be important in data integration and fusion. Using reclamation, a data scientist can understand if any tables in a repository can be used to exactly reclaim a tuple in the Source. If not, one can understand if this is due to differences in values or to incompleteness in the data. Our solution, Gen-T, performs table discovery to retrieve a set of candidate tables from the table repository, filters these down to a set of originating tables, then integrates these tables to reclaim the Source as closely as possible. We show that our solution, while approximate, is accurate, efficient and scalable in the size of the table repository with experiments on real data lakes containing up to 15K tables, where the average number of tuples varies from small (web tables) to extremely large (open data tables) up to 1M tuples.

Submitted 22 March, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

Comments: to appear at ICDE 2024

arXiv:2403.02327 [pdf, other]

Model Lakes

Authors: Koyena Pal, David Bau, Renée J. Miller

Abstract: Given a set of deep learning models, it can be hard to find models appropriate to a task, understand the models, and characterize how models differ from one another. Currently, practitioners rely on manually-written documentation to understand and choose models. However, not all models have complete and reliable documentation. As the number of machine learning models increases, this issue of finding, differentiating, and understanding models is becoming more crucial. Inspired by research on data lakes, we introduce and define the concept of model lakes. We discuss fundamental research challenges in the management of large models. And we discuss what principled data management techniques can be brought to bear on the study of large model management.

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2310.02656 [pdf, other]

Blend: A Unified Data Discovery System

Authors: Mahdi Esmailoghli, Christoph Schnell, Renée J. Miller, Ziawasch Abedjan

Abstract: Data discovery is an iterative and incremental process that necessitates the execution of multiple data discovery queries to identify the desired tables from large and diverse data lakes. Current methodologies concentrate on single discovery tasks such as join, correlation, or union discovery. However, in practice, a series of these approaches and their corresponding index structures are necessary to enable the user to discover the desired tables. This paper presents BLEND, a comprehensive data discovery system that empowers users to develop ad-hoc discovery tasks without the need to develop new algorithms or build a new index structure. To achieve this goal, we introduce a general index structure capable of addressing multiple discovery queries. We develop a set of lower-level operators that serve as the fundamental building blocks for more complex and sophisticated user tasks. These operators are highly efficient and enable end-to-end efficiency. To enhance the execution of the discovery pipeline, we rewrite the search queries into optimized SQL statements to push the data operators down to the database. We demonstrate that our holistic system is able to achieve comparable effectiveness and runtime efficiency to the individual state-of-the-art approaches specifically designed for a single task.

Submitted 4 October, 2023; originally announced October 2023.

arXiv:2308.03883 [pdf, other]

Generative Benchmark Creation for Table Union Search

Authors: Koyena Pal, Aamod Khatiwada, Roee Shraga, Renée J. Miller

Abstract: Data management has traditionally relied on synthetic data generators to generate structured benchmarks, like the TPC suite, where we can control important parameters like data size and its distribution precisely. These benchmarks were central to the success and adoption of database management systems. But more and more, data management problems are of a semantic nature. An important example is finding tables that can be unioned. While any two tables with the same cardinality can be unioned, table union search is the problem of finding tables whose union is semantically coherent. Semantic problems cannot be benchmarked using synthetic data. Our current methods for creating benchmarks involve the manual curation and labeling of real data. These methods are not robust or scalable and perhaps more importantly, it is not clear how robust the created benchmarks are. We propose to use generative AI models to create structured data benchmarks for table union search. We present a novel method for using generative models to create tables with specified properties. Using this method, we create a new benchmark containing pairs of tables that are both unionable and non-unionable but related. We thoroughly evaluate recent existing table union search methods over existing benchmarks and our new benchmark. We also present and evaluate new table search methods based on recent large language models over all benchmarks. We show that the new benchmark is more challenging for all methods than hand-curated benchmarks, specifically, the top-performing method achieves a Mean Average Precision of around 60%, over 30% less than its performance on existing manually created benchmarks. We examine why this is the case and show that the new benchmark permits more detailed analysis of methods, including a study of both false positives and false negatives that were not possible with existing benchmarks.
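For readers unfamiliar with the metric quoted in this abstract, the following is a minimal sketch of one common definition of Mean Average Precision for ranked search results. It is not taken from the paper, which may use a cutoff variant such as MAP@k; the function names and the toy table names are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average of precision@rank over the ranks where a relevant item appears.

    ranked   -- list of returned table names, best first
    relevant -- set of table names labelled unionable with the query
    """
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Normalise by the number of relevant items (one common convention).
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average_precision over (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data: two queries with hypothetical result lists and ground truth.
runs = [
    (["t1", "t4", "t2"], {"t1", "t2"}),
    (["t9", "t3"], {"t3"}),
]
map_score = mean_average_precision(runs)
```

A benchmark on which the best method scores around 0.6 under a measure like this is one where relevant tables are frequently ranked below irrelevant ones.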

Submitted 7 August, 2023; originally announced August 2023.

arXiv:2304.08285 [pdf, other]

doi 10.1145/3555041.3589732

DIALITE: Discover, Align and Integrate Open Data Tables

Authors: Aamod Khatiwada, Roee Shraga, Renée J. Miller

Abstract: We demonstrate a novel table discovery pipeline called DIALITE that allows users to discover, integrate and analyze open data tables. DIALITE has three main stages. First, it allows users to discover tables from open data platforms using state-of-the-art table discovery techniques. Second, DIALITE integrates the discovered tables to produce an integrated table. Finally, it allows users to analyze the integration result by applying different downstream tasks over it. Our pipeline is flexible such that the user can easily add and compare additional discovery and integration algorithms.

Submitted 17 April, 2023; originally announced April 2023.

Comments: SIGMOD 2023

arXiv:2301.13095 [pdf, other]

Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V (Technical Report)

Authors: Roee Shraga, Renée J. Miller

Abstract: In multi-user environments in which data science and analysis is collaborative, multiple versions of the same datasets are generated. While managing and storing data versions has received some attention in the research literature, the semantic nature of such changes has remained under-explored. In this work, we introduce Explain-Da-V, a framework aiming to explain changes between two given dataset versions. Explain-Da-V generates explanations that use data transformations to explain changes. We further introduce a set of measures that evaluate the validity, generalizability, and explainability of these explanations. We empirically show, using an adapted existing benchmark and a newly created benchmark, that Explain-Da-V generates better explanations than existing data transformation synthesis methods.

Submitted 30 January, 2023; originally announced January 2023.

Comments: To appear in VLDB 2023

arXiv:2209.13589 [pdf, other]

SANTOS: Relationship-based Semantic Table Union Search

Authors: Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald

Abstract: Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover semantic relationships between pairs of columns. The first uses an existing knowledge base (KB), the second (which we call a "synthesized KB") uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm, called SANTOS, outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating synthesized KBs from data lakes with limited KB coverage and using them for union search.

Submitted 27 September, 2022; originally announced September 2022.

Comments: 15 pages, 10 figures, to appear at SIGMOD 2023

arXiv:2103.09940 [pdf, other]

DomainNet: Homograph Detection for Data Lake Disambiguation

Authors: Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald

Abstract: Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management and data science, we show that data lakes provide a new opportunity for disambiguation of data values since they represent a massive network of interconnected values. We investigate to what extent this network can be used to disambiguate values. DomainNet uses network-centrality measures on a bipartite graph whose nodes represent values and attributes to determine, without supervision, if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs has a precision and a recall of 38% versus 69% with our method on a synthetic benchmark. By applying a network-centrality measure to our graph representation, DomainNet achieves a good separation between homographs and data values with a unique meaning. On a real data lake our top-200 precision is 89%.
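The idea of scoring values by centrality on a value-attribute bipartite graph can be illustrated with a small sketch. Everything below is an assumption made for illustration: the abstract only says "network-centrality measures", so betweenness centrality (computed here with Brandes' algorithm) is one plausible choice, and the toy columns and the helper names `build_bipartite` and `betweenness` are hypothetical rather than DomainNet's code.

```python
from collections import deque


def build_bipartite(columns):
    """Adjacency sets for a graph linking each attribute to the values it contains."""
    adj = {}
    for attr, values in columns.items():
        a = ("attr", attr)
        adj.setdefault(a, set())
        for v in values:
            node = ("value", v)
            adj.setdefault(node, set()).add(a)
            adj[a].add(node)
    return adj


def betweenness(adj):
    """Unnormalised betweenness centrality (Brandes) on an undirected, unweighted graph."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order = []                                  # nodes in BFS visiting order
        preds = {v: [] for v in adj}                # shortest-path predecessors
        sigma = dict.fromkeys(adj, 0); sigma[s] = 1  # number of shortest paths from s
        dist = dict.fromkeys(adj, -1); dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):                   # accumulate dependencies backwards
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: c / 2 for v, c in bc.items()}        # each undirected pair was counted twice


# Toy lake: "jaguar" appears as an animal and as a car make.
columns = {
    "animal": ["jaguar", "puma", "lion"],
    "car_make": ["jaguar", "toyota", "ford"],
}
scores = betweenness(build_bipartite(columns))
value_scores = {n[1]: s for n, s in scores.items() if n[0] == "value"}
top_value = max(value_scores, key=value_scores.get)
```

In this toy graph the value shared by two unrelated attributes bridges otherwise disconnected neighbourhoods, so it receives the highest score among value nodes, while values that occur in a single column score zero.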

Submitted 22 March, 2021; v1 submitted 17 March, 2021; originally announced March 2021.

Comments: Full version of paper appearing in EDBT 2021

arXiv:2008.01208 [pdf, other]

Knowledge Translation: Extended Technical Report

Authors: Bahar Ghadiri Bashardoost, Renée J. Miller, Kelly Lyons, Fatemeh Nargesian

Abstract: We introduce Kensho, a tool for generating mapping rules between two Knowledge Bases (KBs). To create the mapping rules, Kensho starts with a set of correspondences and enriches them with additional semantic information automatically identified from the structure and constraints of the KBs. Our approach works in two phases. In the first phase, semantic associations between resources of each KB are captured. In the second phase, mapping rules are generated by interpreting the correspondences in a way that respects the discovered semantic associations among elements of each KB. Kensho's mapping rules are expressed using SPARQL queries and can be used directly to exchange knowledge from source to target. Kensho is able to automatically rank the generated mapping rules using a set of heuristics. We present an experimental evaluation of Kensho and assess our mapping generation and ranking strategies using more than 50 synthesized and real world settings, chosen to showcase some of the most important applications of knowledge translation. In addition, we use three existing benchmarks to demonstrate Kensho's ability to deal with different mapping scenarios.

Submitted 3 August, 2020; originally announced August 2020.

Comments: Extended technical report of "Knowledge Translation" paper, accepted in VLDB 2020

arXiv:1812.07024 [pdf, other]

Data Lake Organization

Authors: Fatemeh Nargesian, Ken Q. Pu, Bahar Ghadiri Bashardoost, Erkang Zhu, Renée J. Miller

Abstract: We consider the problem of creating a navigation structure that allows a user to most effectively navigate a data lake. We define an organization as a graph that contains nodes representing sets of attributes within a data lake and edges indicating subset relationships among nodes. We present a new probabilistic model of how users interact with an organization and define the likelihood of a user finding a table using the organization. We propose the data lake organization problem as the problem of finding an organization that maximizes the expected probability of discovering tables by navigating an organization. We propose an approximate algorithm for the data lake organization problem. We show the effectiveness of the algorithm on both real data lakes containing data from open data portals and on benchmarks that emulate the observed characteristics of real data lakes. Through a formal user study, we show that navigation can help users discover relevant tables that cannot be found by keyword search. In addition, in our study, 42% of users preferred the use of navigation and 58% preferred keyword search, suggesting these are complementary and both useful modalities for data discovery in data lakes. Our experiments show that data lake organizations take into account the data lake distribution and outperform an existing hand-curated taxonomy and a common baseline organization.

Submitted 2 March, 2020; v1 submitted 17 December, 2018; originally announced December 2018.

arXiv:1702.03447 [pdf, ps, other]

A Collective, Probabilistic Approach to Schema Mapping: Appendix

Authors: Angelika Kimmig, Alex Memory, Renee J. Miller, Lise Getoor

Abstract: In this appendix we provide additional supplementary material to "A Collective, Probabilistic Approach to Schema Mapping." We include an additional extended example, supplementary experiment details, and proof for the complexity result stated in the main paper.

Submitted 11 February, 2017; originally announced February 2017.

Comments: This is the appendix to the paper "A Collective, Probabilistic Approach to Schema Mapping" accepted to ICDE 2017

arXiv:1603.07410 [pdf, other]

LSH Ensemble: Internet-Scale Domain Search

Authors: Erkang Zhu, Fatemeh Nargesian, Ken Q. Pu, Renée J. Miller

Abstract: We study the problem of domain search where a domain is a set of distinct values from an unspecified universe. We use Jaccard set containment, defined as $|Q \cap X|/|Q|$, as the relevance measure of a domain $X$ to a query domain $Q$. Our choice of Jaccard set containment over Jaccard similarity makes our work particularly suitable for searching Open Data and data on the web, as Jaccard similarity is known to have poor performance over sets with large differences in their domain sizes. We demonstrate that the domains found in several real-life Open Data and web data repositories show a power-law distribution over their domain sizes. We present a new index structure, Locality Sensitive Hashing (LSH) Ensemble, that solves the domain search problem using set containment at Internet scale. Our index structure and search algorithm cope with the data volume and skew by means of data sketches (MinHash) and domain partitioning. Our index structure does not assume a prescribed set of values. We construct a cost model that describes the accuracy of LSH Ensemble with any given partitioning. This allows us to formulate the partitioning for LSH Ensemble as an optimization problem. We prove that there exists an optimal partitioning for any distribution. Furthermore, for datasets following a power-law distribution, as observed in Open Data and Web data corpora, we show that the optimal partitioning can be approximated using equi-depth, making it efficient to use in practice. We evaluate our algorithm using real data (Canadian Open Data and WDC Web Tables) containing over 262 M domains. The experiments demonstrate that our index consistently outperforms other leading alternatives in accuracy and performance. The improvements are most dramatic for data with large skew in the domain sizes. Even at 262 M domains, our index sustains query performance with under 3 seconds response time.
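The contrast between set containment and Jaccard similarity drawn in this abstract can be seen on a toy example. The sketch below computes both measures exactly on small Python sets; it does not implement MinHash sketches or the LSH Ensemble index, and the function names and sample domains are hypothetical.

```python
def containment(q, x):
    """Set containment |Q ∩ X| / |Q| of query domain q in candidate domain x."""
    return len(q & x) / len(q) if q else 0.0


def jaccard_similarity(q, x):
    """Jaccard similarity |Q ∩ X| / |Q ∪ X|."""
    union = q | x
    return len(q & x) / len(union) if union else 0.0


# A small query domain and two candidate domains that both fully contain it.
query = {"ON", "QC", "BC"}
small = {"ON", "QC", "BC", "AB"}
large = small | {f"v{i}" for i in range(996)}  # 1000 distinct values in total

scores = {
    "small": (containment(query, small), jaccard_similarity(query, small)),
    "large": (containment(query, large), jaccard_similarity(query, large)),
}
```

Both candidates contain every query value, so their containment is identical, but the Jaccard similarity of the large domain collapses because its size dominates the union. This is the size-skew effect the abstract gives as the reason for preferring containment.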

Submitted 23 July, 2016; v1 submitted 23 March, 2016; originally announced March 2016.

Comments: To appear in VLDB 2016

ACM Class: H.2.5; H.3.3; H.3.1

arXiv:0908.0567 [pdf]

LinkedCT: A Linked Data Space for Clinical Trials

Authors: Oktie Hassanzadeh, Anastasios Kementsietsidis, Lipyeow Lim, Renee J. Miller, Min Wang

Abstract: The Linked Clinical Trials (LinkedCT) project aims at publishing the first open semantic web data source for clinical trials data. The database exposed by LinkedCT is generated by (1) transforming existing data sources of clinical trials into RDF, and (2) discovering semantic links between the records in the trials data and several other data sources. In this paper, we discuss several challenges involved in these two steps and present the methodology used in LinkedCT to overcome these challenges. Our approach for semantic link discovery involves using state-of-the-art approximate string matching techniques combined with ontology-based semantic matching of the records, all performed in a declarative and easy-to-use framework. We present an evaluation of the performance of our proposed techniques in several link discovery scenarios in LinkedCT.

Submitted 4 August, 2009; originally announced August 2009.

Comments: 5 pages, 1 figure, 4 tables

ACM Class: H.2.8; H.3.5; J.3

Showing 1–13 of 13 results for author: Miller, R J