Showing 1–13 of 13 results for author: Miller, R J

  1. arXiv:2403.14128

    cs.DB

    Gen-T: Table Reclamation in Data Lakes

    Authors: Grace Fan, Roee Shraga, Renée J. Miller

    Abstract: We introduce the problem of Table Reclamation. Given a Source Table and a large table repository, reclamation finds a set of tables that, when integrated, reproduce the source table as closely as possible. Unlike query discovery problems like Query-by-Example or by-Target, Table Reclamation focuses on reclaiming the data in the Source Table as fully as possible using real tables that may be incomp…

    Submitted 22 March, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

    Comments: to appear at ICDE 2024

  2. arXiv:2403.02327

    cs.DB cs.AI

    Model Lakes

    Authors: Koyena Pal, David Bau, Renée J. Miller

    Abstract: Given a set of deep learning models, it can be hard to find models appropriate to a task, understand the models, and characterize how models are different one from another. Currently, practitioners rely on manually-written documentation to understand and choose models. However, not all models have complete and reliable documentation. As the number of machine learning models increases, this issue o…

    Submitted 4 March, 2024; originally announced March 2024.

  3. arXiv:2310.02656

    cs.DB

    Blend: A Unified Data Discovery System

    Authors: Mahdi Esmailoghli, Christoph Schnell, Renée J. Miller, Ziawasch Abedjan

    Abstract: Data discovery is an iterative and incremental process that necessitates the execution of multiple data discovery queries to identify the desired tables from large and diverse data lakes. Current methodologies concentrate on single discovery tasks such as join, correlation, or union discovery. However, in practice, a series of these approaches and their corresponding index structures are necessary…

    Submitted 4 October, 2023; originally announced October 2023.

  4. arXiv:2308.03883

    cs.DB cs.CL cs.LG

    Generative Benchmark Creation for Table Union Search

    Authors: Koyena Pal, Aamod Khatiwada, Roee Shraga, Renée J. Miller

    Abstract: Data management has traditionally relied on synthetic data generators to generate structured benchmarks, like the TPC suite, where we can control important parameters like data size and its distribution precisely. These benchmarks were central to the success and adoption of database management systems. But more and more, data management problems are of a semantic nature. An important example is fi…

    Submitted 7 August, 2023; originally announced August 2023.

  5. DIALITE: Discover, Align and Integrate Open Data Tables

    Authors: Aamod Khatiwada, Roee Shraga, Renée J. Miller

    Abstract: We demonstrate a novel table discovery pipeline called DIALITE that allows users to discover, integrate and analyze open data tables. DIALITE has three main stages. First, it allows users to discover tables from open data platforms using state-of-the-art table discovery techniques. Second, DIALITE integrates the discovered tables to produce an integrated table. Finally, it allows users to analyze…

    Submitted 17 April, 2023; originally announced April 2023.

    Comments: SIGMOD 2023

  6. arXiv:2301.13095

    cs.DB

    Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V (Technical Report)

    Authors: Roee Shraga, Renée J. Miller

    Abstract: In multi-user environments in which data science and analysis is collaborative, multiple versions of the same datasets are generated. While managing and storing data versions has received some attention in the research literature, the semantic nature of such changes has remained under-explored. In this work, we introduce Explain-Da-V, a framework aiming to explain changes between two give…

    Submitted 30 January, 2023; originally announced January 2023.

    Comments: To appear in VLDB 2023

  7. arXiv:2209.13589

    cs.DB

    SANTOS: Relationship-based Semantic Table Union Search

    Authors: Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald

    Abstract: Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of union search. Consequently, we introduce a new n…

    Submitted 27 September, 2022; originally announced September 2022.

    Comments: 15 pages, 10 figures, to appear at SIGMOD 2023

  8. arXiv:2103.09940

    cs.DB

    DomainNet: Homograph Detection for Data Lake Disambiguation

    Authors: Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald

    Abstract: Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management and data science, we sh…

    Submitted 22 March, 2021; v1 submitted 17 March, 2021; originally announced March 2021.

    Comments: Full version of paper appearing in EDBT 2021

  9. arXiv:2008.01208

    cs.DB

    Knowledge Translation: Extended Technical Report

    Authors: Bahar Ghadiri Bashardoost, Renée J. Miller, Kelly Lyons, Fatemeh Nargesian

    Abstract: We introduce Kensho, a tool for generating mapping rules between two Knowledge Bases (KBs). To create the mapping rules, Kensho starts with a set of correspondences and enriches them with additional semantic information automatically identified from the structure and constraints of the KBs. Our approach works in two phases. In the first phase, semantic associations between resources of each KB are…

    Submitted 3 August, 2020; originally announced August 2020.

    Comments: Extended technical report of "Knowledge Translation" paper, accepted in VLDB 2020

  10. arXiv:1812.07024

    cs.DB

    Data Lake Organization

    Authors: Fatemeh Nargesian, Ken Q. Pu, Bahar Ghadiri Bashardoost, Erkang Zhu, Renée J. Miller

    Abstract: We consider the problem of creating a navigation structure that allows a user to most effectively navigate a data lake. We define an organization as a graph that contains nodes representing sets of attributes within a data lake and edges indicating subset relationships among nodes. We present a new probabilistic model of how users interact with an organization and define the likelihood of a user f…

    Submitted 2 March, 2020; v1 submitted 17 December, 2018; originally announced December 2018.

  11. arXiv:1702.03447

    cs.DB cs.LG

    A Collective, Probabilistic Approach to Schema Mapping: Appendix

    Authors: Angelika Kimmig, Alex Memory, Renee J. Miller, Lise Getoor

    Abstract: In this appendix we provide additional supplementary material to "A Collective, Probabilistic Approach to Schema Mapping." We include an additional extended example, supplementary experiment details, and proof for the complexity result stated in the main paper.

    Submitted 11 February, 2017; originally announced February 2017.

    Comments: This is the appendix to the paper "A Collective, Probabilistic Approach to Schema Mapping" accepted to ICDE 2017

  12. arXiv:1603.07410

    cs.DB

    LSH Ensemble: Internet-Scale Domain Search

    Authors: Erkang Zhu, Fatemeh Nargesian, Ken Q. Pu, Renée J. Miller

    Abstract: We study the problem of domain search where a domain is a set of distinct values from an unspecified universe. We use Jaccard set containment, defined as $|Q \cap X|/|Q|$, as the relevance measure of a domain $X$ to a query domain $Q$. Our choice of Jaccard set containment over Jaccard similarity makes our work particularly suitable for searching Open Data and data on the web, as Jaccard similarit…

    Submitted 23 July, 2016; v1 submitted 23 March, 2016; originally announced March 2016.

    Comments: To appear in VLDB 2016

    ACM Class: H.2.5; H.3.3; H.3.1

  13. arXiv:0908.0567

    cs.DB cs.CE cs.IR

    LinkedCT: A Linked Data Space for Clinical Trials

    Authors: Oktie Hassanzadeh, Anastasios Kementsietsidis, Lipyeow Lim, Renee J. Miller, Min Wang

    Abstract: The Linked Clinical Trials (LinkedCT) project aims at publishing the first open semantic web data source for clinical trials data. The database exposed by LinkedCT is generated by (1) transforming existing data sources of clinical trials into RDF, and (2) discovering semantic links between the records in the trials data and several other data sources. In this paper, we discuss several challenges…

    Submitted 4 August, 2009; originally announced August 2009.

    Comments: 5 pages, 1 figure, 4 tables

    ACM Class: H.2.8; H.3.5; J.3