Showing 1–7 of 7 results for author: Mior, M J

  1. arXiv:2407.03286

    cs.DB

    Large Language Models for JSON Schema Discovery

    Authors: Michael J. Mior

    Abstract: Semi-structured data formats such as JSON have proved to be useful data models for applications that require flexibility in the format of data stored. However, JSON data often come without the schemas that are typically available with relational data. This has resulted in a number of tools for discovering schemas from a collection of data. Although such tools can be useful, existing approaches foc…

    Submitted 3 July, 2024; originally announced July 2024.

  2. arXiv:2307.12807

    cs.DB cs.AI

    Comprehending Semantic Types in JSON Data with Graph Neural Networks

    Authors: Shuang Wei, Michael J. Mior

    Abstract: Semantic types are a more powerful and detailed way of describing data than atomic types such as strings or integers. They establish connections between columns and concepts from the real world, providing more nuanced and fine-grained information that can be useful for tasks such as automated data cleaning, schema matching, and data discovery. Existing deep learning models trained on large text co…

    Submitted 24 July, 2023; originally announced July 2023.

  3. arXiv:2307.03113

    cs.DB

    JSONoid: Monoid-based Enrichment for Configurable and Scalable Data-Driven Schema Discovery

    Authors: Michael J. Mior

    Abstract: Schema discovery is an important aspect to working with data in formats such as JSON. Unlike relational databases, JSON data sets often do not have associated structural information. Consumers of such datasets are often left to browse through data in an attempt to observe commonalities in structure across documents to construct suitable code for data processing. However, this process is time-consu…

    Submitted 6 July, 2023; originally announced July 2023.

  4. Learning from Uncurated Regular Expressions

    Authors: Michael J. Mior

    Abstract: Significant work has been done on learning regular expressions from a set of data values. Depending on the domain, this approach can be very successful. However, significant time is required to learn these expressions and the resulting expressions can become either very complex or inaccurate in the presence of dirty data. The alternative of manually writing regular expressions becomes unattractive…

    Submitted 23 June, 2023; v1 submitted 14 June, 2022; originally announced June 2022.

  5. arXiv:2111.10398

    cs.DB

    Fast Discovery of Nested Dependencies on JSON Data

    Authors: Michael J. Mior

    Abstract: Functional and inclusion dependencies are the most widely used classes of data dependencies in data profiling due to their ability to identify relationships in data such as primary and foreign keys. These relationships are equally important when dealing with nested data formats such as JSON. However, the definition of functional and inclusion dependencies makes use of a flat, unnested relational m…

    Submitted 19 November, 2021; originally announced November 2021.

    ACM Class: H.3.4

  6. arXiv:1903.08621

    cs.DB cs.LG

    Column2Vec: Structural Understanding via Distributed Representations of Database Schemas

    Authors: Michael J. Mior, Alexander G. Ororbia II

    Abstract: We present Column2Vec, a distributed representation of database columns based on column metadata. Our distributed representation has several applications. Using known names for groups of columns (i.e., a table name), we train a model to generate an appropriate name for columns in an unnamed table. We demonstrate the viability of our approach using schema information collected from open source appl…

    Submitted 20 March, 2019; originally announced March 2019.

  7. Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

    Authors: Edmon Begoli, Jesús Camacho Rodríguez, Julian Hyde, Michael J. Mior, Daniel Lemire

    Abstract: Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of proces…

    Submitted 27 February, 2018; originally announced February 2018.

    Comments: SIGMOD'18