Search | arXiv e-print repository

A Scalable Space-efficient In-database Interpretability Framework for Embedding-based Semantic SQL Queries

Authors: Prabhakar Kudva, Rajesh Bordawekar, Apoorva Nitsure

Abstract: AI-Powered database (AI-DB) is a novel relational database system that uses a self-supervised neural network, database embedding, to enable semantic SQL queries on relational tables. In this paper, we describe an architecture and implementation of in-database interpretability infrastructure designed to provide simple, transparent, and relatable insights into ranked results of semantic SQL queries… ▽ More AI-Powered database (AI-DB) is a novel relational database system that uses a self-supervised neural network, database embedding, to enable semantic SQL queries on relational tables. In this paper, we describe an architecture and implementation of in-database interpretability infrastructure designed to provide simple, transparent, and relatable insights into ranked results of semantic SQL queries supported by AI-DB. We introduce a new co-occurrence based interpretability approach to capture relationships between relational entities and describe a space-efficient probabilistic Sketch implementation to store and process co-occurrence counts. Our approach provides both query-agnostic (global) and query-specific (local) interpretabilities. Experimental evaluation demonstrate that our in-database probabilistic approach provides the same interpretability quality as the precise space-inefficient approach, while providing scalable and space efficient runtime behavior (up to 8X space savings), without any user intervention. △ Less

Submitted 1 March, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

arXiv:2102.02705 [pdf, other]

EFloat: Entropy-coded Floating Point Format for Compressing Vector Embedding Models

Authors: Rajesh Bordawekar, Bulent Abali, Ming-Hung Chen

Abstract: In a large class of deep learning models, including vector embedding models such as word and database embeddings, we observe that floating point exponent values cluster around a few unique values, permitting entropy based data compression. Entropy coding compresses fixed-length values with variable-length codes, encoding most probable values with fewer bits. We propose the EFloat compressed floati… ▽ More In a large class of deep learning models, including vector embedding models such as word and database embeddings, we observe that floating point exponent values cluster around a few unique values, permitting entropy based data compression. Entropy coding compresses fixed-length values with variable-length codes, encoding most probable values with fewer bits. We propose the EFloat compressed floating point number format that uses a variable field boundary between the exponent and significand fields. EFloat uses entropy coding on exponent values and signs to minimize the average width of the exponent and sign fields, while preserving the original FP32 exponent range unchanged. Saved bits become part of the significand field increasing the EFloat numeric precision by 4.3 bits on average compared to other reduced-precision floating point formats. EFloat makes 8-bit and even smaller floats practical without sacrificing the exponent range of a 32-bit floating point representation. We currently use the EFloat format for saving memory capacity and bandwidth consumption of large vector embedding models such as those used for database embeddings. Using the RMS error as metric, we demonstrate that EFloat provides higher accuracy than other floating point formats with equal bit budget. The EF12 format with 12-bit budget has less end-to-end application error than the 16-bit BFloat16. EF16 with 16-bit budget has an RMS-error 17 to 35 times less than BF16 RMS-error for a diverse set of embedding models. When making similarity and dissimilarity queries, using the NDCG ranking metric, EFloat matches the result quality of prior floating point representations with larger bit budgets. △ Less

Submitted 2 February, 2022; v1 submitted 4 February, 2021; originally announced February 2021.

arXiv:2005.09617

Unlocking New York City Crime Insights using Relational Database Embeddings

Authors: Apoorva Nitsure, Rajesh Bordawekar, Jose Neves

Abstract: This version withdrawn by arXiv administrators because the author did not have the right to agree to our license at the time of submission. This version withdrawn by arXiv administrators because the author did not have the right to agree to our license at the time of submission. △ Less

Submitted 20 May, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

Comments: arXiv admin note: This version withdrawn by arXiv administrators because the author did not have the right to agree to our license at the time of submission

arXiv:1712.07199 [pdf, other]

Cognitive Database: A Step towards Endowing Relational Databases with Artificial Intelligence Capabilities

Authors: Rajesh Bordawekar, Bortik Bandyopadhyay, Oded Shmueli

Abstract: We propose Cognitive Databases, an approach for transparently enabling Artificial Intelligence (AI) capabilities in relational databases. A novel aspect of our design is to first view the structured data source as meaningful unstructured text, and then use the text to build an unsupervised neural network model using a Natural Language Processing (NLP) technique called word embedding. This model ca… ▽ More We propose Cognitive Databases, an approach for transparently enabling Artificial Intelligence (AI) capabilities in relational databases. A novel aspect of our design is to first view the structured data source as meaningful unstructured text, and then use the text to build an unsupervised neural network model using a Natural Language Processing (NLP) technique called word embedding. This model captures the hidden inter-/intra-column relationships between database tokens of different types. For each database token, the model includes a vector that encodes contextual semantic relationships. We seamlessly integrate the word embedding model into existing SQL query infrastructure and use it to enable a new class of SQL-based analytics queries called cognitive intelligence (CI) queries. CI queries use the model vectors to enable complex queries such as semantic matching, inductive reasoning queries such as analogies, predictive queries using entities not present in a database, and, more generally, using knowledge from external sources. We demonstrate unique capabilities of Cognitive Databases using an Apache Spark based prototype to execute inductive reasoning CI queries over a multi-modal database containing text and images. We believe our first-of-a-kind system exemplifies using AI functionality to endow relational databases with capabilities that were previously very hard to realize in practice. △ Less

Submitted 19 December, 2017; originally announced December 2017.

arXiv:1603.07185 [pdf, other]

Enabling Cognitive Intelligence Queries in Relational Databases using Low-dimensional Word Embeddings

Authors: Rajesh Bordawekar, Oded Shmueli

Abstract: We apply distributed language embedding methods from Natural Language Processing to assign a vector to each database entity associated token (for example, a token may be a word occurring in a table row, or the name of a column). These vectors, of typical dimension 200, capture the meaning of tokens based on the contexts in which the tokens appear together. To form vectors, we apply a learning meth… ▽ More We apply distributed language embedding methods from Natural Language Processing to assign a vector to each database entity associated token (for example, a token may be a word occurring in a table row, or the name of a column). These vectors, of typical dimension 200, capture the meaning of tokens based on the contexts in which the tokens appear together. To form vectors, we apply a learning method to a token sequence derived from the database. We describe various techniques for extracting token sequences from a database. The techniques differ in complexity, in the token sequences they output and in the database information used (e.g., foreign keys). The vectors can be used to algebraically quantify semantic relationships between the tokens such as similarities and analogies. Vectors enable a dual view of the data: relational and (meaningful rather than purely syntactical) text. We introduce and explore a new class of queries called cognitive intelligence (CI) queries that extract information from the database based, in part, on the relationships encoded by vectors. We have implemented a prototype system on top of Spark to exhibit the power of CI queries. Here, CI queries are realized via SQL UDFs. This power goes far beyond text extensions to relational systems due to the information encoded in vectors. We also consider various extensions to the basic scheme, including using a collection of views derived from the database to focus on a domain of interest, utilizing vectors and/or text from external sources, maintaining vectors as the database evolves and exploring a database without utilizing its schema. For the latter, we consider minimal extensions to SQL to vastly improve query expressiveness. △ Less

Submitted 23 March, 2016; originally announced March 2016.

Comments: Submitted to VLDB

Showing 1–5 of 5 results for author: Bordawekar, R