Showing 1–2 of 2 results for author: Damalapati, P K

Search v0.5.6 released 2020-02-24

arXiv:2310.18742 [pdf, other]

cs.DB

Data Ambiguity Strikes Back: How Documentation Improves GPT's Text-to-SQL

Authors: Zezhou Huang, Pavan Kalyan Damalapati, Eugene Wu

Abstract: Text-to-SQL allows experts to use databases without in-depth knowledge of them. However, real-world tasks have both query and data ambiguities. Most works on Text-to-SQL focused on query ambiguities and designed chat interfaces for experts to provide clarifications. In contrast, the data management community has long studied data ambiguities, but mainly addresses error detection and correction, rather than documenting them for disambiguation in data tasks. This work delves into these data ambiguities in real-world datasets. We have identified prevalent data ambiguities of value consistency, data coverage, and data granularity that affect tasks. We examine how documentation, originally made to help humans to disambiguate data, can help GPT-4 with Text-to-SQL tasks. By offering documentation on these, we found GPT-4's performance improved by 28.9%.

Submitted 28 October, 2023; originally announced October 2023.

arXiv:2307.00417 [pdf, other]

cs.DB cs.HC

doi 10.1145/3597465.3605224

Aggregation Consistency Errors in Semantic Layers and How to Avoid Them

Authors: Zezhou Huang, Pavan Kalyan Damalapati, Eugene Wu

Abstract: Analysts often struggle with analyzing data from multiple tables in a database due to their lack of knowledge on how to join and aggregate the data. To address this, data engineers pre-specify "semantic layers" which include the join conditions and "metrics" of interest with aggregation functions and expressions. However, joins can cause "aggregation consistency issues". For example, analysts may observe inflated total revenue caused by double counting from join fanouts. Existing BI tools rely on heuristics for deduplication, resulting in imprecise and challenging-to-understand outcomes. To overcome these challenges, we propose "weighing" as a core primitive to counteract join fanouts. "Weighing" has been used in various areas, such as market attribution and order management, ensuring metrics consistency (e.g., total revenue remains the same) even for many-to-many joins. The idea is to assign equal weight to each join key group (rather than each tuple) and then distribute the weights among tuples. Implementing weighing techniques necessitates user input; therefore, we recommend a human-in-the-loop framework that enables users to iteratively explore different strategies and visualize the results.

Submitted 1 July, 2023; originally announced July 2023.

Journal ref: Proceedings of the Workshop on Human-In-the-Loop Data Analytics 2023
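
Illustrative note (not part of the listing): the "weighing" idea in the abstract above can be sketched on toy data. The tables, column names, and helper functions below are hypothetical, and the equal split of each group's weight is only one possible strategy; the abstract states that the choice of strategy needs user input, so this is a minimal sketch under those assumptions, not the authors' implementation.

```python
from collections import Counter

# Hypothetical toy data: each order carries exactly one revenue figure.
orders = [
    {"order_id": 1, "revenue": 100.0},
    {"order_id": 2, "revenue": 50.0},
]
# Hypothetical "many" side: order 1 matches two rows, which causes a join fanout.
order_tags = [
    {"order_id": 1, "tag": "promo"},
    {"order_id": 1, "tag": "gift"},
    {"order_id": 2, "tag": "promo"},
]


def inner_join(left, right, key):
    """Naive nested-loop inner join on a single key column."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def add_equal_weights(joined, key):
    """Give each join key group a total weight of 1, split evenly among its tuples.

    The even split is an assumption made for this sketch; other
    distributions (e.g. proportional to some column) are equally possible.
    """
    group_sizes = Counter(row[key] for row in joined)
    return [{**row, "weight": 1.0 / group_sizes[row[key]]} for row in joined]


joined = inner_join(orders, order_tags, "order_id")

# Summing revenue over the joined rows double counts order 1.
naive_total = sum(row["revenue"] for row in joined)

# Weighting each tuple restores the pre-join total.
weighted = add_equal_weights(joined, "order_id")
weighted_total = sum(row["revenue"] * row["weight"] for row in weighted)

true_total = sum(o["revenue"] for o in orders)

print(naive_total, weighted_total, true_total)
```

In this toy case the naive sum counts order 1 twice, while the weighted sum matches the total computed from the orders table alone. The sketch does not handle orders with no matching row, which an inner join drops entirely.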
