
Certain and Approximately Certain Models for Statistical Learning

Authors: Cheng Zhen, Nischal Aryal, Arash Termehchy, Alireza Aghasi, Amandeep Singh Chabada

Abstract: Real-world data is often incomplete and contains missing values. To train accurate models over real-world datasets, users need to spend a substantial amount of time and resources imputing and finding proper values for missing data items. In this paper, we demonstrate that it is possible to learn accurate models directly from data with missing values for certain training data and target models. We propose a unified approach for checking the necessity of data imputation to learn accurate models across various widely-used machine learning paradigms. We build efficient algorithms with theoretical guarantees to check this necessity and return accurate models in cases where imputation is unnecessary. Our extensive experiments indicate that our proposed algorithms significantly reduce the amount of time and effort needed for data imputation without imposing considerable computational overhead.

Submitted 1 March, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Comments: A technical report for a paper to appear at SIGMOD 2024

arXiv:2312.15472 [pdf, ps, other]

Towards Consistent Language Models Using Declarative Constraints

Authors: Jasmin Mousavi, Arash Termehchy

Abstract: Large language models have shown unprecedented abilities in generating linguistically coherent and syntactically correct natural language output. However, they often return incorrect and inconsistent answers to input questions. Due to the complexity and uninterpretability of the internally learned representations, it is challenging to modify language models such that they provide correct and consistent results. The data management community has developed various methods and tools for providing consistent answers over inconsistent datasets. In these methods, users specify the desired properties of data in a domain in the form of high-level declarative constraints. This approach has provided usable and scalable methods for delivering consistent information from inconsistent datasets. We aim to build upon this success and leverage these methods to modify language models such that they deliver consistent and accurate results. We investigate the challenges of using these ideas to obtain consistent and relevant answers from language models and report some preliminary empirical studies.

Submitted 24 December, 2023; originally announced December 2023.

arXiv:2312.14291 [pdf, other]

Multi-Agent Join

Authors: Vahid Ghadakchi, Mian Xie, Arash Termehchy, Bakhtiyar Doskenov, Bharghav Srikhakollu, Summit Haque, Huazheng Wang

Abstract: It is crucial to provide real-time performance in many applications, such as interactive and exploratory data analysis. In these settings, users often need to view subsets of query results quickly. It is challenging to deliver such results over large datasets for relational operators over multiple relations, such as join. Join algorithms usually spend a long time on scanning and attempting to join parts of relations that may not generate any result. Current solutions usually require lengthy and repeated preprocessing, which is costly and may not be possible to do in many settings. Also, they often support restricted types of joins. In this paper, we outline a novel approach for achieving efficient join processing in which a scan operator of the join learns, during query execution, the portions of its relations that might satisfy the join predicate. We further improve this method using an algorithm in which both scan operators collaboratively learn an efficient join execution strategy. We also show that this approach generalizes traditional and non-learning methods for joining. Our extensive empirical studies using standard benchmarks indicate that this approach outperforms similar methods considerably.

Submitted 21 December, 2023; originally announced December 2023.

arXiv:2312.09407 [pdf, other]

How Does User Behavior Evolve During Exploratory Visual Analysis?

Authors: Sanad Saha, Nischal Aryal, Leilani Battle, Arash Termehchy

Abstract: Exploratory visual analysis (EVA) is an essential stage of the data science pipeline, where users often lack clear analysis goals at the start and iteratively refine them as they learn more about their data. Accurate models of users' exploration behavior are becoming increasingly vital to developing responsive and personalized tools for exploratory visual analysis. Yet we observe a discrepancy between the static view of human exploration behavior adopted by many computational models versus the dynamic nature of EVA. In this paper, we explore potential parallels between the evolution of users' interactions with visualization tools during data exploration and assumptions made in popular online learning techniques. Through a series of empirical analyses, we seek to answer the question: how might users' exploration behavior evolve in response to what they have learned from the data during EVA? We present our findings and discuss their implications for the future of user modeling for system design.

Submitted 14 December, 2023; originally announced December 2023.

arXiv:2109.07127 [pdf, ps, other]

A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

Authors: Ga Young Lee, Lubna Alzamil, Bakhtiyar Doskenov, Arash Termehchy

Abstract: Data cleaning is the initial stage of any machine learning project and is one of the most critical processes in data analysis. It is a critical step in ensuring that the dataset is devoid of incorrect or erroneous data. It can be done manually with data wrangling tools, or it can be completed automatically with a computer program. Data cleaning entails a slew of procedures that, once done, make the data ready for analysis. Given its significance in numerous fields, there is a growing interest in the development of efficient and effective data cleaning frameworks. In this survey, some of the most recent advancements of data cleaning approaches are examined for their effectiveness and the future research directions are suggested to close the gap in each of the methods.

Submitted 15 September, 2021; originally announced September 2021.

arXiv:2004.02308 [pdf, other]

Learning Over Dirty Data Without Cleaning

Authors: Jose Picado, John Davis, Arash Termehchy, Ga Young Lee

Abstract: Real-world datasets are dirty and contain many errors. Examples of these issues are violations of integrity constraints, duplicates, and inconsistencies in representing data values and entities. Learning over dirty databases may result in inaccurate models. Users have to spend a great deal of time and effort to repair data errors and create a clean database for learning. Moreover, as the information required to repair these errors is not often available, there may be numerous possible clean versions for a dirty database. We propose DLearn, a novel relational learning system that learns directly over dirty databases effectively and efficiently without any preprocessing. DLearn leverages database constraints to learn accurate relational models over inconsistent and heterogeneous data. Its learned models represent patterns over all possible clean instances of the data in a usable form. Our empirical study indicates that DLearn learns accurate models over large real-world databases efficiently.

Submitted 5 April, 2020; originally announced April 2020.

Comments: To be published in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD'20)

arXiv:1911.11184 [pdf, other]

Managing Variability in Relational Databases by VDBMS

Authors: Parisa Ataei, Qiaoran Li, Eric Walkingshaw, Arash Termehchy

Abstract: Variability inherently exists in databases in various contexts, which creates database variants. For example, variants of a database could have different schemas/content (database evolution problem), variants of a database could stem from different sources (data integration problem), variants of a database could be deployed differently for a specific application domain (deploying a database for different configurations of a software system), etc. Unfortunately, while there are specific solutions to each of the problems arising in these contexts, there is no general solution that accounts for variability in databases and addresses managing variability within a database. In this paper, we formally define variational databases (VDBs) and a statically-typed variational relational algebra (VRA) to query VDBs; both the database and the queries explicitly account for variation. We also design and implement a variational database management system (VDBMS) to run variational queries over a VDB effectively and efficiently. To assess this, we generate two VDBs from real-world databases in the context of software development and database evolution with a set of experimental queries for each.

Submitted 25 November, 2019; originally announced November 2019.

Comments: 15 pages, 11 figures

arXiv:1910.10263 [pdf, other]

Integrating Information About Entities Progressively

Authors: Ben McCamish, Christopher Buss, Arash Termehchy, David Maier

Abstract: Users often have to integrate information about entities from multiple data sources. This task is challenging as each data source may represent information about the same entity in a distinct form, e.g., each data source may use a different name for the same person. Currently, data from different representations are translated into a unified one via lengthy and costly expert attention and tuning. Such methods cannot scale to the rapidly increasing number and variety of available data sources. We demonstrate ProgMap, an entity-matching framework in which data sources learn to collaborate and integrate information about entities on-demand and with minimal expert intervention. The data sources leverage user feedback to improve the accuracy of their collaboration and results. ProgMap also has techniques to reduce the amount of required user feedback to achieve effective matchings.

Submitted 22 October, 2019; originally announced October 2019.

Comments: demonstration

arXiv:1710.01420 [pdf, other]

Usable & Scalable Learning Over Relational Data With Automatic Language Bias

Authors: Jose Picado, Arash Termehchy, Sudhanshu Pathak, Alan Fern, Praveen Ilango, Yunqiao Cai

Abstract: Relational databases are valuable resources for learning novel and interesting relations and concepts. In order to constrain the search through the large space of candidate definitions, users must tune the algorithm by specifying a language bias. Unfortunately, specifying the language bias is done via trial and error and is guided by the expert's intuitions. We propose AutoBias, a system that leverages information in the schema and content of the database to automatically induce the language bias used by popular relational learning systems. We show that AutoBias delivers the same accuracy as using manually-written language bias by imposing only a slight overhead on the running time of the learning algorithm.

Submitted 6 April, 2020; v1 submitted 3 October, 2017; originally announced October 2017.

arXiv:1603.04068 [pdf, other]

A Signaling Game Approach to Databases Querying and Interaction

Authors: Ben McCamish, Vahid Ghadakchi, Arash Termehchy, Behrouz Touri

Abstract: As most users do not precisely know the structure and/or the content of databases, their queries do not exactly reflect their information needs. The database management systems (DBMS) may interact with users and use their feedback on the returned results to learn the information needs behind their queries. Current query interfaces assume that users do not learn and modify the way they express their information needs in the form of queries during their interaction with the DBMS. Using a real-world interaction workload, we show that users learn and modify how to express their information needs during their interactions with the DBMS and their learning is accurately modeled by a well-known reinforcement learning mechanism. As current data interaction systems assume that users do not modify their strategies, they cannot discover the information needs behind users' queries effectively. We model the interaction between users and DBMS as a game with identical interest between two rational agents whose goal is to establish a common language for representing information needs in the form of queries. We propose a reinforcement learning method that learns and answers the information needs behind queries and adapts to the changes in users' strategies and prove that it improves the effectiveness of answering queries stochastically speaking. We propose two efficient implementations of this method over large relational databases. Our extensive empirical studies over real-world query workloads indicate that our algorithms are efficient and effective.

Submitted 4 May, 2018; v1 submitted 13 March, 2016; originally announced March 2016.

Comments: 21 pages

arXiv:1508.03846 [pdf, other]

Schema Independent Relational Learning

Authors: Jose Picado, Arash Termehchy, Alan Fern, Parisa Ataei

Abstract: Learning novel concepts and relations from relational databases is an important problem with many applications in database systems and machine learning. Relational learning algorithms learn the definition of a new relation in terms of existing relations in the database. Nevertheless, the same data set may be represented under different schemas for various reasons, such as efficiency, data quality, and usability. Unfortunately, the output of current relational learning algorithms tends to vary quite substantially over the choice of schema, both in terms of learning accuracy and efficiency. This variation complicates their off-the-shelf application. In this paper, we introduce and formalize the property of schema independence of relational learning algorithms, and study both the theoretical and empirical dependence of existing algorithms on the common class of (de)composition schema transformations. We study both sample-based learning algorithms, which learn from sets of labeled examples, and query-based algorithms, which learn by asking queries to an oracle. We prove that current relational learning algorithms are generally not schema independent. For query-based learning algorithms we show that the (de)composition transformations influence their query complexity. We propose Castor, a sample-based relational learning algorithm that achieves schema independence by leveraging data dependencies.
We support the theoretical results with an empirical study that demonstrates the schema dependence/independence of several algorithms on existing benchmark and real-world datasets under (de)compositions.

Submitted 6 November, 2017; v1 submitted 16 August, 2015; originally announced August 2015.

arXiv:1508.03763 [pdf, other]

Structural Generalizability: The Case of Similarity Search

Authors: Yodsawalai Chodpathumwan, Arash Termehchy, Stephen A. Ramsey, Aayam Shresta, Amy Glen, Zheng Liu

Abstract: Graph similarity search algorithms usually leverage the structural properties of a database. Hence, these algorithms are effective only on some structural variations of the data and are ineffective on other forms, which makes them hard to use. Ideally, one would like to design a data analytics algorithm that is structurally robust, i.e., it returns essentially the same accurate results over all possible structural variations of a dataset. We propose a novel approach to create a structurally robust similarity search algorithm over graph databases. We leverage the classic insight in the database literature that schematic variations are caused by having constraints in the database. We then present the RelSim algorithm, which is provably structurally robust under these variations. Our empirical studies show that our proposed algorithms are structurally robust while being efficient and as effective as or more effective than the state-of-the-art similarity search algorithms.

Submitted 31 March, 2021; v1 submitted 15 August, 2015; originally announced August 2015.

arXiv:1503.05656 [pdf, other]

Cost-Effective Conceptual Design Using Taxonomies

Authors: Ali Vakilian, Yodsawalai Chodpathumwan, Arash Termehchy, Amir Nayyeri

Abstract: It is known that annotating named entities in unstructured and semi-structured data sets by their concepts improves the effectiveness of answering queries over these data sets. As every enterprise has a limited budget of time or computational resources, it has to annotate a subset of concepts in a given domain whose costs of annotation do not exceed the budget. We call such a subset of concepts a conceptual design for the annotated data set. We focus on finding a conceptual design that provides the most effective answers to queries over the annotated data set, i.e., a cost-effective conceptual design. Since it is often less time-consuming and costly to annotate general concepts than specific concepts, we use information on superclass/subclass relationships between concepts in taxonomies to find a cost-effective conceptual design. We quantify the amount by which a conceptual design with concepts from a taxonomy improves the effectiveness of answering queries over an annotated data set. If the taxonomy is a tree, we prove that the problem is NP-hard and propose efficient approximation and pseudo-polynomial time algorithms for the problem. We further prove that if the taxonomy is a directed acyclic graph, given some generally accepted hypothesis, it is not possible to find any approximation algorithm with reasonably small approximation ratio for the problem.
Our empirical study using real-world data sets, taxonomies, and query workloads shows that our framework effectively quantifies the amount by which a conceptual design improves the effectiveness of answering queries. It also indicates that our algorithms are efficient for a design-time task, with the pseudo-polynomial algorithm being generally more effective than the approximation algorithm.

Submitted 6 January, 2018; v1 submitted 19 March, 2015; originally announced March 2015.

arXiv:1409.2553 [pdf, other]

Representation Independent Analytics Over Structured Data

Authors: Yodsawalai Chodpathumwan, Jose Picado, Arash Termehchy, Alan Fern, Yizhou Sun

Abstract: Database analytics algorithms leverage quantifiable structural properties of the data to predict interesting concepts and relationships. The same information, however, can be represented using many different structures and the structural properties observed over particular representations do not necessarily hold for alternative structures. Thus, there is no guarantee that current database analytics algorithms will still provide the correct insights, no matter what structures are chosen to organize the database. Because these algorithms tend to be highly effective over some choices of structure, such as that of the databases used to validate them, but not so effective with others, database analytics has largely remained the province of experts who can find the desired forms for these algorithms. We argue that in order to make database analytics usable, we should use or develop algorithms that are effective over a wide range of choices of structural organizations. We introduce the notion of representation independence, study its fundamental properties for a wide range of data analytics algorithms, and empirically analyze the amount of representation independence of some popular database analytics algorithms. Our results indicate that most algorithms are not generally representation independent and find the characteristics of more representation independent heuristics under certain representational shifts.

Submitted 8 September, 2014; originally announced September 2014.

Showing 1–14 of 14 results for author: Termehchy, A