Intervention and Conditioning in Causal Bayesian Networks

Authors: Sainyam Galhotra, Joseph Y. Halpern

Abstract: Causal models are crucial for understanding complex systems and identifying causal relationships among variables. Even though causal models are extremely popular, calculating conditional probabilities of formulas involving interventions poses significant challenges. In the case of Causal Bayesian Networks (CBNs), Pearl assumes autonomy of mechanisms that determine interventions to calculate a range of probabilities. We show that by making simple yet often realistic independence assumptions, it is possible to uniquely estimate the probability of an interventional formula (including the well-studied notions of probability of sufficiency and necessity). We discuss when these assumptions are appropriate. Importantly, in many cases of interest, when the assumptions are appropriate, these probability estimates can be evaluated using observational data, which carries immense significance in scenarios where conducting experiments is impractical or infeasible.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2404.04713 [pdf, other]

doi 10.1145/3654940

Faster Algorithms for Fair Max-Min Diversification in $\mathbb{R}^d$

Authors: Yash Kurkure, Miles Shamo, Joseph Wiseman, Sainyam Galhotra, Stavros Sintos

Abstract: The task of extracting a diverse subset from a dataset, often referred to as maximum diversification, plays a pivotal role in various real-world applications that have far-reaching consequences. In this work, we delve into the realm of fairness-aware data subset selection, specifically focusing on the problem of selecting a diverse set of size $k$ from a large collection of $n$ data points (FairDiv). The FairDiv problem is well-studied in the data management and theory community. In this work, we develop the first constant approximation algorithm for FairDiv that runs in near-linear time using only linear space. In contrast, all previously known constant approximation algorithms run in super-linear time (with respect to $n$ or $k$) and use super-linear space. Our approach achieves this efficiency by employing a novel combination of the Multiplicative Weight Update method and advanced geometric data structures to implicitly and approximately solve a linear program. Furthermore, we improve the efficiency of our techniques by constructing a coreset. Using our coreset, we also propose the first efficient streaming algorithm for the FairDiv problem whose efficiency does not depend on the distribution of data points. Empirical evaluation on million-sized datasets demonstrates that our algorithm achieves the best diversity within a minute. All prior techniques are either highly inefficient or do not generate a good solution.

Submitted 14 May, 2024; v1 submitted 6 April, 2024; originally announced April 2024.

Journal ref: SIGMOD 2024

arXiv:2310.17843 [pdf, other]

A Data-Centric Online Market for Machine Learning: From Discovery to Pricing

Authors: Minbiao Han, Jonathan Light, Steven Xia, Sainyam Galhotra, Raul Castro Fernandez, Haifeng Xu

Abstract: Data fuels machine learning (ML) - rich and high-quality training data is essential to the success of ML. However, to transform ML from the race among a few large corporations to an accessible technology that serves numerous normal users' data analysis requests, there still exist important challenges. One gap we observed is that many ML users can benefit from new data that other data owners possess, whereas these data owners sit on piles of data without knowing who can benefit from it. This gap creates the opportunity for building an online market that can automatically connect supply with demand. While online matching markets are prevalent (e.g., ride-hailing systems), designing a data-centric market for ML exhibits many unprecedented challenges. This paper develops new techniques to tackle two core challenges in designing such a market: (a) to efficiently match demand with supply, we design an algorithm to automatically discover useful data for any ML task from a pool of thousands of datasets, achieving high-quality matching between ML models and data; (b) to encourage market participation of ML users without much ML expertise, we design a new pricing mechanism for selling data-augmented ML models. Furthermore, our market is designed to be API-compatible with existing online ML markets like Vertex AI and Sagemaker, making it easy to use while providing better results due to joint data and model search. We envision that the synergy of our data and model discovery algorithm and pricing mechanism will be an important step towards building a new data-centric online market that serves ML users effectively.

Submitted 26 October, 2023; originally announced October 2023.

arXiv:2304.09068 [pdf, other]

METAM: Goal-Oriented Data Discovery

Authors: Sainyam Galhotra, Yue Gong, Raul Castro Fernandez

Abstract: Data is a central component of machine learning and causal inference tasks. The availability of large amounts of data from sources such as open data repositories, data lakes and data marketplaces creates an opportunity to augment data and boost those tasks' performance. However, augmentation techniques rely on a user manually discovering and shortlisting useful candidate augmentations. Existing solutions do not leverage the synergy between discovery and augmentation, thus underexploiting data. In this paper, we introduce METAM, a novel goal-oriented framework that queries the downstream task with a candidate dataset, forming a feedback loop that automatically steers the discovery and augmentation process. To select candidates efficiently, METAM leverages properties of the: i) data, ii) utility function, and iii) solution set size. We show METAM's theoretical guarantees and demonstrate those empirically on a broad set of tasks. All in all, we demonstrate the promise of goal-oriented data discovery to modern data science applications.

Submitted 18 April, 2023; originally announced April 2023.

Comments: ICDE 2023 paper

arXiv:2303.01378 [pdf, other]

A Vision for Semantically Enriched Data Science

Authors: Udayan Khurana, Kavitha Srinivas, Sainyam Galhotra, Horst Samulowitz

Abstract: Recent efforts in the automation of machine learning and data science have achieved success in various tasks such as hyper-parameter optimization or model selection. However, key areas such as utilizing domain knowledge and data semantics have seen little automation. Data scientists have long leveraged common sense reasoning and domain knowledge to understand and enrich data for building predictive models. In this paper we discuss important shortcomings of current data science and machine learning solutions. We then envision how leveraging "semantic" understanding and reasoning on data in combination with novel tools for data science automation can help with consistent and explainable data augmentation and transformation. Additionally, we discuss how semantics can assist data scientists in a new manner by helping with challenges related to trust, bias, and explainability in machine learning. Semantic annotation can also help better explore and organize large data sources.

Submitted 2 March, 2023; originally announced March 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2205.08018

arXiv:2212.10839 [pdf, other]

Consistent Range Approximation for Fair Predictive Modeling

Authors: Jiongli Zhu, Sainyam Galhotra, Nazanin Sabri, Babak Salimi

Abstract: This paper proposes a novel framework for certifying the fairness of predictive models trained on biased data. It draws from query answering for incomplete and inconsistent databases to formulate the problem of consistent range approximation (CRA) of fairness queries for a predictive model on a target population. The framework employs background knowledge of the data collection process and biased data, working with or without limited statistics about the target population, to compute a range of answers for fairness queries. Using CRA, the framework builds predictive models that are certifiably fair on the target population, regardless of the availability of external data during training. The framework's efficacy is demonstrated through evaluations on real data, showing substantial improvement over existing state-of-the-art methods.

Submitted 28 July, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

arXiv:2206.11303 [pdf, other]

Community Recovery in the Geometric Block Model

Authors: Sainyam Galhotra, Arya Mazumdar, Soumyabrata Pal, Barna Saha

Abstract: To capture the inherent geometric features of many community detection problems, we propose to use a new random graph model of communities that we call a Geometric Block Model. The geometric block model builds on the random geometric graphs (Gilbert, 1961), one of the basic models of random graphs for spatial networks, in the same way that the well-studied stochastic block model builds on the Erdős-Rényi random graphs. It is also a natural extension of random community models inspired by the recent theoretical and practical advancements in community detection. To analyze the geometric block model, we first provide new connectivity results for random annulus graphs which are generalizations of random geometric graphs. The connectivity properties of geometric graphs have been studied since their introduction, and analyzing them has been more difficult than their Erdős-Rényi counterparts due to correlated edge formation. We then use the connectivity results of random annulus graphs to provide necessary and sufficient conditions for efficient recovery of communities for the geometric block model. We show that a simple triangle-counting algorithm to detect communities in the geometric block model is near-optimal. For this we consider the following two regimes of graph density. In the regime where the average degree of the graph grows logarithmically with the number of vertices, we show that our algorithm performs extremely well, both theoretically and practically. In contrast, the triangle-counting algorithm is far from being optimum for the stochastic block model in the logarithmic degree regime. We simulate our results on both real and synthetic datasets to show superior performance of both the new model as well as our algorithm.

Submitted 17 November, 2023; v1 submitted 22 June, 2022; originally announced June 2022.

Comments: 53 pages, 18 figures. Accepted at the Journal of Machine Learning Research (JMLR). Shorter versions accepted in AAAI 2018 (see arXiv:1709.05510) and RANDOM 2019 (see arXiv:1804.05013). arXiv admin note: text overlap with arXiv:1804.05013

Journal ref: Journal of Machine Learning Research (JMLR) 2023

arXiv:2203.14692 [pdf, other]

HypeR: Hypothetical Reasoning With What-If and How-To Queries Using a Probabilistic Causal Approach

Authors: Sainyam Galhotra, Amir Gilad, Sudeepa Roy, Babak Salimi

Abstract: What-if (provisioning for an update to a database) and how-to (how to modify the database to achieve a goal) analyses provide insights to users who wish to examine hypothetical scenarios without making actual changes to a database and thereby help plan strategies in their fields. Typically, such analyses are done by testing the effect of an update in the existing database on a specific view created by a query of interest. In real-world scenarios, however, an update to a particular part of the database may affect tuples and attributes in a completely different part due to implicit semantic dependencies. To allow for hypothetical reasoning while accommodating such dependencies, we develop HypeR, a framework that supports what-if and how-to queries accounting for probabilistic dependencies among attributes captured by a probabilistic causal model. We extend the SQL syntax to include the necessary operators for expressing these hypothetical queries, define their semantics, devise efficient algorithms and optimizations to compute their results using concepts from causality and probabilistic databases, and evaluate the effectiveness of our approach experimentally.

Submitted 28 March, 2022; originally announced March 2022.

Comments: Full version of the SIGMOD 2022 paper with the same title

arXiv:2106.01543 [pdf, other]

Ver: View Discovery in the Wild

Authors: Yue Gong, Zhiru Zhu, Sainyam Galhotra, Raul Castro Fernandez

Abstract: We present Ver, a data discovery system that identifies project-join views over large repositories of tables that do not contain join path information, even when input queries are inaccurate. Ver implements a reference architecture to solve both the technical (scale and search) and human (semantic ambiguity, navigating a large number of results) problems of view discovery. We demonstrate with a user study that users find the view they want when using Ver, and we demonstrate its performance with large-scale end-to-end experiments on real-world datasets containing tens of millions of join paths.

Submitted 4 October, 2022; v1 submitted 2 June, 2021; originally announced June 2021.

arXiv:2105.06058 [pdf, other]

DataExposer: Exposing Disconnect between Data and Systems

Authors: Sainyam Galhotra, Anna Fariha, Raoni Lourenço, Juliana Freire, Alexandra Meliou, Divesh Srivastava

Abstract: As data is a central component of many modern systems, the cause of a system malfunction may reside in the data, and, specifically, particular properties of the data. For example, a health-monitoring system that is designed under the assumption that weight is reported in imperial units (lbs) will malfunction when encountering weight reported in metric units (kilograms). Similar to software debugging, which aims to find bugs in the mechanism (source code or runtime conditions), our goal is to debug the data to identify potential sources of disconnect between the assumptions about the data and the systems that operate on that data. Specifically, we seek which properties of the data cause a data-driven system to malfunction. We propose DataExposer, a framework to identify data properties, called profiles, that are the root causes of performance degradation or failure of a system that operates on the data. Such identification is necessary to repair the system and resolve the disconnect between data and system. Our technique is based on causal reasoning through interventions: when a system malfunctions for a dataset, DataExposer alters the data profiles and observes changes in the system's behavior due to the alteration. Unlike statistical observational analysis that reports mere correlations, DataExposer reports causally verified root causes, in terms of data profiles, of the system malfunction. We empirically evaluate DataExposer on three real-world and several synthetic data-driven systems that fail on datasets due to a diverse set of reasons. In all cases, DataExposer identifies the root causes precisely while requiring orders of magnitude fewer interventions than prior techniques.

Submitted 12 May, 2021; originally announced May 2021.

arXiv:2105.05782 [pdf, other]

How to Design Robust Algorithms using Noisy Comparison Oracle

Authors: Raghavendra Addanki, Sainyam Galhotra, Barna Saha

Abstract: Metric based comparison operations such as finding maximum, nearest and farthest neighbor are fundamental to studying various clustering techniques such as $k$-center clustering and agglomerative hierarchical clustering. These techniques crucially rely on accurate estimation of pairwise distance between records. However, computing exact features of the records, and their pairwise distances, is often challenging, and sometimes not possible. We circumvent this challenge by leveraging weak supervision in the form of a comparison oracle that compares the relative distance between the queried points such as `Is point u closer to v or w closer to x?'. However, it is possible that some queries are easier to answer than others using a comparison oracle. We capture this by introducing two different noise models called adversarial and probabilistic noise. In this paper, we study various problems that include finding maximum, nearest/farthest neighbor search under these noise models. Building upon the techniques we develop for these comparison operations, we give robust algorithms for $k$-center clustering and agglomerative hierarchical clustering. We prove that our algorithms achieve good approximation guarantees with high probability and analyze their query complexity. We evaluate the effectiveness and efficiency of our techniques empirically on various real-world datasets.

Submitted 12 May, 2021; originally announced May 2021.

Comments: PVLDB 2021

arXiv:2103.11972 [pdf, other]

Explaining Black-Box Algorithms Using Probabilistic Contrastive Counterfactuals

Authors: Sainyam Galhotra, Romila Pradhan, Babak Salimi

Abstract: There has been a recent resurgence of interest in explainable artificial intelligence (XAI) that aims to reduce the opaqueness of AI-based decision-making systems, allowing humans to scrutinize and trust them. Prior work in this context has focused on the attribution of responsibility for an algorithm's decisions to its inputs wherein responsibility is typically approached as a purely associational concept. In this paper, we propose a principled causality-based approach for explaining black-box decision-making systems that addresses limitations of existing methods in XAI. At the core of our framework lie probabilistic contrastive counterfactuals, a concept that can be traced back to philosophical, cognitive, and social foundations of theories on how humans generate and select explanations. We show how such counterfactuals can quantify the direct and indirect influences of a variable on decisions made by an algorithm, and provide actionable recourse for individuals negatively affected by the algorithm's decision. Unlike prior work, our system, LEWIS: (1) can compute provably effective explanations and recourse at local, global and contextual levels, (2) is designed to work with users with varying levels of background knowledge of the underlying causal model, and (3) makes no assumptions about the internals of an algorithmic system except for the availability of its input-output data. We empirically evaluate LEWIS on three real-world datasets and show that it generates human-understandable explanations that improve upon state-of-the-art approaches in XAI, including the popular LIME and SHAP. Experiments on synthetic data further demonstrate the correctness of LEWIS's explanations and the scalability of its recourse algorithm.

Submitted 23 June, 2021; v1 submitted 22 March, 2021; originally announced March 2021.

Comments: Proceedings of the 2021 International Conference on Management of Data. ACM, 2021

arXiv:2102.03977 [pdf, other]

Learning to Generate Fair Clusters from Demonstrations

Authors: Sainyam Galhotra, Sandhya Saisubramanian, Shlomo Zilberstein

Abstract: Fair clustering is the process of grouping similar entities together, while satisfying a mathematically well-defined fairness metric as a constraint. Due to the practical challenges in precise model specification, the prescribed fairness constraints are often incomplete and act as proxies to the intended fairness requirement, leading to biased outcomes when the system is deployed. We examine how to identify the intended fairness constraint for a problem based on limited demonstrations from an expert. Each demonstration is a clustering over a subset of the data. We present an algorithm to identify the fairness metric from demonstrations and generate clusters using existing off-the-shelf clustering techniques, and analyze its theoretical properties. To extend our approach to novel fairness metrics for which clustering algorithms do not currently exist, we present a greedy method for clustering. Additionally, we investigate how to generate interpretable solutions using our approach. Empirical evaluation on three real-world datasets demonstrates the effectiveness of our approach in quickly identifying the underlying fairness and interpretability constraints, which are then used to generate fair and interpretable clusters.

Submitted 7 February, 2021; originally announced February 2021.

arXiv:2012.08594 [pdf, other]

Semantic Annotation for Tabular Data

Authors: Udayan Khurana, Sainyam Galhotra

Abstract: Detecting the semantic concept of columns in tabular data is of particular interest to many applications ranging from data integration, cleaning, and search to feature engineering and model building in machine learning. Recently, several works have proposed supervised learning-based or heuristic pattern-based approaches to semantic type annotation. Both have shortcomings that prevent them from generalizing over a large number of concepts or examples. Many neural network based methods also present scalability issues. Additionally, none of the known methods works well for numerical data. We propose $C^2$, a column to concept mapper that is based on a maximum likelihood estimation approach through ensembles. It is able to effectively utilize vast amounts of, albeit somewhat noisy, openly available table corpora in addition to two popular knowledge graphs to perform effective and efficient concept prediction for structured data. We demonstrate the effectiveness of $C^2$ over available techniques on 9 datasets, the most comprehensive comparison on this topic so far.

Submitted 15 December, 2020; originally announced December 2020.

arXiv:2006.06053 [pdf, other]

Causal Feature Selection for Algorithmic Fairness

Authors: Sainyam Galhotra, Karthikeyan Shanmugam, Prasanna Sattigeri, Kush R. Varshney

Abstract: The use of machine learning (ML) in high-stakes societal decisions has encouraged the consideration of fairness throughout the ML lifecycle. Although data integration is one of the primary steps to generate high quality training data, most of the fairness literature ignores this stage. In this work, we consider fairness in the integration component of data management, aiming to identify features that improve prediction without adding any bias to the dataset. We work under the causal interventional fairness paradigm. Without requiring the underlying structural causal model a priori, we propose an approach to identify a sub-collection of features that ensure the fairness of the dataset by performing conditional independence tests between different subsets of features. We use group testing to improve the complexity of the approach. We theoretically prove the correctness of the proposed algorithm to identify features that ensure interventional fairness and show that sub-linear conditional independence tests are sufficient to identify these variables. A detailed empirical evaluation is performed on real-world datasets to demonstrate the efficacy and efficiency of our technique.

Submitted 31 March, 2022; v1 submitted 10 June, 2020; originally announced June 2020.

Comments: Full version of the paper at SIGMOD 2022

arXiv:2005.14326 [pdf, other]

doi 10.1007/s00778-021-00656-7

Efficient and Effective ER with Progressive Blocking

Authors: Sainyam Galhotra, Donatella Firmani, Barna Saha, Divesh Srivastava

Abstract: Blocking is a mechanism to improve the efficiency of Entity Resolution (ER) which aims to quickly prune out all non-matching record pairs. However, depending on the distributions of entity cluster sizes, existing techniques can be either (a) too aggressive, such that they help scale but can adversely affect the ER effectiveness, or (b) too permissive, potentially harming ER efficiency. In this paper, we propose a new methodology of progressive blocking (pBlocking) to enable both efficient and effective ER, which works seamlessly across different entity cluster size distributions. pBlocking is based on the insight that the effectiveness-efficiency trade-off is revealed only when the output of ER starts to be available. Hence, pBlocking leverages partial ER output in a feedback loop to refine the blocking result in a data-driven fashion. Specifically, we bootstrap pBlocking with traditional blocking methods and progressively improve the building and scoring of blocks until we get the desired trade-off, leveraging a limited amount of ER results as guidance at every round. We formally prove that pBlocking converges efficiently ($O(n \log^2 n)$ time complexity, where $n$ is the total number of records). Our experiments show that incorporating partial ER output in a feedback loop can improve the efficiency and effectiveness of blocking by 5x and 60% respectively, improving the overall F-score of the entire ER process up to 60%.

Submitted 16 March, 2021; v1 submitted 28 May, 2020; originally announced May 2020.

Comments: Galhotra, S., Firmani, D., Saha, B. et al. Efficient and effective ER with progressive blocking. The VLDB Journal (2021)

arXiv:2005.06133 [pdf, other]

Adaptive Rule Discovery for Labeling Text Data

Authors: Sainyam Galhotra, Behzad Golshan, Wang-Chiew Tan

Abstract: Creating and collecting labeled data is one of the major bottlenecks in machine learning pipelines and the emergence of automated feature generation techniques such as deep learning, which typically requires a lot of training data, has further exacerbated the problem. While weak-supervision techniques have circumvented this bottleneck, existing frameworks either require users to write a set of diverse, high-quality rules to label data (e.g., Snorkel), or require a labeled subset of the data to automatically mine rules (e.g., Snuba). The process of manually writing rules can be tedious and time consuming. At the same time, creating a labeled subset of the data can be costly and even infeasible in imbalanced settings. This is due to the fact that a random sample in imbalanced settings often contains only a few positive instances. To address these shortcomings, we present Darwin, an interactive system designed to alleviate the task of writing rules for labeling text data in weakly-supervised settings. Given an initial labeling rule, Darwin automatically generates a set of candidate rules for the labeling task at hand, and utilizes the annotator's feedback to adapt the candidate rules. We describe how Darwin is scalable and versatile. It can operate over large text corpora (i.e., more than 1 million sentences) and supports a wide range of labeling functions (i.e., any function that can be specified using a context free grammar). Finally, we demonstrate with a suite of experiments over five real-world datasets that Darwin enables annotators to generate weakly-supervised labels efficiently and with a small cost. In fact, our experiments show that rules discovered by Darwin on average identify 40% more positive instances compared to Snuba even when it is provided with 1000 labeled instances.

Submitted 12 May, 2020; originally announced May 2020.

arXiv:2002.03508 [pdf, other]

Fair Correlation Clustering

Authors: Saba Ahmadi, Sainyam Galhotra, Barna Saha, Roy Schwartz

Abstract: In this paper we study the problem of correlation clustering under fairness constraints. In the classic correlation clustering problem, we are given a complete graph where each edge is labeled positive or negative. The goal is to obtain a clustering of the vertices that minimizes disagreements -- the number of negative edges trapped inside a cluster plus positive edges between different clusters. We consider two variations of fairness constraint for the problem of correlation clustering where each node has a color, and the goal is to form clusters that do not over-represent vertices of any color. The first variant aims to generate clusters with minimum disagreements, where the distribution of a feature (e.g. gender) in each cluster is the same as the global distribution. For the case of two colors when the desired ratio of the number of colors in each cluster is $1:p$, we get an $\mathcal{O}(p^2)$-approximation algorithm. Our algorithm could be extended to the case of multiple colors. We prove this problem is NP-hard. The second variant considers relative upper and lower bounds on the number of nodes of any color in a cluster. The goal is to avoid violating upper and lower bounds corresponding to each color in each cluster while minimizing the total number of disagreements. Along with our theoretical results, we show the effectiveness of our algorithm to generate fair clusters by empirical evaluation on real-world data sets.

Submitted 9 February, 2020; originally announced February 2020.
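
The disagreement objective described in the abstract above can be illustrated with a minimal sketch. This is not the authors' algorithm, only a direct count of the objective for a given clustering; the function name `count_disagreements`, the edge-label encoding, and the toy graph are assumptions made for illustration.

```python
from itertools import combinations


def count_disagreements(labels, cluster_of):
    """Count correlation-clustering disagreements for a given clustering.

    labels: dict mapping frozenset({u, v}) to '+' or '-' for every vertex pair
            (hypothetical encoding of a complete signed graph).
    cluster_of: dict mapping each vertex to its cluster id.
    """
    total = 0
    for u, v in combinations(sorted(cluster_of), 2):
        sign = labels[frozenset((u, v))]
        same_cluster = cluster_of[u] == cluster_of[v]
        # A negative edge inside a cluster, or a positive edge across
        # clusters, counts as one disagreement.
        if (sign == '-' and same_cluster) or (sign == '+' and not same_cluster):
            total += 1
    return total


# Toy complete graph on three vertices (illustrative data only).
labels = {
    frozenset(('a', 'b')): '+',
    frozenset(('a', 'c')): '-',
    frozenset(('b', 'c')): '+',
}
print(count_disagreements(labels, {'a': 0, 'b': 0, 'c': 0}))
```

In this toy graph no clustering reaches zero disagreements, because the two positive edges pull all three vertices together while the negative edge pulls `a` and `c` apart.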

arXiv:1912.07820 [pdf, other]

Balancing the Tradeoff Between Clustering Value and Interpretability

Authors: Sandhya Saisubramanian, Sainyam Galhotra, Shlomo Zilberstein

Abstract: Graph clustering groups entities -- the vertices of a graph -- based on their similarity, typically using a complex distance function over a large number of features. Successful integration of clustering approaches in automated decision-support systems hinges on the interpretability of the resulting clusters. This paper addresses the problem of generating interpretable clusters, given features of interest that signify interpretability to an end-user, by optimizing interpretability in addition to common clustering objectives. We propose a $β$-interpretable clustering algorithm that ensures that at least a $β$ fraction of nodes in each cluster share the same feature value. The tunable parameter $β$ is user-specified. We also present a more efficient algorithm for scenarios with $β = 1$ and analyze the theoretical guarantees of the two algorithms. Finally, we empirically demonstrate the benefits of our approaches in generating interpretable clusters using four real-world datasets. The interpretability of the clusters is complemented by generating simple explanations denoting the feature values of the nodes in the clusters, using frequent pattern mining.

Submitted 30 January, 2020; v1 submitted 17 December, 2019; originally announced December 2019.

Comments: Accepted at AIES 2020
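
The $β$-interpretability condition stated in the abstract above (at least a $β$ fraction of nodes in each cluster share the same feature value) can be checked with a short sketch. This only verifies the condition for a given clustering and is not the paper's clustering algorithm; the function name `is_beta_interpretable`, the single-feature encoding, and the toy data are assumptions.

```python
from collections import Counter


def is_beta_interpretable(clusters, feature, beta):
    """Return True if every non-empty cluster has a feature value shared
    by at least a beta fraction of its nodes.

    clusters: list of lists of node ids (hypothetical representation).
    feature: dict mapping node id to its value for the feature of interest.
    beta: required fraction, between 0 and 1.
    """
    for nodes in clusters:
        if not nodes:
            continue  # an empty cluster imposes no constraint in this sketch
        counts = Counter(feature[n] for n in nodes)
        most_common_count = counts.most_common(1)[0][1]
        if most_common_count < beta * len(nodes):
            return False
    return True


# Toy clustering: the first cluster has 2 of 3 nodes sharing value 'x'.
clusters = [[1, 2, 3], [4, 5]]
feature = {1: 'x', 2: 'x', 3: 'y', 4: 'z', 5: 'z'}
print(is_beta_interpretable(clusters, feature, 0.6))
print(is_beta_interpretable(clusters, feature, 1.0))
```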

arXiv:1907.00117 [pdf, ps, other]

Min-Max Correlation Clustering via MultiCut

Authors: Saba Ahmadi, Sainyam Galhotra, Samir Khuller, Barna Saha, Roy Schwartz

Abstract: Correlation clustering is a fundamental combinatorial optimization problem arising in many contexts and applications that has been the subject of dozens of papers in the literature. In this problem we are given a general weighted graph where each edge is labeled positive or negative. The goal is to obtain a partitioning (clustering) of the vertices that minimizes disagreements - weight of negative edges trapped inside a cluster plus positive edges between different clusters. Most of the papers on this topic mainly focus on minimizing total disagreement, a global objective for this problem. In this paper, we study a cluster-wise objective function that asks to minimize the maximum number of disagreements of each cluster, which we call min-max correlation clustering. The min-max objective is a natural objective that respects the quality of every cluster. In this paper, we provide the first nontrivial approximation algorithm for this problem achieving an $\mathcal{O}(\sqrt{\log n\cdot\max\{\log(|E^-|),\log(k)\}})$ approximation for general weighted graphs, where $|E^-|$ denotes the number of negative edges and $k$ is the number of clusters in the optimum solution. To do so, we also obtain a corresponding result for multicut where we wish to find a multicut solution while trying to minimize the total weight of cut edges on every component. The results are then further improved to obtain (i) an $\mathcal{O}(r^2)$-approximation for min-max correlation clustering and min-max multicut for graphs that exclude $K_{r,r}$ minors and (ii) a 14-approximation for the min-max correlation clustering on complete graphs.

Submitted 28 June, 2019; originally announced July 2019.

arXiv:1903.00750 [pdf, other]

Lexicographically Ordered Multi-Objective Clustering

Authors: Sainyam Galhotra, Sandhya Saisubramanian, Shlomo Zilberstein

Abstract: We introduce a rich model for multi-objective clustering with lexicographic ordering over objectives and a slack. The slack denotes the allowed multiplicative deviation from the optimal objective value of the higher priority objective to facilitate improvement in lower-priority objectives. We then propose an algorithm called Zeus to solve this class of problems, which is characterized by a makeshift function. The makeshift fine tunes the clusters formed by the processed objectives so as to improve the clustering with respect to the unprocessed objectives, given the slack. We present makeshift for solving three different classes of objectives and analyze their solution guarantees. Finally, we empirically demonstrate the effectiveness of our approach on three applications using real-world data.

Submitted 2 March, 2019; originally announced March 2019.

arXiv:1804.05013 [pdf, other]

Connectivity in Random Annulus Graphs and the Geometric Block Model

Authors: Sainyam Galhotra, Arya Mazumdar, Soumyabrata Pal, Barna Saha

Abstract: We provide new connectivity results for vertex-random graphs or random annulus graphs which are significant generalizations of random geometric graphs. Random geometric graphs (RGG) are one of the most basic models of random graphs for spatial networks proposed by Gilbert in 1961, shortly after the introduction of the Erdős-Rényi random graphs. They resemble social networks in many ways (e.g. by spontaneously creating clusters of nodes with high modularity). The connectivity properties of RGG have been studied since its introduction, and analyzing them has been significantly harder than their Erdős-Rényi counterparts due to correlated edge formation. Our next contribution is in using the connectivity of random annulus graphs to provide necessary and sufficient conditions for efficient recovery of communities for the geometric block model (GBM). The GBM is a probabilistic model for community detection defined over an RGG in a similar spirit as the popular stochastic block model, which is defined over an Erdős-Rényi random graph. The geometric block model inherits the transitivity properties of RGGs and thus models communities better than a stochastic block model. However, analyzing them requires fresh perspectives as all prior tools fail due to correlation in edge formation. We provide a simple and efficient algorithm that can recover communities in GBM exactly with high probability in the regime of connectivity.

Submitted 14 May, 2020; v1 submitted 12 April, 2018; originally announced April 2018.

arXiv:1709.05510 [pdf, other]

The Geometric Block Model

Authors: Sainyam Galhotra, Arya Mazumdar, Soumyabrata Pal, Barna Saha

Abstract: To capture the inherent geometric features of many community detection problems, we propose to use a new random graph model of communities that we call a Geometric Block Model. The geometric block model generalizes the random geometric graphs in the same way that the well-studied stochastic block model generalizes the Erdős-Rényi random graphs. It is also a natural extension of random community models inspired by the recent theoretical and practical advancement in community detection. While being a topic of fundamental theoretical interest, our main contribution is to show that many practical community structures are better explained by the geometric block model. We also show that a simple triangle-counting algorithm to detect communities in the geometric block model is near-optimal. Indeed, even in the regime where the average degree of the graph grows only logarithmically with the number of vertices (sparse-graph), we show that this algorithm performs extremely well, both theoretically and practically. In contrast, the triangle-counting algorithm is far from being optimum for the stochastic block model. We simulate our results on both real and synthetic datasets to show superior performance of both the new model as well as our algorithm.

Submitted 24 January, 2018; v1 submitted 16 September, 2017; originally announced September 2017.

Comments: A shorter version of this paper has appeared in 32nd AAAI Conference on Artificial Intelligence. The AAAI proceedings version as well as the previous version in arxiv contained some errors that have been corrected in this version

ACM Class: E.1

arXiv:1709.03221 [pdf, other]

doi 10.1145/3106237.3106277

Fairness Testing: Testing Software for Discrimination

Authors: Sainyam Galhotra, Yuriy Brun, Alexandra Meliou

Abstract: This paper defines software fairness and discrimination and develops a testing-based method for measuring if and how much software discriminates, focusing on causality in discriminatory behavior. Evidence of software discrimination has been found in modern software systems that recommend criminal sentences, grant access to financial products, and determine who is allowed to participate in promotions. Our approach, Themis, generates efficient test suites to measure discrimination. Given a schema describing valid system inputs, Themis generates discrimination tests automatically and does not require an oracle. We evaluate Themis on 20 software systems, 12 of which come from prior work with explicit focus on avoiding discrimination. We find that (1) Themis is effective at discovering software discrimination, (2) state-of-the-art techniques for removing discrimination from algorithms fail in many situations, at times discriminating against as much as 98% of an input subdomain, (3) Themis optimizations are effective at producing efficient test suites for measuring discrimination, and (4) Themis is more efficient on systems that exhibit more discrimination. We thus demonstrate that fairness testing is a critical aspect of the software development cycle in domains with possible discrimination and provide initial tools for measuring software discrimination.

Submitted 10 September, 2017; originally announced September 2017.

Comments: Sainyam Galhotra, Yuriy Brun, and Alexandra Meliou. 2017. Fairness Testing: Testing Software for Discrimination. In Proceedings of 2017 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), Paderborn, Germany, September 4-8, 2017 (ESEC/FSE'17). https://doi.org/10.1145/3106237.3106277, ESEC/FSE, 2017

arXiv:1602.03110 [pdf, other]

doi 10.1145/2882903.2882929

Holistic Influence Maximization: Combining Scalability and Efficiency with Opinion-Aware Models

Authors: Sainyam Galhotra, Akhil Arora, Shourya Roy

Abstract: The steady growth of graph data from social networks has resulted in widespread research in finding solutions to the influence maximization problem. In this paper, we propose a holistic solution to the influence maximization (IM) problem. (1) We introduce an opinion-cum-interaction (OI) model that closely mirrors the real-world scenarios. Under the OI model, we introduce a novel problem of Maximizing the Effective Opinion (MEO) of influenced users. We prove that the MEO problem is NP-hard and cannot be approximated within a constant ratio unless P=NP. (2) We propose a heuristic algorithm OSIM to efficiently solve the MEO problem. To better explain the OSIM heuristic, we first introduce EaSyIM - the opinion-oblivious version of OSIM, a scalable algorithm capable of running within practical compute times on commodity hardware. In addition to serving as a fundamental building block for OSIM, EaSyIM is capable of addressing the scalability aspect - memory consumption and running time, of the IM problem as well. Empirically, our algorithms are capable of maintaining the deviation in the spread always within 5% of the best known methods in the literature. In addition, our experiments show that both OSIM and EaSyIM are effective, efficient, scalable and significantly enhance the ability to analyze real datasets.

Submitted 9 February, 2016; originally announced February 2016.

Comments: ACM SIGMOD Conference 2016, 18 pages, 29 figures

ACM Class: H.2.8

arXiv:1408.5069 [pdf, ps, other]

doi 10.1145/2663353

Optimal Radius for Connectivity in Duty-Cycled Wireless Sensor Networks

Authors: Amitabha Bagchi, Cristina Pinotti, Sainyam Galhotra, Tarun Mangla

Abstract: We investigate the condition on transmission radius needed to achieve connectivity in duty-cycled wireless sensor networks (briefly, DC-WSN). First, we settle a conjecture of Das et al. (2012) and prove that the connectivity condition on Random Geometric Graphs (RGG), given by Gupta and Kumar (1989), can be used to derive a weak sufficient condition to achieve connectivity in DC-WSN. To find a stronger result, we define a new vertex-based random connection model which is of independent interest. Following a proof technique of Penrose (1991) we prove that when the density of the nodes approaches infinity then a finite component of size greater than 1 exists with probability 0 in this model. We use this result to obtain a condition on node transmission radius which is both necessary and sufficient to achieve connectivity and is hence optimal. The optimality of such a radius is also tested via simulation for two specific duty-cycle schemes, called the contiguous and the random selection duty-cycle scheme. Finally, we design a minimum-radius duty-cycling scheme that achieves connectivity with a transmission radius arbitrarily close to the one required in Random Geometric Graphs. The overhead in this case is that we have to spend some time computing the schedule.

Submitted 28 August, 2014; v1 submitted 21 August, 2014; originally announced August 2014.

Comments: To appear in ACM Transactions on Sensor Networks. Brief version appeared in Proc. of ACM MSWIM 2013

MSC Class: 60K35 ACM Class: C.2.1

Journal ref: ACM T Sensor Network 11(2):36, (February 2015)
