
Showing 1–26 of 26 results for author: Galhotra, S

  1. arXiv:2405.14728  [pdf, ps, other]

    cs.AI cs.LG

    Intervention and Conditioning in Causal Bayesian Networks

    Authors: Sainyam Galhotra, Joseph Y. Halpern

    Abstract: Causal models are crucial for understanding complex systems and identifying causal relationships among variables. Even though causal models are extremely popular, conditional probability calculation of formulas involving interventions poses significant challenges. In case of Causal Bayesian Networks (CBNs), Pearl assumes autonomy of mechanisms that determine interventions to calculate a range of pr…

    Submitted 23 May, 2024; originally announced May 2024.

  2. arXiv:2404.04713  [pdf, other]

    cs.DB cs.DS

    Faster Algorithms for Fair Max-Min Diversification in $\mathbb{R}^d$

    Authors: Yash Kurkure, Miles Shamo, Joseph Wiseman, Sainyam Galhotra, Stavros Sintos

    Abstract: The task of extracting a diverse subset from a dataset, often referred to as maximum diversification, plays a pivotal role in various real-world applications that have far-reaching consequences. In this work, we delve into the realm of fairness-aware data subset selection, specifically focusing on the problem of selecting a diverse set of size $k$ from a large collection of $n$ data points (FairDi…

    Submitted 14 May, 2024; v1 submitted 6 April, 2024; originally announced April 2024.

    Journal ref: SIGMOD 2024

  3. arXiv:2310.17843  [pdf, other]

    cs.LG cs.GT

    A Data-Centric Online Market for Machine Learning: From Discovery to Pricing

    Authors: Minbiao Han, Jonathan Light, Steven Xia, Sainyam Galhotra, Raul Castro Fernandez, Haifeng Xu

    Abstract: Data fuels machine learning (ML) - rich and high-quality training data is essential to the success of ML. However, to transform ML from the race among a few large corporations to an accessible technology that serves numerous normal users' data analysis requests, there still exist important challenges. One gap we observed is that many ML users can benefit from new data that other data owners posses…

    Submitted 26 October, 2023; originally announced October 2023.

  4. arXiv:2304.09068  [pdf, other]

    cs.DB cs.LG

    METAM: Goal-Oriented Data Discovery

    Authors: Sainyam Galhotra, Yue Gong, Raul Castro Fernandez

    Abstract: Data is a central component of machine learning and causal inference tasks. The availability of large amounts of data from sources such as open data repositories, data lakes and data marketplaces creates an opportunity to augment data and boost those tasks' performance. However, augmentation techniques rely on a user manually discovering and shortlisting useful candidate augmentations. Existing so…

    Submitted 18 April, 2023; originally announced April 2023.

    Comments: ICDE 2023 paper

  5. arXiv:2303.01378  [pdf, other]

    cs.AI cs.DB cs.LG

    A Vision for Semantically Enriched Data Science

    Authors: Udayan Khurana, Kavitha Srinivas, Sainyam Galhotra, Horst Samulowitz

    Abstract: The recent efforts in automation of machine learning or data science have achieved success in various tasks such as hyper-parameter optimization or model selection. However, key areas such as utilizing domain knowledge and data semantics are areas where we have seen little automation. Data Scientists have long leveraged common sense reasoning and domain knowledge to understand and enrich data for b…

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2205.08018

  6. arXiv:2212.10839  [pdf, other]

    cs.LG cs.AI cs.DB stat.ML

    Consistent Range Approximation for Fair Predictive Modeling

    Authors: Jiongli Zhu, Sainyam Galhotra, Nazanin Sabri, Babak Salimi

    Abstract: This paper proposes a novel framework for certifying the fairness of predictive models trained on biased data. It draws from query answering for incomplete and inconsistent databases to formulate the problem of consistent range approximation (CRA) of fairness queries for a predictive model on a target population. The framework employs background knowledge of the data collection process and biased…

    Submitted 28 July, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

  7. arXiv:2206.11303  [pdf, other]

    cs.SI cs.DM cs.IT cs.LG

    Community Recovery in the Geometric Block Model

    Authors: Sainyam Galhotra, Arya Mazumdar, Soumyabrata Pal, Barna Saha

    Abstract: To capture the inherent geometric features of many community detection problems, we propose to use a new random graph model of communities that we call a Geometric Block Model. The geometric block model builds on the random geometric graphs (Gilbert, 1961), one of the basic models of random graphs for spatial networks, in the same way that the well-studied stochastic block model builds on the Erdő…

    Submitted 17 November, 2023; v1 submitted 22 June, 2022; originally announced June 2022.

    Comments: 53 pages, 18 figures. Accepted at the Journal of Machine Learning Research (JMLR). Shorter versions accepted in AAAI 2018 (see arXiv:1709.05510) and RANDOM 2019 (see arXiv:1804.05013). arXiv admin note: text overlap with arXiv:1804.05013

    Journal ref: Journal of Machine Learning Research (JMLR) 2023

  8. arXiv:2203.14692  [pdf, other]

    cs.DB

    HypeR: Hypothetical Reasoning With What-If and How-To Queries Using a Probabilistic Causal Approach

    Authors: Sainyam Galhotra, Amir Gilad, Sudeepa Roy, Babak Salimi

    Abstract: What-if (provisioning for an update to a database) and how-to (how to modify the database to achieve a goal) analyses provide insights to users who wish to examine hypothetical scenarios without making actual changes to a database and thereby help plan strategies in their fields. Typically, such analyses are done by testing the effect of an update in the existing database on a specific view create…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: Full version of the SIGMOD 2022 paper with the same title

  9. arXiv:2106.01543  [pdf, other]

    cs.DB

    Ver: View Discovery in the Wild

    Authors: Yue Gong, Zhiru Zhu, Sainyam Galhotra, Raul Castro Fernandez

    Abstract: We present Ver, a data discovery system that identifies project-join views over large repositories of tables that do not contain join path information, and even when input queries are inaccurate. Ver implements a reference architecture to solve both the technical (scale and search) and human (semantic ambiguity, navigating a large number of results) problems of view discovery. We demonstrate users…

    Submitted 4 October, 2022; v1 submitted 2 June, 2021; originally announced June 2021.

  10. arXiv:2105.06058  [pdf, other]

    cs.DB

    DataExposer: Exposing Disconnect between Data and Systems

    Authors: Sainyam Galhotra, Anna Fariha, Raoni Lourenço, Juliana Freire, Alexandra Meliou, Divesh Srivastava

    Abstract: As data is a central component of many modern systems, the cause of a system malfunction may reside in the data, and, specifically, particular properties of the data. For example, a health-monitoring system that is designed under the assumption that weight is reported in imperial units (lbs) will malfunction when encountering weight reported in metric units (kilograms). Similar to software debuggi…

    Submitted 12 May, 2021; originally announced May 2021.

  11. arXiv:2105.05782  [pdf, other]

    cs.DS cs.DB stat.ML

    How to Design Robust Algorithms using Noisy Comparison Oracle

    Authors: Raghavendra Addanki, Sainyam Galhotra, Barna Saha

    Abstract: Metric based comparison operations such as finding maximum, nearest and farthest neighbor are fundamental to studying various clustering techniques such as $k$-center clustering and agglomerative hierarchical clustering. These techniques crucially rely on accurate estimation of pairwise distance between records. However, computing exact features of the records, and their pairwise distances is ofte…

    Submitted 12 May, 2021; originally announced May 2021.

    Comments: PVLDB 2021

  12. arXiv:2103.11972  [pdf, other]

    cs.AI cs.DB cs.LG

    Explaining Black-Box Algorithms Using Probabilistic Contrastive Counterfactuals

    Authors: Sainyam Galhotra, Romila Pradhan, Babak Salimi

    Abstract: There has been a recent resurgence of interest in explainable artificial intelligence (XAI) that aims to reduce the opaqueness of AI-based decision-making systems, allowing humans to scrutinize and trust them. Prior work in this context has focused on the attribution of responsibility for an algorithm's decisions to its inputs wherein responsibility is typically approached as a purely associationa…

    Submitted 23 June, 2021; v1 submitted 22 March, 2021; originally announced March 2021.

    Comments: Proceedings of the 2021 International Conference on Management of Data. ACM, 2021

  13. arXiv:2102.03977  [pdf, other]

    stat.ML cs.AI cs.CY cs.DS cs.LG

    Learning to Generate Fair Clusters from Demonstrations

    Authors: Sainyam Galhotra, Sandhya Saisubramanian, Shlomo Zilberstein

    Abstract: Fair clustering is the process of grouping similar entities together, while satisfying a mathematically well-defined fairness metric as a constraint. Due to the practical challenges in precise model specification, the prescribed fairness constraints are often incomplete and act as proxies to the intended fairness requirement, leading to biased outcomes when the system is deployed. We examine how t…

    Submitted 7 February, 2021; originally announced February 2021.

  14. arXiv:2012.08594  [pdf, other]

    cs.AI cs.DB

    Semantic Annotation for Tabular Data

    Authors: Udayan Khurana, Sainyam Galhotra

    Abstract: Detecting semantic concept of columns in tabular data is of particular interest to many applications ranging from data integration, cleaning, search to feature engineering and model building in machine learning. Recently, several works have proposed supervised learning-based or heuristic pattern-based approaches to semantic type annotation. Both have shortcomings that prevent them from generalizin…

    Submitted 15 December, 2020; originally announced December 2020.

  15. arXiv:2006.06053  [pdf, other]

    cs.LG cs.CY cs.DB stat.ML

    Causal Feature Selection for Algorithmic Fairness

    Authors: Sainyam Galhotra, Karthikeyan Shanmugam, Prasanna Sattigeri, Kush R. Varshney

    Abstract: The use of machine learning (ML) in high-stakes societal decisions has encouraged the consideration of fairness throughout the ML lifecycle. Although data integration is one of the primary steps to generate high quality training data, most of the fairness literature ignores this stage. In this work, we consider fairness in the integration component of data management, aiming to identify features t…

    Submitted 31 March, 2022; v1 submitted 10 June, 2020; originally announced June 2020.

    Comments: Full version of the paper at SIGMOD 2022

  16. Efficient and Effective ER with Progressive Blocking

    Authors: Sainyam Galhotra, Donatella Firmani, Barna Saha, Divesh Srivastava

    Abstract: Blocking is a mechanism to improve the efficiency of Entity Resolution (ER) which aims to quickly prune out all non-matching record pairs. However, depending on the distributions of entity cluster sizes, existing techniques can be either (a) too aggressive, such that they help scale but can adversely affect the ER effectiveness, or (b) too permissive, potentially harming ER efficiency. In this pap…

    Submitted 16 March, 2021; v1 submitted 28 May, 2020; originally announced May 2020.

    Comments: Galhotra, S., Firmani, D., Saha, B. et al. Efficient and effective ER with progressive blocking. The VLDB Journal (2021)

  17. arXiv:2005.06133  [pdf, other]

    cs.DB cs.LG

    Adaptive Rule Discovery for Labeling Text Data

    Authors: Sainyam Galhotra, Behzad Golshan, Wang-Chiew Tan

    Abstract: Creating and collecting labeled data is one of the major bottlenecks in machine learning pipelines and the emergence of automated feature generation techniques such as deep learning, which typically requires a lot of training data, has further exacerbated the problem. While weak-supervision techniques have circumvented this bottleneck, existing frameworks either require users to write a set of div…

    Submitted 12 May, 2020; originally announced May 2020.

  18. arXiv:2002.03508  [pdf, other]

    cs.DS cs.AI cs.LG stat.ML

    Fair Correlation Clustering

    Authors: Saba Ahmadi, Sainyam Galhotra, Barna Saha, Roy Schwartz

    Abstract: In this paper we study the problem of correlation clustering under fairness constraints. In the classic correlation clustering problem, we are given a complete graph where each edge is labeled positive or negative. The goal is to obtain a clustering of the vertices that minimizes disagreements -- the number of negative edges trapped inside a cluster plus positive edges between different clusters.…

    Submitted 9 February, 2020; originally announced February 2020.

  19. arXiv:1912.07820  [pdf, other]

    stat.ML cs.DS cs.LG

    Balancing the Tradeoff Between Clustering Value and Interpretability

    Authors: Sandhya Saisubramanian, Sainyam Galhotra, Shlomo Zilberstein

    Abstract: Graph clustering groups entities -- the vertices of a graph -- based on their similarity, typically using a complex distance function over a large number of features. Successful integration of clustering approaches in automated decision-support systems hinges on the interpretability of the resulting clusters. This paper addresses the problem of generating interpretable clusters, given features of…

    Submitted 30 January, 2020; v1 submitted 17 December, 2019; originally announced December 2019.

    Comments: Accepted at AIES 2020

  20. arXiv:1907.00117  [pdf, ps, other]

    cs.DS

    Min-Max Correlation Clustering via MultiCut

    Authors: Saba Ahmadi, Sainyam Galhotra, Samir Khuller, Barna Saha, Roy Schwartz

    Abstract: Correlation clustering is a fundamental combinatorial optimization problem arising in many contexts and applications that has been the subject of dozens of papers in the literature. In this problem we are given a general weighted graph where each edge is labeled positive or negative. The goal is to obtain a partitioning (clustering) of the vertices that minimizes disagreements - weight of negative…

    Submitted 28 June, 2019; originally announced July 2019.

  21. arXiv:1903.00750  [pdf, other]

    stat.ML cs.AI cs.DS cs.LG

    Lexicographically Ordered Multi-Objective Clustering

    Authors: Sainyam Galhotra, Sandhya Saisubramanian, Shlomo Zilberstein

    Abstract: We introduce a rich model for multi-objective clustering with lexicographic ordering over objectives and a slack. The slack denotes the allowed multiplicative deviation from the optimal objective value of the higher priority objective to facilitate improvement in lower-priority objectives. We then propose an algorithm called Zeus to solve this class of problems, which is characterized by a makeshi…

    Submitted 2 March, 2019; originally announced March 2019.

  22. arXiv:1804.05013  [pdf, other]

    cs.DM cs.DS cs.IT cs.LG

    Connectivity in Random Annulus Graphs and the Geometric Block Model

    Authors: Sainyam Galhotra, Arya Mazumdar, Soumyabrata Pal, Barna Saha

    Abstract: We provide new connectivity results for vertex-random graphs or random annulus graphs which are significant generalizations of random geometric graphs. Random geometric graphs (RGG) are one of the most basic models of random graphs for spatial networks proposed by Gilbert in 1961, shortly after the introduction of the Erdős-Rényi random graphs. They resemble social networks in many…

    Submitted 14 May, 2020; v1 submitted 12 April, 2018; originally announced April 2018.

  23. arXiv:1709.05510  [pdf, other]

    cs.SI cs.DS stat.ML

    The Geometric Block Model

    Authors: Sainyam Galhotra, Arya Mazumdar, Soumyabrata Pal, Barna Saha

    Abstract: To capture the inherent geometric features of many community detection problems, we propose to use a new random graph model of communities that we call a Geometric Block Model. The geometric block model generalizes the random geometric graphs in the same way that the well-studied stochastic block model generalizes the Erdos-Renyi random graphs. It is also a natural extension of random community mo…

    Submitted 24 January, 2018; v1 submitted 16 September, 2017; originally announced September 2017.

    Comments: A shorter version of this paper has appeared in 32nd AAAI Conference on Artificial Intelligence. The AAAI proceedings version as well as the previous version in arxiv contained some errors that have been corrected in this version

    ACM Class: E.1

  24. arXiv:1709.03221  [pdf, other]

    cs.SE cs.AI cs.CY cs.DB cs.LG

    Fairness Testing: Testing Software for Discrimination

    Authors: Sainyam Galhotra, Yuriy Brun, Alexandra Meliou

    Abstract: This paper defines software fairness and discrimination and develops a testing-based method for measuring if and how much software discriminates, focusing on causality in discriminatory behavior. Evidence of software discrimination has been found in modern software systems that recommend criminal sentences, grant access to financial products, and determine who is allowed to participate in promotio…

    Submitted 10 September, 2017; originally announced September 2017.

    Comments: Sainyam Galhotra, Yuriy Brun, and Alexandra Meliou. 2017. Fairness Testing: Testing Software for Discrimination. In Proceedings of 2017 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), Paderborn, Germany, September 4-8, 2017 (ESEC/FSE'17). https://doi.org/10.1145/3106237.3106277, ESEC/FSE, 2017

  25. Holistic Influence Maximization: Combining Scalability and Efficiency with Opinion-Aware Models

    Authors: Sainyam Galhotra, Akhil Arora, Shourya Roy

    Abstract: The steady growth of graph data from social networks has resulted in wide-spread research in finding solutions to the influence maximization problem. In this paper, we propose a holistic solution to the influence maximization (IM) problem. (1) We introduce an opinion-cum-interaction (OI) model that closely mirrors the real-world scenarios. Under the OI model, we introduce a novel problem of Maximi…

    Submitted 9 February, 2016; originally announced February 2016.

    Comments: ACM SIGMOD Conference 2016, 18 pages, 29 figures

    ACM Class: H.2.8

  26. arXiv:1408.5069  [pdf, ps, other]

    cs.NI math.PR

    Optimal Radius for Connectivity in Duty-Cycled Wireless Sensor Networks

    Authors: Amitabha Bagchi, Cristina Pinotti, Sainyam Galhotra, Tarun Mangla

    Abstract: We investigate the condition on transmission radius needed to achieve connectivity in duty-cycled wireless sensor networks (briefly, DC-WSN). First, we settle a conjecture of Das et al. (2012) and prove that the connectivity condition on Random Geometric Graphs (RGG), given by Gupta and Kumar (1989), can be used to derive a weak sufficient condition to achieve connectivity in DC-WSN. To find a st…

    Submitted 28 August, 2014; v1 submitted 21 August, 2014; originally announced August 2014.

    Comments: To appear in ACM Transactions on Sensor Networks. Brief version appeared in Proc. of ACM MSWIM 2013

    MSC Class: 60K35 ACM Class: C.2.1

    Journal ref: ACM T Sensor Network 11(2):36, (February 2015)