The Role of Learning Algorithms in Collective Action

Authors: Omri Ben-Dov, Jake Fawkes, Samira Samadi, Amartya Sanyal

Abstract: Collective action in machine learning is the study of the control that a coordinated group can have over machine learning algorithms. While previous research has concentrated on assessing the impact of collectives against Bayes (sub-)optimal classifiers, this perspective is limited in that it does not account for the choice of learning algorithm. Since classifiers seldom behave like Bayes classifiers and are influenced by the choice of learning algorithms along with their inherent biases, in this work we initiate the study of how the choice of the learning algorithm plays a role in the success of a collective in practical settings. Specifically, we focus on distributionally robust optimization (DRO), popular for improving a worst group error, and on the ubiquitous stochastic gradient descent (SGD), due to its inductive bias for "simpler" functions. Our empirical results, supported by a theoretical foundation, show that the effective size and success of the collective are highly dependent on properties of the learning algorithm. This highlights the necessity of taking the learning algorithm into account when studying the impact of collective action in machine learning.

Submitted 4 June, 2024; v1 submitted 10 May, 2024; originally announced May 2024.

Comments: Accepted at the International Conference on Machine Learning (ICML), 2024

arXiv:2402.04579 [pdf, other]

Collective Counterfactual Explanations via Optimal Transport

Authors: Ahmad-Reza Ehyaei, Ali Shirali, Samira Samadi

Abstract: Counterfactual explanations provide individuals with cost-optimal actions that can alter their labels to desired classes. However, if substantial instances seek state modification, such individual-centric methods can lead to new competitions and unanticipated costs. Furthermore, these recommendations, disregarding the underlying data distribution, may suggest actions that users perceive as outliers. To address these issues, our work proposes a collective approach for formulating counterfactual explanations, with an emphasis on utilizing the current density of the individuals to inform the recommended actions. Our problem naturally casts as an optimal transport problem. Leveraging the extensive literature on optimal transport, we illustrate how this collective method improves upon the desiderata of classical counterfactual explanations. We support our proposal with numerical simulations, illustrating the effectiveness of the proposed approach and its relation to classic methods.

Submitted 6 February, 2024; originally announced February 2024.

arXiv:2312.02110 [pdf, ps, other]

Fourier Methods for Sufficient Dimension Reduction in Time Series

Authors: S. Yaser Samadi, Tharindu P. De Alwis

Abstract: Dimensionality reduction has always been one of the most significant and challenging problems in the analysis of high-dimensional data. In the context of time series analysis, our focus is on the estimation and inference of conditional mean and variance functions. By using central mean and variance dimension reduction subspaces that preserve sufficient information about the response, one can effectively estimate the unknown mean and variance functions of the time series. While the literature presents several approaches to estimate the time series central mean and variance subspaces (TS-CMS and TS-CVS), these methods tend to be computationally intensive and infeasible for practical applications. By employing the Fourier transform, we derive explicit estimators for TS-CMS and TS-CVS. These proposed estimators are demonstrated to be consistent, asymptotically normal, and efficient. Simulation studies have been conducted to evaluate the performance of the proposed method. The results show that our method is significantly more accurate and computationally efficient than existing methods. Furthermore, the method has been applied to the Canadian Lynx dataset.

Submitted 4 December, 2023; originally announced December 2023.

arXiv:2309.12902 [pdf, ps, other]

doi 10.1080/07350015.2023.2260862

Reduced-rank Envelope Vector Autoregressive Models

Authors: S. Yaser Samadi, Wiranthe B. Herath

Abstract: The standard vector autoregressive (VAR) models suffer from overparameterization which is a serious issue for high-dimensional time series data as it restricts the number of variables and lags that can be incorporated into the model. Several statistical methods, such as the reduced-rank model for multivariate (multiple) time series (Velu et al., 1986; Reinsel and Velu, 1998; Reinsel et al., 2022) and the Envelope VAR model (Wang and Ding, 2018), provide solutions for achieving dimension reduction of the parameter space of the VAR model. However, these methods can be inefficient in extracting relevant information from complex data, as they fail to distinguish between relevant and irrelevant information, or they are inefficient in addressing the rank deficiency problem. We put together the idea of envelope models into the reduced-rank VAR model to simultaneously tackle these challenges, and propose a new parsimonious version of the classical VAR model called the reduced-rank envelope VAR (REVAR) model. Our proposed REVAR model incorporates the strengths of both reduced-rank VAR and envelope VAR models and leads to significant gains in efficiency and accuracy. The asymptotic properties of the proposed estimators are established under different error assumptions. Simulation studies and real data analysis are conducted to evaluate and illustrate the proposed method.

Submitted 22 September, 2023; originally announced September 2023.

Journal ref: Journal of Business and Economic Statistics, 2023

arXiv:2305.18263 [pdf, other]

doi 10.1007/s11634-023-00546-6

MLE for the parameters of bivariate interval-valued models

Authors: S. Yaser Samadi, L. Billard, Jiin-Huarng Guo, Wei Xu

Abstract: With contemporary data sets becoming too large to analyze the data directly, various forms of aggregated data are becoming common. The original individual data are points, but after aggregation, the observations are interval-valued (e.g.). While some researchers simply analyze the set of averages of the observations by aggregated class, it is easily established that this approach ignores much of the information in the original data set. The initial theoretical work for interval-valued data was that of Le-Rademacher and Billard (2011), but those results were limited to estimation of the mean and variance of a single variable only. This article seeks to redress the limitation of their work by deriving the maximum likelihood estimator for the all-important covariance statistic, a basic requirement for numerous methodologies, such as regression, principal components, and canonical analyses. Asymptotic properties of the proposed estimators are established. The Le-Rademacher and Billard results emerge as special cases of our wider derivations.

Submitted 29 May, 2023; originally announced May 2023.

Comments: Will appear in ADAC

Journal ref: Advances in Data Analysis and Classification, 2023

arXiv:2204.08341 [pdf, other]

itdr: An R package of Integral Transformation Methods to Estimate the SDR Subspaces in Regression

Authors: Tharindu P. De Alwis, S. Yaser Samadi, Jiaying Weng

Abstract: Sufficient dimension reduction (SDR) is an effective tool for regression models, offering a viable approach to address and analyze the nonlinear nature of regression problems. This paper introduces the itdr R package, a comprehensive and user-friendly tool that introduces several functions based on integral transformation methods for estimating SDR subspaces. In particular, the itdr package incorporates two key methods, namely the Fourier method (FM) and the convolution method (CM). These methods allow for estimating the SDR subspaces, namely the central mean subspace (CMS) and the central subspace (CS), in cases where the response is univariate. Furthermore, the itdr package facilitates the recovery of the CMS through the iterative Hessian transformation (IHT) method for univariate responses. Additionally, it enables the recovery of the CS by employing various Fourier transformation strategies, such as the inverse dimension reduction method, the minimum discrepancy approach using Fourier transformation, and the Fourier transform sparse inverse regression approach, specifically designed for cases with multivariate responses. To demonstrate its capabilities, the itdr package is applied to five different datasets. Furthermore, this package is the pioneering implementation of integral transformation methods for estimating SDR subspaces, thus promising significant advancements in SDR research.

Submitted 16 July, 2023; v1 submitted 13 April, 2022; originally announced April 2022.

Comments: 17 pages, 1 figure

arXiv:2105.03153 [pdf, other]

Pairwise Fairness for Ordinal Regression

Authors: Matthäus Kleindessner, Samira Samadi, Muhammad Bilal Zafar, Krishnaram Kenthapadi, Chris Russell

Abstract: We initiate the study of fairness for ordinal regression. We adapt two fairness notions previously considered in fair ranking and propose a strategy for training a predictor that is approximately fair according to either notion. Our predictor has the form of a threshold model, composed of a scoring function and a set of thresholds, and our strategy is based on a reduction to fair binary classification for learning the scoring function and local search for choosing the thresholds. We provide generalization guarantees on the error and fairness violation of our predictor, and we illustrate the effectiveness of our approach in extensive experiments.

Submitted 11 February, 2022; v1 submitted 7 May, 2021; originally announced May 2021.

arXiv:2012.02021 [pdf, ps, other]

doi 10.1080/02331888.2020.1867140

Modeling Count Data via Copulas

Authors: Hadi Safari-Katesari, S. Yaser Samadi, Samira Zaroudi

Abstract: Copula models have been widely used to model the dependence between continuous random variables, but modeling count data via copulas has recently become popular in the statistics literature. Spearman's rho is an appropriate and effective tool to measure the degree of dependence between two random variables. In this paper, we derived the population version of Spearman's rho correlation via copulas when both random variables are discrete. The closed-form expressions of the Spearman correlation are obtained for some copulas of simple structure such as Archimedean copulas with different marginal distributions. We derive the upper bound and the lower bound of the Spearman's rho for Bernoulli random variables. Then, the proposed Spearman's rho correlations are compared with their corresponding Kendall's tau values. We characterize the functional relationship between these two measures of dependence in some special cases. An extensive simulation study is conducted to demonstrate the validity of our theoretical results. Finally, we propose a bivariate copula regression model to analyze the count data of a cervical cancer dataset.

Submitted 3 December, 2020; originally announced December 2020.

Comments: 33 pages

Report number: 2020 MSC Class: 60E15; 62p10

Journal ref: Statistics 2020

arXiv:2006.10085 [pdf, other]

Socially Fair k-Means Clustering

Authors: Mehrdad Ghadiri, Samira Samadi, Santosh Vempala

Abstract: We show that the popular k-means clustering algorithm (Lloyd's heuristic), used for a variety of scientific data, can result in outcomes that are unfavorable to subgroups of data (e.g., demographic groups). Such biased clusterings can have deleterious implications for human-centric applications such as resource allocation. We present a fair k-means objective and algorithm to choose cluster centers that provide equitable costs for different groups. The algorithm, Fair-Lloyd, is a modification of Lloyd's heuristic for k-means, inheriting its simplicity, efficiency, and stability. In comparison with standard Lloyd's, we find that on benchmark datasets, Fair-Lloyd exhibits unbiased performance by ensuring that all groups have equal costs in the output k-clustering, while incurring a negligible increase in running time, thus making it a viable fair option wherever k-means is currently used.

Submitted 29 October, 2020; v1 submitted 17 June, 2020; originally announced June 2020.

Comments: 12 pages, 11 figures

arXiv:1901.08668 [pdf, other]

Guarantees for Spectral Clustering with Fairness Constraints

Authors: Matthäus Kleindessner, Samira Samadi, Pranjal Awasthi, Jamie Morgenstern

Abstract: Given the widespread popularity of spectral clustering (SC) for partitioning graph data, we study a version of constrained SC in which we try to incorporate the fairness notion proposed by Chierichetti et al. (2017). According to this notion, a clustering is fair if every demographic group is approximately proportionally represented in each cluster. To this end, we develop variants of both normalized and unnormalized constrained SC and show that they help find fairer clusterings on both synthetic and real data. We also provide a rigorous theoretical analysis of our algorithms on a natural variant of the stochastic block model, where $h$ groups have strong inter-group connectivity, but also exhibit a "natural" clustering structure which is fair. We prove that our algorithms can recover this fair clustering with high probability.

Submitted 10 May, 2019; v1 submitted 24 January, 2019; originally announced January 2019.

arXiv:1811.00103 [pdf, other]

The Price of Fair PCA: One Extra Dimension

Authors: Samira Samadi, Uthaipon Tantipongpipat, Jamie Morgenstern, Mohit Singh, Santosh Vempala

Abstract: We investigate whether the standard dimensionality reduction technique of PCA inadvertently produces data representations with different fidelity for two different populations. We show on several real-world data sets, PCA has higher reconstruction error on population A than on B (for example, women versus men or lower- versus higher-educated individuals). This can happen even when the data set has a similar number of samples from A and B. This motivates our study of dimensionality reduction techniques which maintain similar fidelity for A and B. We define the notion of Fair PCA and give a polynomial-time algorithm for finding a low dimensional representation of the data which is nearly-optimal with respect to this measure. Finally, we show on real-world data sets that our algorithm can be used to efficiently generate a fair low dimensional representation of the data.

Submitted 31 October, 2018; originally announced November 2018.

Showing 1–11 of 11 results for author: Samadi, S