Addressing Discretization-Induced Bias in Demographic Prediction

Authors: Evan Dong, Aaron Schein, Yixin Wang, Nikhil Garg

Abstract: Racial and other demographic imputation is necessary for many applications, especially in auditing disparities and outreach targeting in political campaigns. The canonical approach is to construct continuous predictions -- e.g., based on name and geography -- and then to $\textit{discretize}$ the predictions by selecting the most likely class (argmax). We study how this practice produces $\textit{discretization bias}$. In particular, we show that argmax labeling, as used by a prominent commercial voter file vendor to impute race/ethnicity, results in a substantial under-count of African-American voters, e.g., by 28.2% points in North Carolina. This bias can have substantial implications in downstream tasks that use such labels. We then introduce a $\textit{joint optimization}$ approach -- and a tractable $\textit{data-driven thresholding}$ heuristic -- that can eliminate this bias, with negligible individual-level accuracy loss. Finally, we theoretically analyze discretization bias, show that calibrated continuous models are insufficient to eliminate it, and that an approach such as ours is necessary. Broadly, we warn researchers and practitioners against discretizing continuous demographic predictions without considering downstream consequences.

Submitted 26 May, 2024; originally announced May 2024.

Comments: A version of this paper was accepted to the 2024 ACM Conference on Fairness, Accountability, and Transparency

ACM Class: K.4.0

arXiv:2404.04633 [pdf, other]

Context versus Prior Knowledge in Language Models

Authors: Kevin Du, Vésteinn Snæbjarnarson, Niklas Stoehr, Jennifer C. White, Aaron Schein, Ryan Cotterell

Abstract: To answer a question, language models often need to integrate prior knowledge learned during pretraining and new information presented in context. We hypothesize that models perform this integration in a predictable way across different questions and contexts: models will rely more on prior knowledge for questions about entities (e.g., persons, places, etc.) that they are more familiar with due to higher exposure in the training corpus, and be more easily persuaded by some contexts than others. To formalize this problem, we propose two mutual information-based metrics to measure a model's dependency on a context and on its prior about an entity: first, the persuasion score of a given context represents how much a model depends on the context in its decision, and second, the susceptibility score of a given entity represents how much the model can be swayed away from its original answer distribution about an entity. We empirically test our metrics for their validity and reliability. Finally, we explore and find a relationship between the scores and the model's expected familiarity with an entity, and provide two use cases to illustrate their benefits.

Submitted 16 June, 2024; v1 submitted 6 April, 2024; originally announced April 2024.

Comments: Long paper accepted at ACL 2024

arXiv:2403.06153 [pdf, other]

The AL$\ell_0$CORE Tensor Decomposition for Sparse Count Data

Authors: John Hood, Aaron Schein

Abstract: This paper introduces AL$\ell_0$CORE, a new form of probabilistic non-negative tensor decomposition. AL$\ell_0$CORE is a Tucker decomposition where the number of non-zero elements (i.e., the $\ell_0$-norm) of the core tensor is constrained to a preset value $Q$ much smaller than the size of the core. While the user dictates the total budget $Q$, the locations and values of the non-zero elements are latent variables and allocated across the core tensor during inference. AL$\ell_0$CORE -- i.e., $allo$cated $\ell_0$-$co$nstrained $core$ -- thus enjoys both the computational tractability of CP decomposition and the qualitatively appealing latent structure of Tucker. In a suite of real-data experiments, we demonstrate that AL$\ell_0$CORE typically requires only tiny fractions (e.g., 1%) of the full core to achieve the same results as full Tucker decomposition at only a correspondingly tiny fraction of the cost.

Submitted 12 March, 2024; v1 submitted 10 March, 2024; originally announced March 2024.

arXiv:2312.09203 [pdf, other]

Measurement in the Age of LLMs: An Application to Ideological Scaling

Authors: Sean O'Hagan, Aaron Schein

Abstract: Much of social science is centered around terms like "ideology" or "power", which generally elude precise definition, and whose contextual meanings are trapped in surrounding language. This paper explores the use of large language models (LLMs) to flexibly navigate the conceptual clutter inherent to social scientific measurement tasks. We rely on LLMs' remarkable linguistic fluency to elicit ideological scales of both legislators and text, which accord closely to established methods and our own judgement. A key aspect of our approach is that we elicit such scores directly, instructing the LLM to furnish numeric scores itself. This approach affords a great deal of flexibility, which we showcase through a variety of different case studies. Our results suggest that LLMs can be used to characterize highly subtle and diffuse manifestations of political ideology in text.

Submitted 7 April, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

Comments: Under review at Harvard Data Science Review. Previously presented at the 4th International Conference of Social Computing in Beijing, China, September 2023, the New Directions in Analyzing Text as Data (TADA) meeting in Amherst, MA, USA, November 2023, and the NeurIPS workshop titled "I Can't Believe It's Not Better!" Failure Modes in the Age of Foundation Models in New Orleans, LA, December 2023

arXiv:2212.04130 [pdf, other]

The Ordered Matrix Dirichlet for State-Space Models

Authors: Niklas Stoehr, Benjamin J. Radford, Ryan Cotterell, Aaron Schein

Abstract: Many dynamical systems in the real world are naturally described by latent states with intrinsic orderings, such as "ally", "neutral", and "enemy" relationships in international relations. These latent states manifest through countries' cooperative versus conflictual interactions over time. State-space models (SSMs) explicitly relate the dynamics of observed measurements to transitions in latent states. For discrete data, SSMs commonly do so through a state-to-action emission matrix and a state-to-state transition matrix. This paper introduces the Ordered Matrix Dirichlet (OMD) as a prior distribution over ordered stochastic matrices wherein the discrete distribution in the kth row stochastically dominates the (k+1)th, such that probability mass is shifted to the right when moving down rows. We illustrate the OMD prior within two SSMs: a hidden Markov model, and a novel dynamic Poisson Tucker decomposition model tailored to international relations data. We find that models built on the OMD recover interpretable ordered latent structure without forfeiting predictive performance. We suggest future applications to other domains where models with stochastic matrices are popular (e.g., topic modeling), and publish user-friendly code.

Submitted 25 February, 2023; v1 submitted 8 December, 2022; originally announced December 2022.

Comments: Presented at the 26th International Conference on Artificial Intelligence and Statistics (AISTATS) 2023

arXiv:2210.03971 [pdf, other]

An Ordinal Latent Variable Model of Conflict Intensity

Authors: Niklas Stoehr, Lucas Torroba Hennigen, Josef Valvoda, Robert West, Ryan Cotterell, Aaron Schein

Abstract: Measuring the intensity of events is crucial for monitoring and tracking armed conflict. Advances in automated event extraction have yielded massive data sets of "who did what to whom" micro-records that enable data-driven approaches to monitoring conflict. The Goldstein scale is a widely-used expert-based measure that scores events on a conflictual-cooperative scale. It is based only on the action category ("what") and disregards the subject ("who") and object ("to whom") of an event, as well as contextual information, like associated casualty count, that should contribute to the perception of an event's "intensity". This paper takes a latent variable-based approach to measuring conflict intensity. We introduce a probabilistic generative model that assumes each observed event is associated with a latent intensity class. A novel aspect of this model is that it imposes an ordering on the classes, such that higher-valued classes denote higher levels of intensity. The ordinal nature of the latent variable is induced from naturally ordered aspects of the data (e.g., casualty counts) where higher values naturally indicate higher intensity. We evaluate the proposed model both intrinsically and extrinsically, showing that it obtains comparatively good held-out predictive performance.

Submitted 4 June, 2023; v1 submitted 8 October, 2022; originally announced October 2022.

Comments: Long Paper at ACL 2023

arXiv:2106.06691 [pdf, other]

Doubly Non-Central Beta Matrix Factorization for DNA Methylation Data

Authors: Aaron Schein, Anjali Nagulpally, Hanna Wallach, Patrick Flaherty

Abstract: We present a new non-negative matrix factorization model for $(0,1)$ bounded-support data based on the doubly non-central beta (DNCB) distribution, a generalization of the beta distribution. The expressiveness of the DNCB distribution is particularly useful for modeling DNA methylation datasets, which are typically highly dispersed and multi-modal; however, the model structure is sufficiently general that it can be adapted to many other domains where latent representations of $(0,1)$ bounded-support data are of interest. Although the DNCB distribution lacks a closed-form conjugate prior, several augmentations let us derive an efficient posterior inference algorithm composed entirely of analytic updates. Our model improves out-of-sample predictive performance on both real and synthetic DNA methylation datasets over state-of-the-art methods in bioinformatics. In addition, our model yields meaningful latent representations that accord with existing biological knowledge.

Submitted 12 June, 2021; originally announced June 2021.

Comments: To appear in the Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI) 2021

arXiv:2011.13998 [pdf, other]

doi 10.1002/nme.6667

Preserving general physical properties in model reduction of dynamical systems via constrained-optimization projection

Authors: A. Schein, K. T. Carlberg, M. J. Zahr

Abstract: Model-reduction techniques aim to reduce the computational complexity of simulating dynamical systems by applying a (Petrov-)Galerkin projection process that enforces the dynamics to evolve in a low-dimensional subspace of the original state space. Frequently, the resulting reduced-order model (ROM) violates intrinsic physical properties of the original full-order model (FOM) (e.g., global conservation, Lagrangian structure, state-variable bounds) because the projection process does not generally ensure preservation of these properties. However, in many applications, ensuring the ROM preserves such intrinsic properties can enable the ROM to retain physical meaning and lead to improved accuracy and stability properties. In this work, we present a general constrained-optimization formulation for projection-based model reduction that can be used as a template to enforce the ROM to satisfy specific properties on the kinematics and dynamics. We introduce constrained-optimization formulations at both the time-continuous (i.e., ODE) level, which leads to a constrained Galerkin projection, and at the time-discrete level, which leads to a least-squares Petrov-Galerkin (LSPG) projection, in the context of linear multistep schemes. We demonstrate the ability of the proposed formulations to equip ROMs with desired properties such as global energy conservation and bounds on the total variation.

Submitted 1 April, 2021; v1 submitted 27 November, 2020; originally announced November 2020.

arXiv:1910.12991 [pdf, other]

Poisson-Randomized Gamma Dynamical Systems

Authors: Aaron Schein, Scott W. Linderman, Mingyuan Zhou, David M. Blei, Hanna Wallach

Abstract: This paper presents the Poisson-randomized gamma dynamical system (PRGDS), a model for sequentially observed count tensors that encodes a strong inductive bias toward sparsity and burstiness. The PRGDS is based on a new motif in Bayesian latent variable modeling, an alternating chain of discrete Poisson and continuous gamma latent states that is analytically convenient and computationally tractable. This motif yields closed-form complete conditionals for all variables by way of the Bessel distribution and a novel discrete distribution that we call the shifted confluent hypergeometric distribution. We draw connections to closely related models and compare the PRGDS to these models in studies of real-world count data sets of text, international events, and neural spike trains. We find that a sparse variant of the PRGDS, which allows the continuous gamma latent states to take values of exactly zero, often obtains better predictive performance than other models and is uniquely capable of inferring latent structures that are highly localized in time.

Submitted 28 October, 2019; originally announced October 2019.

Comments: To appear in the Proceedings of the 32nd Advances in Neural Information Processing Systems (NeurIPS 2019)

arXiv:1807.08225 [pdf, other]

The Hyperedge Event Model

Authors: Bomin Kim, Aaron Schein, Bruce A. Desmarais, Hanna Wallach

Abstract: We introduce the hyperedge event model (HEM)---a generative model for events that can be represented as directed edges with one sender and one or more receivers or one receiver and one or more senders. We integrate a dynamic version of the exponential random graph model (ERGM) of edge structure with a survival model for event timing to jointly understand who interacts with whom, and when. The HEM offers three innovations with respect to the literature---first, it extends a growing class of dynamic network models to model hyperedges. The current state-of-the-art approach to dealing with hyperedges is to inappropriately break them into separate edges/events. Second, our model involves a novel receiver selection distribution that is based on established edge formation models, but assures non-empty receiver lists. Third, the HEM integrates separate, but interacting, equations governing edge formation and event timing. We use the HEM to analyze emails sent among department managers in Montgomery County government in North Carolina. Our application demonstrates that the model is effective at predicting and explaining time-stamped network data involving edges with multiple receivers. We present an out-of-sample prediction experiment to illustrate how researchers can select between different specifications of the model.

Submitted 21 July, 2018; originally announced July 2018.

arXiv:1803.08471 [pdf, other]

Locally Private Bayesian Inference for Count Models

Authors: Aaron Schein, Zhiwei Steven Wu, Alexandra Schofield, Mingyuan Zhou, Hanna Wallach

Abstract: We present a general method for privacy-preserving Bayesian inference in Poisson factorization, a broad class of models that includes some of the most widely used models in the social sciences. Our method satisfies limited precision local privacy, a generalization of local differential privacy, which we introduce to formulate privacy guarantees appropriate for sparse count data. We develop an MCMC algorithm that approximates the locally private posterior over model parameters given data that has been locally privatized by the geometric mechanism (Ghosh et al., 2012). Our solution is based on two insights: 1) a novel reinterpretation of the geometric mechanism in terms of the Skellam distribution (Skellam, 1946) and 2) a general theorem that relates the Skellam to the Bessel distribution (Yuan & Kalbfleisch, 2000). We demonstrate our method in two case studies on real-world email data in which we show that our method consistently outperforms the commonly-used naive approach, obtaining higher quality topics in text and more accurate link prediction in networks. On some tasks, our privacy-preserving method even outperforms non-private inference which conditions on the true data.

Submitted 21 February, 2019; v1 submitted 22 March, 2018; originally announced March 2018.

arXiv:1701.05573 [pdf, other]

Poisson--Gamma Dynamical Systems

Authors: Aaron Schein, Mingyuan Zhou, Hanna Wallach

Abstract: We introduce a new dynamical system for sequentially observed multivariate count data. This model is based on the gamma--Poisson construction---a natural choice for count data---and relies on a novel Bayesian nonparametric prior that ties and shrinks the model parameters, thus avoiding overfitting. We present an efficient MCMC inference algorithm that advances recent work on augmentation schemes for inference in negative binomial models. Finally, we demonstrate the model's inductive bias using a variety of real-world data sets, showing that it exhibits superior predictive performance over other models and infers highly interpretable latent structure.

Submitted 19 January, 2017; originally announced January 2017.

Comments: Appeared in the Proceedings of the 29th Advances in Neural Information Processing Systems (NIPS 2016)

arXiv:1606.01855 [pdf, other]

Bayesian Poisson Tucker Decomposition for Learning the Structure of International Relations

Authors: Aaron Schein, Mingyuan Zhou, David M. Blei, Hanna Wallach

Abstract: We introduce Bayesian Poisson Tucker decomposition (BPTD) for modeling country--country interaction event data. These data consist of interaction events of the form "country $i$ took action $a$ toward country $j$ at time $t$." BPTD discovers overlapping country--community memberships, including the number of latent communities. In addition, it discovers directed community--community interaction networks that are specific to "topics" of action types and temporal "regimes." We show that BPTD yields an efficient MCMC inference algorithm and achieves better predictive performance than related models. We also demonstrate that it discovers interpretable latent structure that agrees with our knowledge of international relations.

Submitted 6 June, 2016; originally announced June 2016.

Comments: To appear in Proceedings of the 33rd International Conference on Machine Learning (ICML 2016)

arXiv:1506.03493 [pdf, other]

Bayesian Poisson Tensor Factorization for Inferring Multilateral Relations from Sparse Dyadic Event Counts

Authors: Aaron Schein, John Paisley, David M. Blei, Hanna Wallach

Abstract: We present a Bayesian tensor factorization model for inferring latent group structures from dynamic pairwise interaction patterns. For decades, political scientists have collected and analyzed records of the form "country $i$ took action $a$ toward country $j$ at time $t$"---known as dyadic events---in order to form and test theories of international relations. We represent these event data as a tensor of counts and develop Bayesian Poisson tensor factorization to infer a low-dimensional, interpretable representation of their salient patterns. We demonstrate that our model's predictive performance is better than that of standard non-negative tensor factorization methods. We also provide a comparison of our variational updates to their maximum likelihood counterparts. In doing so, we identify a better way to form point estimates of the latent factors than that typically used in Bayesian Poisson matrix factorization. Finally, we showcase our model as an exploratory analysis tool for political scientists. We show that the inferred latent factor matrices capture interpretable multilateral relations that both conform to and inform our knowledge of international affairs.

Submitted 10 June, 2015; originally announced June 2015.

Comments: To appear in Proceedings of the 21st ACM SIGKDD Conference of Knowledge Discovery and Data Mining (KDD 2015)

arXiv:1311.3982 [pdf, other]

Inferring Multilateral Relations from Dynamic Pairwise Interactions

Authors: Aaron Schein, Juston Moore, Hanna Wallach

Abstract: Correlations between anomalous activity patterns can yield pertinent information about complex social processes: a significant deviation from normal behavior, exhibited simultaneously by multiple pairs of actors, provides evidence for some underlying relationship involving those pairs---i.e., a multilateral relation. We introduce a new nonparametric Bayesian latent variable model that explicitly captures correlations between anomalous interaction counts and uses these shared deviations from normal activity patterns to identify and characterize multilateral relations. We showcase our model's capabilities using the newly curated Global Database of Events, Location, and Tone, a dataset that has seen considerable interest in the social sciences and the popular press, but which is largely unexplored by the machine learning community. We provide a detailed analysis of the latent structure inferred by our model and show that the multilateral relations correspond to major international events and long-term international relationships. These findings lead us to recommend our model for any data-driven analysis of interaction networks where dynamic interactions over the edges provide evidence for latent social structure.

Submitted 15 November, 2013; originally announced November 2013.

Comments: NIPS 2013 Workshop on Frontiers of Network Analysis

Showing 1–15 of 15 results for author: Schein, A