-
Mitigating the optical depth degeneracy in the cosmological measurement of neutrino masses using 21-cm observations
Authors:
Gali Shmueli,
Debanjan Sarkar,
Ely D. Kovetz
Abstract:
Massive neutrinos modify the expansion history of the universe and suppress structure formation below their free-streaming scale. Cosmic microwave background (CMB) observations at small angular scales can be used to constrain the total mass $Σm_ν$ of the three neutrino flavors. However, at these scales, the CMB-measured $Σm_ν$ is degenerate with $τ$, the optical depth to reionization, which quantifies the damping of CMB anisotropies due to the scattering of CMB photons with free electrons along the line of sight. Here we revisit the idea of using 21-cm power spectrum observations to provide direct estimates for $τ$. A joint analysis of CMB and 21-cm data can alleviate the $τ-Σm_ν$ degeneracy, making it possible to measure $Σm_ν$ with unprecedented precision. Forecasting for the upcoming Hydrogen Epoch of Reionization Array (HERA), we find that a $\lesssim\mathcal{O}(10\%)$ measurement of $τ$ is achievable, which would enable a $\gtrsim 5σ$ measurement of $Σm_ν=60\,[{\rm meV}]$, for any astrophysics model that we considered. Precise estimates of $τ$ also help reduce uncertainties in other cosmological parameters, such as $A_s$, the amplitude of the primordial scalar fluctuations power spectrum.
Submitted 11 May, 2023;
originally announced May 2023.
-
Monetizing Explainable AI: A Double-edged Sword
Authors:
Travis Greene,
Sofie Goethals,
David Martens,
Galit Shmueli
Abstract:
Algorithms used by organizations increasingly wield power in society as they decide the allocation of key resources and basic goods. In order to promote fairer, juster, and more transparent uses of such decision-making power, explainable artificial intelligence (XAI) aims to provide insights into the logic of algorithmic decision-making. Despite much research on the topic, consumer-facing applications of XAI remain rare. A central reason may be that a viable platform-based monetization strategy for this new technology has yet to be found. We introduce and describe a novel monetization strategy for fusing algorithmic explanations with programmatic advertising via an explanation platform. We claim the explanation platform represents a new, socially-impactful, and profitable form of human-algorithm interaction and estimate its potential for revenue generation in the high-risk domains of finance, hiring, and education. We then consider possible undesirable and unintended effects of monetizing XAI and simulate these scenarios using real-world credit lending data. Ultimately, we argue that monetizing XAI may be a double-edged sword: while monetization may incentivize industry adoption of XAI in a variety of consumer applications, it may also conflict with the original legal and ethical justifications for developing XAI. We conclude by discussing whether there may be ways to responsibly and democratically harness the potential of monetized XAI to provide greater consumer access to algorithmic explanations.
Submitted 27 March, 2023;
originally announced April 2023.
-
Atomist or Holist? A Diagnosis and Vision for More Productive Interdisciplinary AI Ethics Dialogue
Authors:
Travis Greene,
Amit Dhurandhar,
Galit Shmueli
Abstract:
In response to growing recognition of the social impact of new AI-based technologies, major AI and ML conferences and journals now encourage or require papers to include ethics impact statements and undergo ethics reviews. This move has sparked heated debate concerning the role of ethics in AI research, at times devolving into name-calling and threats of "cancellation." We diagnose this conflict as one between atomist and holist ideologies. Among other things, atomists believe facts are and should be kept separate from values, while holists believe facts and values are and should be inextricable from one another. With the goal of reducing disciplinary polarization, we draw on numerous philosophical and historical sources to describe each ideology's core beliefs and assumptions. Finally, we call on atomists and holists within the ever-expanding data science community to exhibit greater empathy during ethical disagreements and propose four targeted strategies to ensure AI research benefits society.
Submitted 12 November, 2022; v1 submitted 19 August, 2022;
originally announced August 2022.
-
Forks Over Knives: Predictive Inconsistency in Criminal Justice Algorithmic Risk Assessment Tools
Authors:
Travis Greene,
Galit Shmueli,
Jan Fell,
Ching-Fu Lin,
Han-Wei Liu
Abstract:
Big data and algorithmic risk prediction tools promise to improve criminal justice systems by reducing human biases and inconsistencies in decision making. Yet different, equally-justifiable choices when developing, testing, and deploying these sociotechnical tools can lead to disparate predicted risk scores for the same individual. Synthesizing diverse perspectives from machine learning, statistics, sociology, criminology, law, philosophy and economics, we conceptualize this phenomenon as predictive inconsistency. We describe sources of predictive inconsistency at different stages of algorithmic risk assessment tool development and deployment and consider how future technological developments may amplify predictive inconsistency. We argue, however, that in a diverse and pluralistic society we should not expect to completely eliminate predictive inconsistency. Instead, to bolster the legal, political, and scientific legitimacy of algorithmic risk prediction tools, we propose identifying and documenting relevant and reasonable "forking paths" to enable quantifiable, reproducible multiverse and specification curve analyses of predictive inconsistency at the individual level.
Submitted 22 September, 2022; v1 submitted 1 December, 2020;
originally announced December 2020.
-
Beyond Our Behavior: The GDPR and Humanistic Personalization
Authors:
Travis Greene,
Galit Shmueli
Abstract:
Personalization should take the human person seriously. This requires a deeper understanding of how recommender systems can shape both our self-understanding and identity. We unpack key European humanistic and philosophical ideas underlying the General Data Protection Regulation (GDPR) and propose a new paradigm of humanistic personalization. Humanistic personalization responds to the IEEE's call for Ethically Aligned Design (EAD) and is based on fundamental human capacities and values. Humanistic personalization focuses on narrative accuracy: the subjective fit between a person's self-narrative and both the input (personal data) and output of a recommender system. In doing so, we re-frame the distinction between implicit and explicit data collection as one of nonconscious ("organismic") behavior and conscious ("reflective") action. This distinction raises important ethical and interpretive issues related to agency, self-understanding, and political participation. Finally, we discuss how an emphasis on narrative accuracy can reduce opportunities for epistemic injustice done to data subjects.
Submitted 31 August, 2020;
originally announced August 2020.
-
How to "Improve" Prediction Using Behavior Modification
Authors:
Galit Shmueli,
Ali Tafti
Abstract:
Many internet platforms that collect behavioral big data use it to predict user behavior for internal purposes and for their business customers (e.g., advertisers, insurers, security forces, governments, political consulting firms) who utilize the predictions for personalization, targeting, and other decision-making. Improving predictive accuracy is therefore extremely valuable. Data science researchers design algorithms, models, and approaches to improve prediction. Prediction is also improved with larger and richer data. Beyond improving algorithms and data, platforms can stealthily achieve better prediction accuracy by pushing users' behaviors towards their predicted values, using behavior modification techniques, thereby demonstrating more certain predictions. Such apparent "improved" prediction can result from employing reinforcement learning algorithms that combine prediction and behavior modification. This strategy is absent from the machine learning and statistics literature. Investigating its properties requires integrating causal with predictive notation. To this end, we incorporate Pearl's causal do(.) operator into the predictive vocabulary. We then decompose the expected prediction error given behavior modification, and identify the components impacting predictive power. Our derivation elucidates implications of such behavior modification to data scientists, platforms, their customers, and the humans whose behavior is manipulated. Behavior modification can make users' behavior more predictable and even more homogeneous; yet this apparent predictability might not generalize when business customers use predictions in practice. Outcomes pushed towards their predictions can be at odds with customers' intentions, and harmful to manipulated users.
Submitted 23 July, 2022; v1 submitted 26 August, 2020;
originally announced August 2020.
-
Selected Topics in Statistical Computing
Authors:
Suneel Babu Chatla,
Chun-houh Chen,
Galit Shmueli
Abstract:
The field of computational statistics refers to statistical methods or tools that are computationally intensive. Due to the recent advances in computing power some of these methods have become prominent and central to modern data analysis. In this article we focus on several of the main methods including density estimation, kernel smoothing, smoothing splines, and additive models. While the field of computational statistics includes many more methods, this article serves as a brief introduction to selected popular topics.
Submitted 24 April, 2020;
originally announced April 2020.
-
A Tree-based Semi-Varying Coefficient Model for the COM-Poisson Distribution
Authors:
Suneel Babu Chatla,
Galit Shmueli
Abstract:
We propose a tree-based semi-varying coefficient model for the Conway-Maxwell-Poisson (CMP or COM-Poisson) distribution which is a two-parameter generalization of the Poisson distribution and is flexible enough to capture both under-dispersion and over-dispersion in count data. The advantage of tree-based methods is their scalability to high-dimensional data. We develop CMPMOB, an estimation procedure for a semi-varying coefficient model, using model-based recursive partitioning (MOB). The proposed framework is broader than the existing MOB framework as it allows node-invariant effects to be included in the model. To simplify the computational burden of the exhaustive search employed in the original MOB algorithm, a new split point estimation procedure is proposed by borrowing tools from change point estimation methodology. The proposed method uses only the estimated score functions without fitting models for each split point and, therefore, is computationally simpler. Since the tree-based methods only provide a piece-wise constant approximation to the underlying smooth function, we propose the CMPBoost semi-varying coefficient model which uses the gradient boosting procedure for estimation. The usefulness of the proposed methods is illustrated using simulation studies and a real example from a bike sharing system in Washington, DC.
Submitted 24 April, 2020;
originally announced April 2020.
-
How Personal is Machine Learning Personalization?
Authors:
Travis Greene,
Galit Shmueli
Abstract:
Though used extensively, the concept and process of machine learning (ML) personalization have generally received little attention from academics, practitioners, and the general public. We describe the ML approach as relying on the metaphor of the person as a feature vector and contrast this with humanistic views of the person. In light of the recent calls by the IEEE to consider the effects of ML on human well-being, we ask whether ML personalization can be reconciled with these humanistic views of the person, which highlight the importance of moral and social identity. As human behavior increasingly becomes digitized, analyzed, and predicted, to what extent do our subsequent decisions about what to choose, buy, or do, made both by us and others, reflect who we are as persons? This paper first explicates the term personalization by considering ML personalization and highlights its relation to humanistic conceptions of the person, then proposes several dimensions for evaluating the degree of personalization of ML personalized scores. By doing so, we hope to contribute to current debate on the issues of algorithmic bias, transparency, and fairness in machine learning.
Submitted 23 December, 2019; v1 submitted 17 December, 2019;
originally announced December 2019.
-
Lift Up and Act! Classifier Performance in Resource-Constrained Applications
Authors:
Galit Shmueli
Abstract:
Classification tasks are common across many fields and applications where the decision maker's action is limited by resource constraints. In direct marketing only a subset of customers is contacted; scarce human resources limit the number of interviews to the most promising job candidates; limited donated organs are prioritized to those with best fit. In such scenarios, performance measures such as the classification matrix, ROC analysis, and even ranking metrics such as AUC measure outcomes different from the action of interest. At the same time, gains and lift that do measure the relevant outcome are rarely used by machine learners. In this paper we define resource-constrained classifier performance as a task distinguished from classification and ranking. We explain how gains and lift can lead to different algorithm choices and discuss the effect of class distribution.
Submitted 20 June, 2019; v1 submitted 7 June, 2019;
originally announced June 2019.
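The gains and lift measures mentioned in the abstract above can be illustrated with a minimal sketch. It uses the common textbook definitions (gain is the share of all positives captured in the top-ranked fraction; lift is the hit rate in that fraction divided by the overall hit rate), which may differ in detail from the paper's own formulation. The function name `gains_and_lift` and the toy data are hypothetical, and the sketch assumes binary 0/1 labels with at least one positive.

```python
def gains_and_lift(y_true, scores, fraction):
    """Return (gain, lift) for the top `fraction` of records ranked by score.

    Assumes y_true holds 0/1 labels with at least one positive.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n = len(y_true)
    k = max(1, round(n * fraction))  # number of records we can afford to act on
    # Rank records from highest to lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits_in_top = sum(label for _, label in ranked[:k])
    total_hits = sum(y_true)
    gain = hits_in_top / total_hits               # share of all positives captured
    lift = (hits_in_top / k) / (total_hits / n)   # improvement over random selection
    return gain, lift


# Toy example (hypothetical data): 10 records, 3 positives, budget of 20%.
labels = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
gain, lift = gains_and_lift(labels, scores, 0.2)
```

With a 20% budget, only the two highest-scored records are acted on, so the measure reflects the resource constraint directly rather than performance over the whole ranking.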
-
Efficient Estimation of COM-Poisson Regression and Generalized Additive Model
Authors:
Suneel Babu Chatla,
Galit Shmueli
Abstract:
The Conway-Maxwell-Poisson (CMP) or COM-Poisson regression is a popular model for count data due to its ability to capture both under-dispersion and over-dispersion. However, CMP regression is limited when dealing with complex nonlinear relationships. With today's wide availability of count data, especially due to the growing collection of data on human and social behavior, there is a need for count data models that can capture complex nonlinear relationships. One useful approach is additive models; however, there has been no additive model implementation for the CMP distribution. To fill this void, we first propose a flexible estimation framework for CMP regression based on iteratively reweighted least squares (IRLS) and then extend this model to allow for additive components using a penalized splines approach. Because the CMP distribution belongs to the exponential family, convergence of IRLS is guaranteed under some regularity conditions. Further, it is also known that IRLS provides smaller standard errors compared to gradient-based methods. We illustrate the usefulness of this approach through extensive simulation studies and using real data from a bike sharing system in Washington, DC.
Submitted 24 April, 2020; v1 submitted 26 October, 2016;
originally announced October 2016.
-
Modeling Bimodal Discrete Data Using Conway-Maxwell-Poisson Mixture Models
Authors:
Pragya Sur,
Galit Shmueli,
Smarajit Bose,
Paromita Dubey
Abstract:
Bimodal truncated count distributions are frequently observed in aggregate survey data and in user ratings when respondents are mixed in their opinion. They also arise in censored count data, where the highest category might create an additional mode. Modeling bimodal behavior in discrete data is useful for various purposes, from comparing shapes of different samples (or survey questions) to predicting future ratings by new raters. The Poisson distribution is the most common distribution for fitting count data and can be modified to achieve mixtures of truncated Poisson distributions. However, it is suitable only for modeling equi-dispersed distributions and is limited in its ability to capture bimodality. The Conway-Maxwell-Poisson (CMP) distribution is a two-parameter generalization of the Poisson distribution that allows for over- and under-dispersion. In this work, we propose a mixture of CMPs for capturing a wide range of truncated discrete data, which can exhibit unimodal and bimodal behavior. We present methods for estimating the parameters of a mixture of two CMP distributions using an EM approach. Our approach introduces a special two-step optimization within the M step to estimate multiple parameters. We examine computational and theoretical issues. The methods are illustrated for modeling ordered rating data as well as truncated count data, using simulated and real examples.
Submitted 23 January, 2014; v1 submitted 2 September, 2013;
originally announced September 2013.
-
To Explain or to Predict?
Authors:
Galit Shmueli
Abstract:
Statistical modeling is a powerful tool for developing and testing theories by way of causal explanation, prediction, and description. In many disciplines there is near-exclusive use of statistical modeling for causal explanation and the assumption that models with high explanatory power are inherently of high predictive power. Conflation between explanation and prediction is common, yet the distinction must be understood for progressing scientific knowledge. While this distinction has been recognized in the philosophy of science, the statistical literature lacks a thorough discussion of the many differences that arise in the process of modeling for an explanatory versus a predictive goal. The purpose of this article is to clarify the distinction between explanatory and predictive modeling, to discuss its sources, and to reveal the practical implications of the distinction to each step in the modeling process.
Submitted 5 January, 2011;
originally announced January 2011.
-
A flexible regression model for count data
Authors:
Kimberly F. Sellers,
Galit Shmueli
Abstract:
Poisson regression is a popular tool for modeling count data and is applied in a vast array of applications from the social to the physical sciences and beyond. Real data, however, are often over- or under-dispersed and, thus, not conducive to Poisson regression. We propose a regression model based on the Conway--Maxwell-Poisson (COM-Poisson) distribution to address this problem. The COM-Poisson regression generalizes the well-known Poisson and logistic regression models, and is suitable for fitting count data with a wide range of dispersion levels. With a GLM approach that takes advantage of exponential family properties, we discuss model estimation, inference, diagnostics, and interpretation, and present a test for determining the need for a COM-Poisson regression over a standard Poisson regression. We compare the COM-Poisson to several alternatives and illustrate its advantages and usefulness using three data sets with varying dispersion.
Submitted 9 November, 2010;
originally announced November 2010.
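As a minimal sketch of the distribution behind the abstract above, the COM-Poisson probability mass function is commonly written as $P(Y=y) = \lambda^y / \big((y!)^{\nu} Z(\lambda,\nu)\big)$ with $Z(\lambda,\nu) = \sum_{j \ge 0} \lambda^j/(j!)^{\nu}$. The code below evaluates it by truncating the infinite sum; the function name `cmp_pmf` and the truncation parameters `max_terms` and `tol` are illustrative assumptions, not part of any published implementation.

```python
import math


def cmp_pmf(y, lam, nu, max_terms=1000, tol=1e-12):
    """COM-Poisson pmf P(Y = y) with rate `lam` > 0 and dispersion `nu` >= 0.

    The normalizing constant Z is an infinite series; here it is truncated
    once a term becomes negligible relative to the running sum. For nu == 0
    the series only converges when lam < 1.
    """
    if lam <= 0 or nu < 0:
        raise ValueError("require lam > 0 and nu >= 0")

    def log_weight(j):
        # log of lam**j / (j!)**nu, computed on the log scale for stability
        return j * math.log(lam) - nu * math.lgamma(j + 1)

    z = 0.0
    for j in range(max_terms):
        term = math.exp(log_weight(j))
        z += term
        if term < tol * z:  # remaining tail terms are negligible
            break
    return math.exp(log_weight(y)) / z


# nu = 1 recovers the ordinary Poisson pmf; nu > 1 gives under-dispersion,
# nu < 1 gives over-dispersion.
p_poisson_like = cmp_pmf(3, lam=2.0, nu=1.0)
```

Setting `nu=1` reduces the weights to $\lambda^y/y!$, so the result matches the Poisson pmf, while a very large `nu` pushes nearly all mass onto 0 and 1, which is the Bernoulli limit that connects the model to logistic regression.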
-
The BARISTA: A model for bid arrivals in online auctions
Authors:
Galit Shmueli,
Ralph P. Russo,
Wolfgang Jank
Abstract:
The arrival process of bidders and bids in online auctions is important for studying and modeling supply and demand in the online marketplace. A popular assumption in the online auction literature is that a Poisson bidder arrival process is a reasonable approximation. This approximation underlies theoretical derivations, statistical models and simulations used in field studies. However, when it comes to the bid arrivals, empirical research has shown that the process is far from Poisson, with early bidding and last-moment bids taking place. An additional feature that has been reported by various authors is an apparent self-similarity in the bid arrival process. Despite the wide evidence for the changing bidding intensities and the self-similarity, there has been no rigorous attempt at developing a model that adequately approximates bid arrivals and accounts for these features. The goal of this paper is to introduce a family of distributions that well-approximate the bid time distribution in hard-close auctions. We call this the BARISTA process (Bid ARrivals In STAges) because of its ability to generate different intensities at different stages. We describe the properties of this model, show how to simulate bid arrivals from it, and how to use it for estimation and inference. We illustrate its power and usefulness by fitting simulated and real data from eBay.com. Finally, we show how a Poisson bidder arrival process relates to a BARISTA bid arrival process.
Submitted 12 December, 2007;
originally announced December 2007.
-
An Elegant Method for Generating Multivariate Poisson Random Variable
Authors:
Inbal Yahav,
Galit Shmueli
Abstract:
Generating multivariate Poisson data is essential in many applications. Current simulation methods suffer from limitations ranging from computational complexity to restrictions on the structure of the correlation matrix. We propose a computationally efficient and conceptually appealing method for generating multivariate Poisson data. The method is based on simulating multivariate Normal data and converting them to achieve a specific correlation matrix and Poisson rate vector. This allows for generating data that have positive or negative correlations as well as different rates.
Submitted 12 March, 2008; v1 submitted 30 October, 2007;
originally announced October 2007.
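The general idea in the abstract above (simulate correlated Normal data, then convert it to Poisson margins) can be sketched for the bivariate case using only the Python standard library. This is an illustration of the Normal-to-Poisson transformation, not the authors' exact procedure: in particular, `rho` here is the correlation of the underlying Normal pair, and the sketch omits any correction step that maps a target Poisson correlation to the Normal correlation needed to produce it. The function names `poisson_quantile` and `bivariate_poisson` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def poisson_quantile(u, rate):
    """Smallest k with P(X <= k) >= u for X ~ Poisson(rate)."""
    u = min(u, 1.0 - 1e-12)  # guard against u == 1.0, which has no finite quantile
    k = 0
    p = math.exp(-rate)      # P(X = 0)
    cdf = p
    while cdf < u and p > 0.0:
        k += 1
        p *= rate / k        # recurrence: P(X = k) = P(X = k - 1) * rate / k
        cdf += p
    return k


def bivariate_poisson(n, rate1, rate2, rho, seed=None):
    """Draw n pairs with Poisson margins from a correlated Normal pair.

    `rho` is the correlation of the latent Normal variables (an assumption of
    this sketch); the resulting Poisson correlation is close to, but not
    exactly, rho.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second standard Normal with correlation rho to z1.
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Normal CDF maps each coordinate to Uniform(0, 1) ...
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        # ... and the Poisson quantile function maps it to the target margin.
        pairs.append((poisson_quantile(u1, rate1), poisson_quantile(u2, rate2)))
    return pairs


sample = bivariate_poisson(1000, rate1=2.0, rate2=5.0, rho=-0.6, seed=0)
```

Because the dependence is introduced at the Normal level, negative correlations and unequal rates are both straightforward, which is the flexibility the abstract highlights.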
-
Functional Data Analysis in Electronic Commerce Research
Authors:
Wolfgang Jank,
Galit Shmueli
Abstract:
This paper describes opportunities and challenges of using functional data analysis (FDA) for the exploration and analysis of data originating from electronic commerce (eCommerce). We discuss the special data structures that arise in the online environment and why FDA is a natural approach for representing and analyzing such data. The paper reviews several FDA methods and motivates their usefulness in eCommerce research by providing a glimpse into new domain insights that they allow. We argue that the wedding of eCommerce with FDA leads to innovations both in statistical methodology, due to the challenges and complications that arise in eCommerce data, and in online research, by being able to ask (and subsequently answer) new research questions that classical statistical methods are not able to address, and also by expanding on research questions beyond the ones traditionally asked in the offline environment. We describe several applications originating from online transactions which are new to the statistics literature, and point out statistical challenges accompanied by some solutions. We also discuss some promising future directions for joint research efforts between researchers in eCommerce and statistics.
Submitted 6 September, 2006;
originally announced September 2006.
-
A Special Issue on Statistical Challenges and Opportunities in Electronic Commerce Research
Authors:
Wolfgang Jank,
Galit Shmueli
Abstract:
This special issue is a product of the First Interdisciplinary Symposium on Statistical Challenges and Opportunities in Electronic Commerce Research, which took place on May 22--23, 2005, at the Robert H. Smith School of Business, University of Maryland, College Park (\url{www.smith.umd.edu/dit/statschallenges/}). The symposium brought together, for the first time, researchers from statistics, information systems, and related fields, all of whom work or are interested in empirical research related to electronic commerce. The goal of the symposium was to cross the borders, discuss joint research opportunities, expose this field and its statistical challenges, and promote collaboration between the different fields.
Submitted 11 September, 2006; v1 submitted 6 September, 2006;
originally announced September 2006.