Search | arXiv e-print repository

Sowing 'Seeds of Doubt': Cottage Industries of Election and Medical Misinformation in Brazil and the United States

Authors: Amelia Hassoun, Gabrielle Borenstein, Beth Goldberg, Jacob McAuliffe, Katy Osborn

Abstract: We conducted ethnographic research with 31 misinformation creators and consumers in Brazil and the US before, during, and after a major election to understand the consumption and production of election and medical misinformation. This study contributes to research on misinformation ecosystems by focusing on poorly understood small players, or "micro-influencers", who create misinformation in peer-to-peer networks. We detail four key tactics that micro-influencers use. First, they typically disseminate "gray area" content rather than expert-falsified claims, using subtle aesthetic and rhetorical tactics to evade moderation. Second, they post in small, closed groups where members feel safe and predisposed to trust content. Third, they explicitly target misinformation consumers' emotional and social needs. Finally, they post a high volume of short, repetitive content to plant seeds of doubt and build trust in influencers as unofficial experts. We discuss the implications these micro-influencers have for misinformation interventions and platforms' efforts to moderate misinformation.

Submitted 9 January, 2024; v1 submitted 4 August, 2023; originally announced August 2023.

Comments: 30 pages, 13 figures, 2 tables

arXiv:2102.02409 [pdf, other]

Variational Inference for Deblending Crowded Starfields

Authors: Runjing Liu, Jon D. McAuliffe, Jeffrey Regier

Abstract: In images collected by astronomical surveys, stars and galaxies often overlap visually. Deblending is the task of distinguishing and characterizing individual light sources in survey images. We propose StarNet, a Bayesian method to deblend sources in astronomical images of crowded star fields. StarNet leverages recent advances in variational inference, including amortized variational distributions and an optimization objective targeting an expectation of the forward KL divergence. In our experiments with SDSS images of the M2 globular cluster, StarNet is substantially more accurate than two competing methods: Probabilistic Cataloging (PCAT), a method that uses MCMC for inference, and DAOPHOT, a software pipeline employed by SDSS for deblending. In addition, the amortized approach to inference gives StarNet the scaling characteristics necessary to perform Bayesian inference on modern astronomical surveys.

Submitted 28 August, 2023; v1 submitted 3 February, 2021; originally announced February 2021.

Journal ref: Journal of Machine Learning Research, volume 24, 2023

arXiv:1810.08240 [pdf, other]

doi 10.1214/20-AOS1991

Time-uniform, nonparametric, nonasymptotic confidence sequences

Authors: Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, Jasjeet Sekhon

Abstract: A confidence sequence is a sequence of confidence intervals that is uniformly valid over an unbounded time horizon. Our work develops confidence sequences whose widths go to zero, with nonasymptotic coverage guarantees under nonparametric conditions. We draw connections between the Cramér-Chernoff method for exponential concentration, the law of the iterated logarithm (LIL), and the sequential probability ratio test -- our confidence sequences are time-uniform extensions of the first; provide tight, nonasymptotic characterizations of the second; and generalize the third to nonparametric settings, including sub-Gaussian and Bernstein conditions, self-normalized processes, and matrix martingales. We illustrate the generality of our proof techniques by deriving an empirical-Bernstein bound growing at a LIL rate, as well as a novel upper LIL for the maximum eigenvalue of a sum of random matrices. Finally, we apply our methods to covariance matrix estimation and to estimation of sample average treatment effect under the Neyman-Rubin potential outcomes model.

Submitted 6 August, 2022; v1 submitted 18 October, 2018; originally announced October 2018.

Comments: 48 pages, 10 figures

Journal ref: Ann. Statist. 49(2): 1055-1080 (April 2021)

arXiv:1810.04777 [pdf, other]

Rao-Blackwellized Stochastic Gradients for Discrete Distributions

Authors: Runjing Liu, Jeffrey Regier, Nilesh Tripuraneni, Michael I. Jordan, Jon McAuliffe

Abstract: We wish to compute the gradient of an expectation over a finite or countably infinite sample space having $K \leq \infty$ categories. When $K$ is indeed infinite, or finite but very large, the relevant summation is intractable. Accordingly, various stochastic gradient estimators have been proposed. In this paper, we describe a technique that can be applied to reduce the variance of any such estimator, without changing its bias---in particular, unbiasedness is retained. We show that our technique is an instance of Rao-Blackwellization, and we demonstrate the improvement it yields on a semi-supervised classification problem and a pixel attention task.

Submitted 13 May, 2019; v1 submitted 10 October, 2018; originally announced October 2018.

Comments: Accepted to ICML 2019

arXiv:1808.03204 [pdf, other]

Time-uniform Chernoff bounds via nonnegative supermartingales

Authors: Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, Jasjeet Sekhon

Abstract: We develop a class of exponential bounds for the probability that a martingale sequence crosses a time-dependent linear threshold. Our key insight is that it is both natural and fruitful to formulate exponential concentration inequalities in this way. We illustrate this point by presenting a single assumption and theorem that together unify and strengthen many tail bounds for martingales, including classical inequalities (1960-80) by Bernstein, Bennett, Hoeffding, and Freedman; contemporary inequalities (1980-2000) by Shorack and Wellner, Pinelis, Blackwell, van de Geer, and de la Peña; and several modern inequalities (post-2000) by Khan, Tropp, Bercu and Touati, Delyon, and others. In each of these cases, we give the strongest and most general statements to date, quantifying the time-uniform concentration of scalar, matrix, and Banach-space-valued martingales, under a variety of nonparametric assumptions in discrete and continuous time. In doing so, we bridge the gap between existing line-crossing inequalities, the sequential probability ratio test, the Cramér-Chernoff method, self-normalized processes, and other parts of the literature.

Submitted 12 May, 2020; v1 submitted 9 August, 2018; originally announced August 2018.

Comments: 63 pages, 7 figures, to appear in Probability Surveys

MSC Class: 60E15; 60G17 (Primary) 60F10; 60B20 (Secondary)

arXiv:1803.00113 [pdf, other]

Approximate Inference for Constructing Astronomical Catalogs from Images

Authors: Jeffrey Regier, Andrew C. Miller, David Schlegel, Ryan P. Adams, Jon D. McAuliffe, Prabhat

Abstract: We present a new, fully generative model for constructing astronomical catalogs from optical telescope image sets. Each pixel intensity is treated as a random variable with parameters that depend on the latent properties of stars and galaxies. These latent properties are themselves modeled as random. We compare two procedures for posterior inference. One procedure is based on Markov chain Monte Carlo (MCMC) while the other is based on variational inference (VI). The MCMC procedure excels at quantifying uncertainty, while the VI procedure is 1000 times faster. On a supercomputer, the VI procedure efficiently uses 665,000 CPU cores to construct an astronomical catalog from 50 terabytes of images in 14.6 minutes, demonstrating the scaling characteristics necessary to construct catalogs for upcoming astronomical surveys.

Submitted 9 April, 2019; v1 submitted 28 February, 2018; originally announced March 2018.

Comments: accepted to the Annals of Applied Statistics

MSC Class: 62P35 ACM Class: G.3

arXiv:1801.10277 [pdf, other]

Cataloging the Visible Universe through Bayesian Inference at Petascale

Authors: Jeffrey Regier, Kiran Pamnany, Keno Fischer, Andreas Noack, Maximilian Lam, Jarrett Revels, Steve Howard, Ryan Giordano, David Schlegel, Jon McAuliffe, Rollin Thomas, Prabhat

Abstract: Astronomical catalogs derived from wide-field imaging surveys are an important tool for understanding the Universe. We construct an astronomical catalog from 55 TB of imaging data using Celeste, a Bayesian variational inference code written entirely in the high-productivity programming language Julia. Using over 1.3 million threads on 650,000 Intel Xeon Phi cores of the Cori Phase II supercomputer, Celeste achieves a peak rate of 1.54 DP PFLOP/s. Celeste is able to jointly optimize parameters for 188M stars and galaxies, loading and processing 178 TB across 8192 nodes in 14.6 minutes. To achieve this, Celeste exploits parallelism at multiple levels (cluster, node, and thread) and accelerates I/O through Cori's Burst Buffer. Julia's native performance enables Celeste to employ high-level constructs without resorting to hand-written or generated low-level code (C/C++/Fortran), and yet achieve petascale performance.

Submitted 30 January, 2018; originally announced January 2018.

Comments: accepted to IPDPS 2018

MSC Class: 85A35; 68W10; 62P35 ACM Class: J.2; D.1.3; G.3; I.2; D.2

arXiv:1711.08063 [pdf]

Clonal analysis of newborn hippocampal dentate granule cell proliferation and development in temporal lobe epilepsy

Authors: Shatrunjai P. Singh, Candi L. LaSarge, Amen An, John J. McAuliffe, Steve C. Danzer

Abstract: Hippocampal dentate granule cells are among the few neuronal cell types generated throughout adult life in mammals. In the normal brain, new granule cells are generated from progenitors in the subgranular zone and integrate in a typical fashion. During the development of epilepsy, granule cell integration is profoundly altered. The new cells migrate to ectopic locations and develop misoriented basal dendrites. Although it has been established that these abnormal cells are newly generated, it is not known whether they arise ubiquitously throughout the progenitor cell pool or are derived from a smaller number of bad actor progenitors. To explore this question, we conducted a clonal analysis study in mice expressing the Brainbow fluorescent protein reporter construct in dentate granule cell progenitors. Mice were examined 2 months after pilocarpine-induced status epilepticus, a treatment that leads to the development of epilepsy. Brain sections were rendered translucent so that entire hippocampi could be reconstructed and all fluorescently labeled cells identified. Our findings reveal that a small number of progenitors produce the majority of ectopic cells following status epilepticus, indicating that either the affected progenitors or their local microenvironments have become pathological. By contrast, granule cells with basal dendrites were equally distributed among clonal groups. This indicates that these progenitors can produce normal cells and suggests that global factors sporadically disrupt the dendritic development of some new cells. Together, these findings strongly predict that distinct mechanisms regulate different aspects

Submitted 21 November, 2017; originally announced November 2017.

Comments: 44 pages, 6 figures

Journal ref: eNeuro. 2015;2(6):ENEURO.0087-15.2015. doi:10.1523/ENEURO.0087-15.2015

arXiv:1706.02375 [pdf, other]

Fast Black-box Variational Inference through Stochastic Trust-Region Optimization

Authors: Jeffrey Regier, Michael I. Jordan, Jon McAuliffe

Abstract: We introduce TrustVI, a fast second-order algorithm for black-box variational inference based on trust-region optimization and the reparameterization trick. At each iteration, TrustVI proposes and assesses a step based on minibatches of draws from the variational distribution. The algorithm provably converges to a stationary point. We implemented TrustVI in the Stan framework and compared it to two alternatives: Automatic Differentiation Variational Inference (ADVI) and Hessian-free Stochastic Gradient Variational Inference (HFSGVI). The former is based on stochastic first-order optimization. The latter uses second-order information, but lacks convergence guarantees. TrustVI typically converged at least one order of magnitude faster than ADVI, demonstrating the value of stochastic second-order information. TrustVI often found substantially better variational distributions than HFSGVI, demonstrating that our convergence theory can matter in practice.

Submitted 4 November, 2017; v1 submitted 7 June, 2017; originally announced June 2017.

Comments: NIPS 2017 camera-ready

MSC Class: 62F15 ACM Class: G.3

arXiv:1611.03404 [pdf, other]

Learning an Astronomical Catalog of the Visible Universe through Scalable Bayesian Inference

Authors: Jeffrey Regier, Kiran Pamnany, Ryan Giordano, Rollin Thomas, David Schlegel, Jon McAuliffe, Prabhat

Abstract: Celeste is a procedure for inferring astronomical catalogs that attains state-of-the-art scientific results. To date, Celeste has been scaled to at most hundreds of megabytes of astronomical images: Bayesian posterior inference is notoriously demanding computationally. In this paper, we report on a scalable, parallel version of Celeste, suitable for learning catalogs from modern large-scale astronomical datasets. Our algorithmic innovations include a fast numerical optimization routine for Bayesian posterior inference and a statistically efficient scheme for decomposing astronomical optimization problems into subproblems. Our scalable implementation is written entirely in Julia, a new high-level dynamic programming language designed for scientific and numerical computing. We use Julia's high-level constructs for shared and distributed memory parallelism, and demonstrate effective load balancing and efficient scaling on up to 8192 Xeon cores on the NERSC Cori supercomputer.

Submitted 10 November, 2016; originally announced November 2016.

Comments: submitting to IPDPS'17

MSC Class: 85A35 (Primary); 68W10; 62P35 ACM Class: J.2; D.1.3; G.3; I.2; D.2

arXiv:1601.00670 [pdf, other]

doi 10.1080/01621459.2017.1285773

Variational Inference: A Review for Statisticians

Authors: David M. Blei, Alp Kucukelbir, Jon D. McAuliffe

Abstract: One of the core problems of modern statistics is to approximate difficult-to-compute probability densities. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation involving the posterior density. In this paper, we review variational inference (VI), a method from machine learning that approximates probability densities through optimization. VI has been used in many applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling. The idea behind VI is to first posit a family of densities and then to find the member of that family which is close to the target. Closeness is measured by Kullback-Leibler divergence. We review the ideas behind mean-field variational inference, discuss the special case of VI applied to exponential family models, present a full example with a Bayesian mixture of Gaussians, and derive a variant that uses stochastic optimization to scale up to massive data. We discuss modern research in VI and highlight important open problems. VI is powerful, but it is not yet well understood. Our hope in writing this paper is to catalyze statistical research on this class of algorithms.

Submitted 9 May, 2018; v1 submitted 4 January, 2016; originally announced January 2016.

Journal ref: Journal of the American Statistical Association, Vol. 112, Iss. 518, 2017

arXiv:1506.01351 [pdf]

Celeste: Variational inference for a generative model of astronomical images

Authors: Jeffrey Regier, Andrew Miller, Jon McAuliffe, Ryan Adams, Matt Hoffman, Dustin Lang, David Schlegel, Prabhat

Abstract: We present a new, fully generative model of optical telescope image sets, along with a variational procedure for inference. Each pixel intensity is treated as a Poisson random variable, with a rate parameter dependent on latent properties of stars and galaxies. Key latent properties are themselves random, with scientific prior distributions constructed from large ancillary data sets. We check our approach on synthetic images. We also run it on images from a major sky survey, where it exceeds the performance of the current state-of-the-art method for locating celestial bodies and measuring their colors.

Submitted 3 June, 2015; originally announced June 2015.

Comments: in the Proceedings of the 32nd International Conference on Machine Learning (2015)

MSC Class: 62P35; 85A35; 68T01 ACM Class: G.3

arXiv:1003.0783 [pdf, other]

Supervised Topic Models

Authors: David M. Blei, Jon D. McAuliffe

Abstract: We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive an approximate maximum-likelihood procedure for parameter estimation, which relies on variational methods to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and the political tone of amendments in the U.S. Senate based on the amendment text. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression.

Submitted 3 March, 2010; originally announced March 2010.

arXiv:0712.2526 [pdf, other]

doi 10.1198/jasa.2009.tm08030

Variational inference for large-scale models of discrete choice

Authors: Michael Braun, Jon McAuliffe

Abstract: Discrete choice models are commonly used by applied statisticians in numerous fields, such as marketing, economics, finance, and operations research. When agents in discrete choice models are assumed to have differing preferences, exact inference is often intractable. Markov chain Monte Carlo techniques make approximate inference possible, but the computational cost is prohibitive on the large data sets now becoming routinely available. Variational methods provide a deterministic alternative for approximation of the posterior distribution. We derive variational procedures for empirical Bayes and fully Bayesian inference in the mixed multinomial logit model of discrete choice. The algorithms require only that we solve a sequence of unconstrained optimization problems, which are shown to be convex. Extensive simulations demonstrate that variational methods achieve accuracy competitive with Markov chain Monte Carlo, at a small fraction of the computational cost. Thus, variational methods permit inferences on data sets that otherwise could not be analyzed without bias-inducing modifications to the underlying model.

Submitted 15 January, 2008; v1 submitted 15 December, 2007; originally announced December 2007.

Comments: 29 pages, 2 tables, 2 figures

Journal ref: Journal of the American Statistical Association (2010) 105(489): 324-334

arXiv:math/0612821 [pdf, ps, other]

doi 10.1214/088342306000000475

Comment on "Support Vector Machines with Applications"

Authors: Peter L. Bartlett, Michael I. Jordan, Jon D. McAuliffe

Abstract: Comment on "Support Vector Machines with Applications" [math.ST/0612817]

Submitted 28 December, 2006; originally announced December 2006.

Comments: Published at http://dx.doi.org/10.1214/088342306000000475 in the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-STS-STS153C

Journal ref: Statistical Science 2006, Vol. 21, No. 3, 341-346

arXiv:q-bio/0412012 [pdf, ps, other]

Subtree power analysis finds optimal species for comparative genomics

Authors: Jon D. McAuliffe, Michael I. Jordan, Lior Pachter

Abstract: Sequence comparison across multiple organisms aids in the detection of regions under selection. However, resource limitations require a prioritization of genomes to be sequenced. This prioritization should be grounded in two considerations: the lineal scope encompassing the biological phenomena of interest, and the optimal species within that scope for detecting functional elements. We introduce a statistical framework for optimal species subset selection, based on maximizing power to detect conserved sites. In a study of vertebrate species, we show that the optimal species subset is not in general the most evolutionarily diverged subset. Our results suggest that marsupials are prime sequencing candidates.

Submitted 6 December, 2004; originally announced December 2004.

Comments: 16 pages, 3 figures, 3 tables

Report number: UCB-Stat-TR-677

Showing 1–16 of 16 results for author: McAuliffe, J