-
Sowing 'Seeds of Doubt': Cottage Industries of Election and Medical Misinformation in Brazil and the United States
Authors:
Amelia Hassoun,
Gabrielle Borenstein,
Beth Goldberg,
Jacob McAuliffe,
Katy Osborn
Abstract:
We conducted ethnographic research with 31 misinformation creators and consumers in Brazil and the US before, during, and after a major election to understand the consumption and production of election and medical misinformation. This study contributes to research on misinformation ecosystems by focusing on poorly understood small players, or "micro-influencers", who create misinformation in peer-…
▽ More
We conducted ethnographic research with 31 misinformation creators and consumers in Brazil and the US before, during, and after a major election to understand the consumption and production of election and medical misinformation. This study contributes to research on misinformation ecosystems by focusing on poorly understood small players, or "micro-influencers", who create misinformation in peer-to-peer networks. We detail four key tactics that micro-influencers use. First, they typically disseminate "gray area" content rather than expert-falsified claims, using subtle aesthetic and rhetorical tactics to evade moderation. Second, they post in small, closed groups where members feel safe and predisposed to trust content. Third, they explicitly target misinformation consumers' emotional and social needs. Finally, they post a high volume of short, repetitive content to plant seeds of doubt and build trust in influencers as unofficial experts. We discuss the implications these micro-influencers have for misinformation interventions and platforms' efforts to moderate misinformation.
△ Less
Submitted 9 January, 2024; v1 submitted 4 August, 2023;
originally announced August 2023.
-
Variational Inference for Deblending Crowded Starfields
Authors:
Run**g Liu,
Jon D. McAuliffe,
Jeffrey Regier
Abstract:
In images collected by astronomical surveys, stars and galaxies often overlap visually. Deblending is the task of distinguishing and characterizing individual light sources in survey images. We propose StarNet, a Bayesian method to deblend sources in astronomical images of crowded star fields. StarNet leverages recent advances in variational inference, including amortized variational distributions…
▽ More
In images collected by astronomical surveys, stars and galaxies often overlap visually. Deblending is the task of distinguishing and characterizing individual light sources in survey images. We propose StarNet, a Bayesian method to deblend sources in astronomical images of crowded star fields. StarNet leverages recent advances in variational inference, including amortized variational distributions and an optimization objective targeting an expectation of the forward KL divergence. In our experiments with SDSS images of the M2 globular cluster, StarNet is substantially more accurate than two competing methods: Probabilistic Cataloging (PCAT), a method that uses MCMC for inference, and DAOPHOT, a software pipeline employed by SDSS for deblending. In addition, the amortized approach to inference gives StarNet the scaling characteristics necessary to perform Bayesian inference on modern astronomical surveys.
△ Less
Submitted 28 August, 2023; v1 submitted 3 February, 2021;
originally announced February 2021.
-
Time-uniform, nonparametric, nonasymptotic confidence sequences
Authors:
Steven R. Howard,
Aaditya Ramdas,
Jon McAuliffe,
Jasjeet Sekhon
Abstract:
A confidence sequence is a sequence of confidence intervals that is uniformly valid over an unbounded time horizon. Our work develops confidence sequences whose widths go to zero, with nonasymptotic coverage guarantees under nonparametric conditions. We draw connections between the Cramér-Chernoff method for exponential concentration, the law of the iterated logarithm (LIL), and the sequential pro…
▽ More
A confidence sequence is a sequence of confidence intervals that is uniformly valid over an unbounded time horizon. Our work develops confidence sequences whose widths go to zero, with nonasymptotic coverage guarantees under nonparametric conditions. We draw connections between the Cramér-Chernoff method for exponential concentration, the law of the iterated logarithm (LIL), and the sequential probability ratio test -- our confidence sequences are time-uniform extensions of the first; provide tight, nonasymptotic characterizations of the second; and generalize the third to nonparametric settings, including sub-Gaussian and Bernstein conditions, self-normalized processes, and matrix martingales. We illustrate the generality of our proof techniques by deriving an empirical-Bernstein bound growing at a LIL rate, as well as a novel upper LIL for the maximum eigenvalue of a sum of random matrices. Finally, we apply our methods to covariance matrix estimation and to estimation of sample average treatment effect under the Neyman-Rubin potential outcomes model.
△ Less
Submitted 6 August, 2022; v1 submitted 18 October, 2018;
originally announced October 2018.
-
Rao-Blackwellized Stochastic Gradients for Discrete Distributions
Authors:
Run**g Liu,
Jeffrey Regier,
Nilesh Tripuraneni,
Michael I. Jordan,
Jon McAuliffe
Abstract:
We wish to compute the gradient of an expectation over a finite or countably infinite sample space having $K \leq \infty$ categories. When $K$ is indeed infinite, or finite but very large, the relevant summation is intractable. Accordingly, various stochastic gradient estimators have been proposed. In this paper, we describe a technique that can be applied to reduce the variance of any such estima…
▽ More
We wish to compute the gradient of an expectation over a finite or countably infinite sample space having $K \leq \infty$ categories. When $K$ is indeed infinite, or finite but very large, the relevant summation is intractable. Accordingly, various stochastic gradient estimators have been proposed. In this paper, we describe a technique that can be applied to reduce the variance of any such estimator, without changing its bias---in particular, unbiasedness is retained. We show that our technique is an instance of Rao-Blackwellization, and we demonstrate the improvement it yields on a semi-supervised classification problem and a pixel attention task.
△ Less
Submitted 13 May, 2019; v1 submitted 10 October, 2018;
originally announced October 2018.
-
Time-uniform Chernoff bounds via nonnegative supermartingales
Authors:
Steven R. Howard,
Aaditya Ramdas,
Jon McAuliffe,
Jasjeet Sekhon
Abstract:
We develop a class of exponential bounds for the probability that a martingale sequence crosses a time-dependent linear threshold. Our key insight is that it is both natural and fruitful to formulate exponential concentration inequalities in this way. We illustrate this point by presenting a single assumption and theorem that together unify and strengthen many tail bounds for martingales, includin…
▽ More
We develop a class of exponential bounds for the probability that a martingale sequence crosses a time-dependent linear threshold. Our key insight is that it is both natural and fruitful to formulate exponential concentration inequalities in this way. We illustrate this point by presenting a single assumption and theorem that together unify and strengthen many tail bounds for martingales, including classical inequalities (1960-80) by Bernstein, Bennett, Hoeffding, and Freedman; contemporary inequalities (1980-2000) by Shorack and Wellner, Pinelis, Blackwell, van de Geer, and de la Peña; and several modern inequalities (post-2000) by Khan, Tropp, Bercu and Touati, Delyon, and others. In each of these cases, we give the strongest and most general statements to date, quantifying the time-uniform concentration of scalar, matrix, and Banach-space-valued martingales, under a variety of nonparametric assumptions in discrete and continuous time. In doing so, we bridge the gap between existing line-crossing inequalities, the sequential probability ratio test, the Cramér-Chernoff method, self-normalized processes, and other parts of the literature.
△ Less
Submitted 12 May, 2020; v1 submitted 9 August, 2018;
originally announced August 2018.
-
Approximate Inference for Constructing Astronomical Catalogs from Images
Authors:
Jeffrey Regier,
Andrew C. Miller,
David Schlegel,
Ryan P. Adams,
Jon D. McAuliffe,
Prabhat
Abstract:
We present a new, fully generative model for constructing astronomical catalogs from optical telescope image sets. Each pixel intensity is treated as a random variable with parameters that depend on the latent properties of stars and galaxies. These latent properties are themselves modeled as random. We compare two procedures for posterior inference. One procedure is based on Markov chain Monte Ca…
▽ More
We present a new, fully generative model for constructing astronomical catalogs from optical telescope image sets. Each pixel intensity is treated as a random variable with parameters that depend on the latent properties of stars and galaxies. These latent properties are themselves modeled as random. We compare two procedures for posterior inference. One procedure is based on Markov chain Monte Carlo (MCMC) while the other is based on variational inference (VI). The MCMC procedure excels at quantifying uncertainty, while the VI procedure is 1000 times faster. On a supercomputer, the VI procedure efficiently uses 665,000 CPU cores to construct an astronomical catalog from 50 terabytes of images in 14.6 minutes, demonstrating the scaling characteristics necessary to construct catalogs for upcoming astronomical surveys.
△ Less
Submitted 9 April, 2019; v1 submitted 28 February, 2018;
originally announced March 2018.
-
Cataloging the Visible Universe through Bayesian Inference at Petascale
Authors:
Jeffrey Regier,
Kiran Pamnany,
Keno Fischer,
Andreas Noack,
Maximilian Lam,
Jarrett Revels,
Steve Howard,
Ryan Giordano,
David Schlegel,
Jon McAuliffe,
Rollin Thomas,
Prabhat
Abstract:
Astronomical catalogs derived from wide-field imaging surveys are an important tool for understanding the Universe. We construct an astronomical catalog from 55 TB of imaging data using Celeste, a Bayesian variational inference code written entirely in the high-productivity programming language Julia. Using over 1.3 million threads on 650,000 Intel Xeon Phi cores of the Cori Phase II supercomputer…
▽ More
Astronomical catalogs derived from wide-field imaging surveys are an important tool for understanding the Universe. We construct an astronomical catalog from 55 TB of imaging data using Celeste, a Bayesian variational inference code written entirely in the high-productivity programming language Julia. Using over 1.3 million threads on 650,000 Intel Xeon Phi cores of the Cori Phase II supercomputer, Celeste achieves a peak rate of 1.54 DP PFLOP/s. Celeste is able to jointly optimize parameters for 188M stars and galaxies, loading and processing 178 TB across 8192 nodes in 14.6 minutes. To achieve this, Celeste exploits parallelism at multiple levels (cluster, node, and thread) and accelerates I/O through Cori's Burst Buffer. Julia's native performance enables Celeste to employ high-level constructs without resorting to hand-written or generated low-level code (C/C++/Fortran), and yet achieve petascale performance.
△ Less
Submitted 30 January, 2018;
originally announced January 2018.
-
Clonal analysis of newborn hippocampal dentate granule cell proliferation and development in temporal lobe epilepsy
Authors:
Shatrunjai P. Singh,
Candi L. LaSarge,
Amen An,
John J. McAuliffe,
Steve C. Danzer
Abstract:
Hippocampal dentate granule cells are among the few neuronal cell types generated throughout adult life in mammals. In the normal brain, new granule cells are generated from progenitors in the subgranular zone and integrate in a typical fashion. During the development of epilepsy, granule cell integration is profoundly altered. The new cells migrate to ectopic locations and develop misoriented bas…
▽ More
Hippocampal dentate granule cells are among the few neuronal cell types generated throughout adult life in mammals. In the normal brain, new granule cells are generated from progenitors in the subgranular zone and integrate in a typical fashion. During the development of epilepsy, granule cell integration is profoundly altered. The new cells migrate to ectopic locations and develop misoriented basal dendrites. Although it has been established that these abnormal cells are newly generated, it is not known whether they arise ubiquitously throughout the progenitor cell pool or are derived from a smaller number of bad actor progenitors. To explore this question, we conducted a clonal analysis study in mice expressing the Brainbow fluorescent protein reporter construct in dentate granule cell progenitors. Mice were examined 2 months after pilocarpine-induced status epilepticus, a treatment that leads to the development of epilepsy. Brain sections were rendered translucent so that entire hippocampi could be reconstructed and all fluorescently labeled cells identified. Our findings reveal that a small number of progenitors produce the majority of ectopic cells following status epilepticus, indicating that either the affected progenitors or their local microenvironments have become pathological. By contrast, granule cells with basal dendrites were equally distributed among clonal groups. This indicates that these progenitors can produce normal cells and suggests that global factors sporadically disrupt the dendritic development of some new cells. Together, these findings strongly predict that distinct mechanisms regulate different aspects
△ Less
Submitted 21 November, 2017;
originally announced November 2017.
-
Fast Black-box Variational Inference through Stochastic Trust-Region Optimization
Authors:
Jeffrey Regier,
Michael I. Jordan,
Jon McAuliffe
Abstract:
We introduce TrustVI, a fast second-order algorithm for black-box variational inference based on trust-region optimization and the reparameterization trick. At each iteration, TrustVI proposes and assesses a step based on minibatches of draws from the variational distribution. The algorithm provably converges to a stationary point. We implemented TrustVI in the Stan framework and compared it to tw…
▽ More
We introduce TrustVI, a fast second-order algorithm for black-box variational inference based on trust-region optimization and the reparameterization trick. At each iteration, TrustVI proposes and assesses a step based on minibatches of draws from the variational distribution. The algorithm provably converges to a stationary point. We implemented TrustVI in the Stan framework and compared it to two alternatives: Automatic Differentiation Variational Inference (ADVI) and Hessian-free Stochastic Gradient Variational Inference (HFSGVI). The former is based on stochastic first-order optimization. The latter uses second-order information, but lacks convergence guarantees. TrustVI typically converged at least one order of magnitude faster than ADVI, demonstrating the value of stochastic second-order information. TrustVI often found substantially better variational distributions than HFSGVI, demonstrating that our convergence theory can matter in practice.
△ Less
Submitted 4 November, 2017; v1 submitted 7 June, 2017;
originally announced June 2017.
-
Learning an Astronomical Catalog of the Visible Universe through Scalable Bayesian Inference
Authors:
Jeffrey Regier,
Kiran Pamnany,
Ryan Giordano,
Rollin Thomas,
David Schlegel,
Jon McAuliffe,
Prabhat
Abstract:
Celeste is a procedure for inferring astronomical catalogs that attains state-of-the-art scientific results. To date, Celeste has been scaled to at most hundreds of megabytes of astronomical images: Bayesian posterior inference is notoriously demanding computationally. In this paper, we report on a scalable, parallel version of Celeste, suitable for learning catalogs from modern large-scale astron…
▽ More
Celeste is a procedure for inferring astronomical catalogs that attains state-of-the-art scientific results. To date, Celeste has been scaled to at most hundreds of megabytes of astronomical images: Bayesian posterior inference is notoriously demanding computationally. In this paper, we report on a scalable, parallel version of Celeste, suitable for learning catalogs from modern large-scale astronomical datasets. Our algorithmic innovations include a fast numerical optimization routine for Bayesian posterior inference and a statistically efficient scheme for decomposing astronomical optimization problems into subproblems.
Our scalable implementation is written entirely in Julia, a new high-level dynamic programming language designed for scientific and numerical computing. We use Julia's high-level constructs for shared and distributed memory parallelism, and demonstrate effective load balancing and efficient scaling on up to 8192 Xeon cores on the NERSC Cori supercomputer.
△ Less
Submitted 10 November, 2016;
originally announced November 2016.
-
Variational Inference: A Review for Statisticians
Authors:
David M. Blei,
Alp Kucukelbir,
Jon D. McAuliffe
Abstract:
One of the core problems of modern statistics is to approximate difficult-to-compute probability densities. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation involving the posterior density. In this paper, we review variational inference (VI), a method from machine learning that approximates probability densities throu…
▽ More
One of the core problems of modern statistics is to approximate difficult-to-compute probability densities. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation involving the posterior density. In this paper, we review variational inference (VI), a method from machine learning that approximates probability densities through optimization. VI has been used in many applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling. The idea behind VI is to first posit a family of densities and then to find the member of that family which is close to the target. Closeness is measured by Kullback-Leibler divergence. We review the ideas behind mean-field variational inference, discuss the special case of VI applied to exponential family models, present a full example with a Bayesian mixture of Gaussians, and derive a variant that uses stochastic optimization to scale up to massive data. We discuss modern research in VI and highlight important open problems. VI is powerful, but it is not yet well understood. Our hope in writing this paper is to catalyze statistical research on this class of algorithms.
△ Less
Submitted 9 May, 2018; v1 submitted 4 January, 2016;
originally announced January 2016.
-
Celeste: Variational inference for a generative model of astronomical images
Authors:
Jeffrey Regier,
Andrew Miller,
Jon McAuliffe,
Ryan Adams,
Matt Hoffman,
Dustin Lang,
David Schlegel,
Prabhat
Abstract:
We present a new, fully generative model of optical telescope image sets, along with a variational procedure for inference. Each pixel intensity is treated as a Poisson random variable, with a rate parameter dependent on latent properties of stars and galaxies. Key latent properties are themselves random, with scientific prior distributions constructed from large ancillary data sets. We check our…
▽ More
We present a new, fully generative model of optical telescope image sets, along with a variational procedure for inference. Each pixel intensity is treated as a Poisson random variable, with a rate parameter dependent on latent properties of stars and galaxies. Key latent properties are themselves random, with scientific prior distributions constructed from large ancillary data sets. We check our approach on synthetic images. We also run it on images from a major sky survey, where it exceeds the performance of the current state-of-the-art method for locating celestial bodies and measuring their colors.
△ Less
Submitted 3 June, 2015;
originally announced June 2015.
-
Supervised Topic Models
Authors:
David M. Blei,
Jon D. McAuliffe
Abstract:
We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive an approximate maximum-likelihood procedure for parameter estimation, which relies on variational methods to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict re…
▽ More
We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive an approximate maximum-likelihood procedure for parameter estimation, which relies on variational methods to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and the political tone of amendments in the U.S. Senate based on the amendment text. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression.
△ Less
Submitted 3 March, 2010;
originally announced March 2010.
-
Variational inference for large-scale models of discrete choice
Authors:
Michael Braun,
Jon McAuliffe
Abstract:
Discrete choice models are commonly used by applied statisticians in numerous fields, such as marketing, economics, finance, and operations research. When agents in discrete choice models are assumed to have differing preferences, exact inference is often intractable. Markov chain Monte Carlo techniques make approximate inference possible, but the computational cost is prohibitive on the large d…
▽ More
Discrete choice models are commonly used by applied statisticians in numerous fields, such as marketing, economics, finance, and operations research. When agents in discrete choice models are assumed to have differing preferences, exact inference is often intractable. Markov chain Monte Carlo techniques make approximate inference possible, but the computational cost is prohibitive on the large data sets now becoming routinely available. Variational methods provide a deterministic alternative for approximation of the posterior distribution. We derive variational procedures for empirical Bayes and fully Bayesian inference in the mixed multinomial logit model of discrete choice. The algorithms require only that we solve a sequence of unconstrained optimization problems, which are shown to be convex. Extensive simulations demonstrate that variational methods achieve accuracy competitive with Markov chain Monte Carlo, at a small fraction of the computational cost. Thus, variational methods permit inferences on data sets that otherwise could not be analyzed without bias-inducing modifications to the underlying model.
△ Less
Submitted 15 January, 2008; v1 submitted 15 December, 2007;
originally announced December 2007.
-
Comment on "Support Vector Machines with Applications"
Authors:
Peter L. Bartlett,
Michael I. Jordan,
Jon D. McAuliffe
Abstract:
Comment on "Support Vector Machines with Applications" [math.ST/0612817]
Comment on "Support Vector Machines with Applications" [math.ST/0612817]
△ Less
Submitted 28 December, 2006;
originally announced December 2006.
-
Subtree power analysis finds optimal species for comparative genomics
Authors:
Jon D. McAuliffe,
Michael I. Jordan,
Lior Pachter
Abstract:
Sequence comparison across multiple organisms aids in the detection of regions under selection. However, resource limitations require a prioritization of genomes to be sequenced. This prioritization should be grounded in two considerations: the lineal scope encompassing the biological phenomena of interest, and the optimal species within that scope for detecting functional elements. We introduce…
▽ More
Sequence comparison across multiple organisms aids in the detection of regions under selection. However, resource limitations require a prioritization of genomes to be sequenced. This prioritization should be grounded in two considerations: the lineal scope encompassing the biological phenomena of interest, and the optimal species within that scope for detecting functional elements. We introduce a statistical framework for optimal species subset selection, based on maximizing power to detect conserved sites. In a study of vertebrate species, we show that the optimal species subset is not in general the most evolutionarily diverged subset. Our results suggest that marsupials are prime sequencing candidates.
△ Less
Submitted 6 December, 2004;
originally announced December 2004.