
A Bayesian factor analysis model for high-dimensional microbiome count data

Authors: Ismaïla Ba, Maxime Turgeon, Simona Veniamin, Juan Joel, Richard Miller, Morag Graham, Christine Bonner, Charles N. Bernstein, Douglas L. Arnold, Amit Bar-Or, Ruth Ann Marrie, Julia O'Mahony, E. Ann Yeh, Brenda Banwell, Emmanuelle Waubant, Natalie Knox, Gary Van Domselaar, Ali I. Mirza, Heather Armstrong, Saman Muthukumarana, Kevin McGregor

Abstract: Dimension reduction techniques are among the most essential analytical tools in the analysis of high-dimensional data. Generalized principal component analysis (PCA) is an extension to standard PCA that has been widely used to identify low-dimensional features in high-dimensional discrete data, such as binary, multi-category and count data. For microbiome count data in particular, the multinomial PCA is a natural counterpart of the standard PCA. However, this technique fails to account for the excessive number of zero values, which is frequently observed in microbiome count data. To allow for sparsity, zero-inflated multivariate distributions can be used. We propose a zero-inflated probabilistic PCA model for latent factor analysis. The proposed model is a fully Bayesian factor analysis technique that is appropriate for microbiome count data analysis. In addition, we use the mean-field-type variational family to approximate the marginal likelihood and develop a classification variational approximation algorithm to fit the model. We demonstrate the efficiency of our procedure for predictions based on the latent factors and the model parameters through simulation experiments, showcasing its superiority over competing methods. This efficiency is further illustrated with two real microbiome count datasets. The method is implemented in R.

Submitted 23 April, 2024; v1 submitted 3 April, 2024; originally announced April 2024.

Comments: 2 figures, 3 tables

arXiv:2310.04153 [pdf, other]

Fair coins tend to land on the same side they started: Evidence from 350,757 flips

Authors: František Bartoš, Alexandra Sarafoglou, Henrik R. Godmann, Amir Sahrani, David Klein Leunk, Pierre Y. Gui, David Voss, Kaleem Ullah, Malte J. Zoubek, Franziska Nippold, Frederik Aust, Felipe F. Vieira, Chris-Gabriel Islam, Anton J. Zoubek, Sara Shabani, Jonas Petter, Ingeborg B. Roos, Adam Finnemann, Aaron B. Lob, Madlen F. Hoffstadt, Jason Nak, Jill de Ron, Koen Derks, Karoline Huth, Sjoerd Terpstra, et al. (25 additional authors not shown)

Abstract: Many people have flipped coins but few have stopped to ponder the statistical and physical intricacies of the process. In a preregistered study we collected $350{,}757$ coin flips to test the counterintuitive prediction from a physics model of human coin tossing developed by Diaconis, Holmes, and Montgomery (DHM; 2007). The model asserts that when people flip an ordinary coin, it tends to land on the same side it started -- DHM estimated the probability of a same-side outcome to be about 51%. Our data lend strong support to this precise prediction: the coins landed on the same side more often than not, $\text{Pr}(\text{same side}) = 0.508$, 95% credible interval (CI) [$0.506$, $0.509$], $\text{BF}_{\text{same-side bias}} = 2359$. Furthermore, the data revealed considerable between-people variation in the degree of this same-side bias. Our data also confirmed the generic prediction that when people flip an ordinary coin -- with the initial side-up randomly determined -- it is equally likely to land heads or tails: $\text{Pr}(\text{heads}) = 0.500$, 95% CI [$0.498$, $0.502$], $\text{BF}_{\text{heads-tails bias}} = 0.182$. Furthermore, this lack of heads-tails bias does not appear to vary across coins. Additional exploratory analyses revealed that the within-people same-side bias decreased as more coins were flipped, an effect that is consistent with the possibility that practice makes people flip coins in a less wobbly fashion. Our data therefore provide strong evidence that when some (but not all) people flip a fair coin, it tends to land on the same side it started. Our data provide compelling statistical support for the DHM physics model of coin tossing.

Submitted 2 June, 2024; v1 submitted 6 October, 2023; originally announced October 2023.

arXiv:1810.02689 [pdf, other]

Hows and Whys of Artificial Intelligence for Public Sector Decisions: Explanation and Evaluation

Authors: Alun Preece, Rob Ashelford, Harry Armstrong, Dave Braines

Abstract: Evaluation has always been a key challenge in the development of artificial intelligence (AI) based software, due to the technical complexity of the software artifact and, often, its embedding in complex sociotechnical processes. Recent advances in machine learning (ML) enabled by deep neural networks have exacerbated the challenge of evaluating such software due to the opaque nature of these ML-based artifacts. A key related issue is the (in)ability of such systems to generate useful explanations of their outputs, and we argue that the explanation and evaluation problems are closely linked. The paper models the elements of an ML-based AI system in the context of public sector decision (PSD) applications involving both artificial and human intelligence, and maps these elements against issues in both evaluation and explanation, showing how the two are related. We consider a number of common PSD application patterns in the light of our model, and identify a set of key issues connected to explanation and evaluation in each case. Finally, we propose multiple strategies to promote wider adoption of AI/ML technologies in PSD, where each is distinguished by a focus on different elements of our model, allowing PSD policy makers to adopt an approach that best fits their context and concerns.

Submitted 19 October, 2018; v1 submitted 28 September, 2018; originally announced October 2018.

Comments: Presented at AAAI FSS-18: Artificial Intelligence in Government and Public Sector, Arlington, Virginia, USA; corrected typos in this version

arXiv:0712.0776 [pdf, ps, other]

Identification of RR Lyrae Variables in SDSS from Single-Epoch Photometric and Spectroscopic Observations

Authors: Ronald Wilhelm, W. Lee Powell Jr., Timothy C. Beers, Branimir Sesar, Carlos Alende Prieto, Kenneth W. Carrell, Young Sun Lee, Brian Yanny, Constance M. Rockosi, Nathan De Lee, Gwen Hansford Armstrong, Stephen J. Torrence

Abstract: We describe a new RR Lyrae identification technique based on out-of-phase single-epoch photometric and spectroscopic observations contained in SDSS Data Release 6 (DR-6). This technique detects variability by exploiting the large disparity between the g-r color and the strength of the hydrogen Balmer lines when the two observations are made at random phases. Comparison with a large sample of known variables in the SDSS equatorial stripe (Stripe 82) shows that the discovery efficiency for our technique is ~85%. Analysis of stars with multiple spectroscopic observations suggests a similar efficiency throughout the entire DR-6 sample. We also develop a technique to estimate the average g apparent magnitude (over the pulsation cycle) for individual RR Lyrae stars, using the <g-r> for the entire sample and measured colors for each star. The resulting distances are found to have precisions of ~14%. Finally, we explore the properties of our DR-6 sample of N = 1087 variables, and recover portions of the Sagittarius Northern and Southern Stream. Analysis of the distance and velocity for the Southern Stream is consistent with previously published data for blue horizontal-branch stars. In a sample near the North Galactic Polar Cap, we find evidence for the descending leading Northern arm, and a possible detection of the trailing arm.

Submitted 5 December, 2007; originally announced December 2007.

Comments: 59 pages, 17 figures, 8 tables

arXiv:0706.1287 [pdf, ps, other]

Bayesian Covariance Matrix Estimation using a Mixture of Decomposable Graphical Models

Authors: Helen Armstrong, Christopher K. Carter, Kevin F. Wong, Robert Kohn

Abstract: A Bayesian approach is used to estimate the covariance matrix of Gaussian data. Ideas from Gaussian graphical models and model selection are used to construct a prior for the covariance matrix that is a mixture over all decomposable graphs. For this prior the probability of each graph size is specified by the user and graphs of equal size are assigned equal probability. Most previous approaches assume that all graphs are equally probable. We show empirically that the prior that assigns equal probability over graph sizes outperforms the prior that assigns equal probability over all graphs, both in identifying the correct decomposable graph and in more efficiently estimating the covariance matrix.

Submitted 9 June, 2007; originally announced June 2007.

Comments: 28 pages and 11 figures

arXiv:math/0701937 [pdf, ps, other]

A presentation for Aut(F_n)

Authors: Heather Armstrong, Bradley Forrest, Karen Vogtmann

Abstract: We study the action of the group Aut(F_n) of automorphisms of a finitely generated free group on the degree 2 subcomplex of the spine of Auter space. Hatcher and Vogtmann showed that this subcomplex is simply connected, and we use the method described by K. S. Brown to deduce a new presentation of Aut(F_n).

Submitted 5 September, 2007; v1 submitted 31 January, 2007; originally announced January 2007.

Comments: Missing relation added. Final version to appear in Journal of Group Theory

MSC Class: 20F05; 20F28

Showing 1–6 of 6 results for author: Armstrong, H