Search | arXiv e-print repository

arXiv:2405.07879 [pdf, other]

On the Relation Between Autoencoders and Non-negative Matrix Factorization, and Their Application for Mutational Signature Extraction

Authors: Ida Egendal, Rasmus Froberg Brøndum, Marta Pelizzola, Asger Hobolth, Martin Bøgsted

Abstract: The aim of this study is to provide a foundation to understand the relationship between non-negative matrix factorization (NMF) and non-negative autoencoders enabling proper interpretation and understanding of autoencoder-based alternatives to NMF. Since its introduction, NMF has been a popular tool for extracting interpretable, low-dimensional representations of high-dimensional data. However, recently, several studies have proposed to replace NMF with autoencoders. This increasing popularity of autoencoders warrants an investigation on whether this replacement is in general valid and reasonable. Moreover, the exact relationship between non-negative autoencoders and NMF has not been thoroughly explored. Thus, a main aim of this study is to investigate in detail the relationship between non-negative autoencoders and NMF. We find that the connection between the two models can be established through convex NMF, which is a restricted case of NMF. In particular, convex NMF is a special case of an autoencoder. The performance of NMF and autoencoders is compared within the context of extraction of mutational signatures from cancer genomics data. We find that the reconstructions based on NMF are more accurate compared to autoencoders, while the signatures extracted using both methods show comparable consistencies and values when externally validated. These findings suggest that the non-negative autoencoders investigated in this article do not provide an improvement of NMF in the field of mutational signature extraction.

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2207.02677 [pdf, other]

A flexible model-based framework for robust estimation of mutational signatures

Authors: Ragnhild Laursen, Lasse Maretty, Asger Hobolth

Abstract: Somatic mutations in cancer can be viewed as a mixture distribution of several mutational signatures, which can be inferred using non-negative matrix factorization (NMF). Mutational signatures have previously been parametrized using either simple mono-nucleotide interaction models or general tri-nucleotide interaction models. We describe a flexible and novel framework for identifying biologically plausible parametrizations of mutational signatures, and in particular for estimating di-nucleotide interaction models. The estimation procedure is based on the expectation-maximization (EM) algorithm and regression in the log-linear quasi-Poisson model. We show that di-nucleotide interaction signatures are statistically stable and sufficiently complex to fit the mutational patterns. Di-nucleotide interaction signatures often strike the right balance between appropriately fitting the data and avoiding over-fitting. They provide a better fit to data and are biologically more plausible than mono-nucleotide interaction signatures, and the parametrization is more stable than the parameter-rich tri-nucleotide interaction signatures. We illustrate our framework on three data sets of somatic mutation counts from cancer patients.

Submitted 6 July, 2022; originally announced July 2022.

arXiv:2206.03257 [pdf, other]

Model selection and robust inference of mutational signatures using Negative Binomial non-negative matrix factorization

Authors: Marta Pelizzola, Ragnhild Laursen, Asger Hobolth

Abstract: The spectrum of mutations in a collection of cancer genomes can be described by a mixture of a few mutational signatures. The mutational signatures can be found using non-negative matrix factorization (NMF). To extract the mutational signatures we have to assume a distribution for the observed mutational counts and a number of mutational signatures. In most applications, the mutational counts are assumed to be Poisson distributed, and the rank is chosen by comparing the fit of several models with the same underlying distribution and different values for the rank using classical model selection procedures. However, the counts are often overdispersed, and thus the Negative Binomial distribution is more appropriate. We propose a Negative Binomial NMF with a patient specific dispersion parameter to capture the variation across patients. We also introduce a novel model selection procedure inspired by cross-validation to determine the number of signatures. Using simulations, we study the influence of the distributional assumption on our method together with other classical model selection procedures and we show that our model selection procedure is more robust at determining the correct number of signatures under model misspecification. We also show that our model selection procedure is more accurate than state-of-the-art methods for finding the true number of signatures. Other methods are highly overestimating the number of signatures when overdispersion is present.
We apply our proposed analysis on a wide range of simulated data and on two real data sets from breast and prostate cancer patients. The code for our model selection procedure and negative binomial NMF is available in the R package SigMoS and can be found at https://github.com/MartaPelizzola/SigMoS.

Submitted 1 November, 2022; v1 submitted 7 June, 2022; originally announced June 2022.
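
Note: the Poisson baseline that the abstract above contrasts with can be illustrated with a minimal sketch. It assumes the classical multiplicative updates for NMF under the generalised Kullback-Leibler divergence, which correspond to a Poisson likelihood. It is not the Negative Binomial model of the paper and not the SigMoS implementation; the function names (matmul, kl_divergence, poisson_nmf) and the toy count matrix are hypothetical.

```python
import math
import random


def matmul(A, B):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def kl_divergence(V, WH, eps=1e-12):
    """Generalised KL divergence D(V || WH), using the convention 0 * log 0 = 0."""
    total = 0.0
    for v_row, wh_row in zip(V, WH):
        for v, wh in zip(v_row, wh_row):
            if v > 0:
                total += v * math.log(v / (wh + eps))
            total += wh - v
    return total


def poisson_nmf(V, rank, n_iter=200, seed=0, eps=1e-12):
    """Factorise V (n x m, non-negative) as W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    # Strictly positive random start so the multiplicative updates are well defined.
    W = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n)]
    H = [[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(rank)]
    history = []
    for _ in range(n_iter):
        # Update H: H[k][j] *= sum_i W[i][k] * V[i][j] / (WH)[i][j] / sum_i W[i][k]
        WH = matmul(W, H)
        R = [[V[i][j] / (WH[i][j] + eps) for j in range(m)] for i in range(n)]
        for k in range(rank):
            col_sum = sum(W[i][k] for i in range(n)) + eps
            for j in range(m):
                H[k][j] *= sum(W[i][k] * R[i][j] for i in range(n)) / col_sum
        # Update W: W[i][k] *= sum_j H[k][j] * V[i][j] / (WH)[i][j] / sum_j H[k][j]
        WH = matmul(W, H)
        R = [[V[i][j] / (WH[i][j] + eps) for j in range(m)] for i in range(n)]
        for k in range(rank):
            row_sum = sum(H[k][j] for j in range(m)) + eps
            for i in range(n):
                W[i][k] *= sum(H[k][j] * R[i][j] for j in range(m)) / row_sum
        history.append(kl_divergence(V, matmul(W, H)))
    return W, H, history


# Toy count matrix (made-up numbers, for illustration only).
V_toy = [
    [4, 2, 0, 1],
    [2, 1, 3, 5],
    [6, 3, 3, 6],
    [0, 1, 4, 7],
]
```

Calling poisson_nmf(V_toy, rank=2) returns the two non-negative factors together with the divergence after each iteration, which can be inspected to check convergence.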

arXiv:2101.07526 [pdf, other]

A sampling algorithm to compute the set of feasible solutions for non-negative matrix factorization with an arbitrary rank

Authors: Ragnhild Laursen, Asger Hobolth

Abstract: Non-negative Matrix Factorization (NMF) is a useful method to extract features from multivariate data, but an important and sometimes neglected concern is that NMF can result in non-unique solutions. Often, there exists a Set of Feasible Solutions (SFS), which makes it more difficult to interpret the factorization. This problem is especially ignored in cancer genomics, where NMF is used to infer information about the mutational processes present in the evolution of cancer. In this paper the extent of non-uniqueness is investigated for two mutational count data sets, and a new sampling algorithm that can find the SFS is introduced. Our sampling algorithm is easy to implement and applies to an arbitrary rank of NMF. This is in contrast to the state of the art, where the NMF rank must be smaller than or equal to four. For lower ranks we show that our algorithm performs similarly to the polygon inflation algorithm that was developed in relation to chemometrics. Furthermore, we show how the size of the SFS can have a high influence on the apparent variability of a solution. Our sampling algorithm is implemented in the R package SFS (https://github.com/ragnhildlaursen/SFS).

Submitted 19 January, 2021; originally announced January 2021.

Comments: 18 pages, 8 figures, 1 algorithm

MSC Class: 15A23; 62P10; 62-04

arXiv:2101.04941 [pdf, other]

Multivariate phase-type theory for the site frequency spectrum

Authors: Asger Hobolth, Mogens Bladt, Lars Nørvang Andersen

Abstract: Linear functions of the site frequency spectrum (SFS) play a major role for understanding and investigating genetic diversity. Estimators of the mutation rate (e.g. based on the total number of segregating sites or average of the pairwise differences) and tests for neutrality (e.g. Tajima's D) are perhaps the most well-known examples. The distribution of linear functions of the SFS is important for constructing confidence intervals for the estimators, and to determine significance thresholds for neutrality tests. These distributions are often approximated using simulation procedures. In this paper we use multivariate phase-type theory to specify, characterize and calculate the distribution of linear functions of the site frequency spectrum. In particular, we show that many of the classical estimators of the mutation rate are distributed according to a discrete phase-type distribution. Neutrality tests, however, are generally not discrete phase-type distributed. For neutrality tests we derive the probability generating function using continuous multivariate phase-type theory, and numerically invert the function to obtain the distribution. A main result is an analytically tractable formula for the probability generating function of the SFS. Software implementation of the phase-type methodology is available in the R package phasty, and R code for the reproduction of our results is available as an accompanying vignette.

Submitted 13 January, 2021; originally announced January 2021.

MSC Class: 60J90 (Primary) 60J27; 60J28; 60J95; 92D15 (Secondary)

arXiv:1806.01416 [pdf, other]

Phase-type distributions in population genetics

Authors: Asger Hobolth, Arno Siri-Jégousse, Mogens Bladt

Abstract: Probability modelling for DNA sequence evolution is well established and provides a rich framework for understanding genetic variation between samples of individuals from one or more populations. We show that both classical and more recent models for coalescence (with or without recombination) can be described in terms of the so-called phase-type theory, where complicated and tedious calculations are circumvented by the use of matrices. The application of phase-type theory consists of describing the stochastic model as a Markov model by appropriately setting up a state space and calculating the corresponding intensity and reward matrices. Formulae of interest are then expressed in terms of these aforementioned matrices. We illustrate this by a few examples calculating the mean, variance and even higher order moments of the site frequency spectrum in the multiple merger coalescent models, and by analysing the mean and variance for the number of segregating sites for multiple samples in the two-locus ancestral recombination graph. We believe that phase-type theory has great potential as a tool for analysing probability models in population genetics. The compact matrix notation is useful for clarification of current models, in particular their formal manipulation (calculation), but also for further development or extensions.

Submitted 4 June, 2018; originally announced June 2018.
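
Note: the matrix formulation described in the abstract above can be illustrated with a minimal sketch. It assumes the standard Kingman coalescent, with time scaled so that each pair of lineages coalesces at rate 1, and uses the textbook phase-type identity E[tau] = alpha (-S)^{-1} 1 for the mean absorption time. The helper names (solve, kingman_subintensity, phase_type_mean) are hypothetical and are not taken from the paper or from any package.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def kingman_subintensity(n):
    """Sub-intensity matrix S over the transient states n, n-1, ..., 2 lineages."""
    sizes = list(range(n, 1, -1))
    p = len(sizes)
    S = [[0.0] * p for _ in range(p)]
    for i, k in enumerate(sizes):
        rate = k * (k - 1) / 2.0  # number of pairs among k lineages
        S[i][i] = -rate
        if i + 1 < p:
            S[i][i + 1] = rate    # move from k to k-1 lineages
    return S


def phase_type_mean(alpha, S):
    """Mean absorption time alpha (-S)^{-1} 1, obtained by solving (-S) u = 1."""
    neg_S = [[-value for value in row] for row in S]
    u = solve(neg_S, [1.0] * len(S))
    return sum(a * ui for a, ui in zip(alpha, u))


# Expected time to the most recent common ancestor for a sample of size 5,
# starting in the state with 5 lineages.
tmrca_5 = phase_type_mean([1.0, 0.0, 0.0, 0.0], kingman_subintensity(5))
```

For a sample of size n the result can be checked against the known closed form 2(1 - 1/n), which gives 1.6 for n = 5.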

arXiv:1501.02847 [pdf, other]

The SMC' is a highly accurate approximation to the ancestral recombination graph

Authors: Peter R. Wilton, Shai Carmi, Asger Hobolth

Abstract: Two sequentially Markov coalescent models (SMC and SMC') are available as tractable approximations to the ancestral recombination graph (ARG). We present a Markov process describing coalescence at two fixed points along a pair of sequences evolving under the SMC'. Using our Markov process, we derive a number of new quantities related to the pairwise SMC', thereby analytically quantifying for the first time the similarity between the SMC' and ARG. We use our process to show that the joint distribution of pairwise coalescence times at recombination sites under the SMC' is the same as it is marginally under the ARG, which demonstrates that the SMC' is, in a particular well-defined, intuitive sense, the most appropriate first-order sequentially Markov approximation to the ARG. Finally, we use these results to show that population size estimates under the pairwise SMC are asymptotically biased, while under the pairwise SMC' they are approximately asymptotically unbiased.

Submitted 4 March, 2015; v1 submitted 12 January, 2015; originally announced January 2015.

Comments: Revised manuscript

arXiv:1402.5790 [pdf]

Strong selective sweeps associated with ampliconic regions in great ape X chromosomes

Authors: Kiwoong Nam, Kasper Munch, Asger Hobolth, Julien Y. Dutheil, Krishna Veeramah, August Woerner, Michael F. Hammer, Great Ape Genome Diversity Project, Thomas Mailund, Mikkel H. Schierup

Abstract: The unique inheritance pattern of X chromosomes makes them preferential targets of adaptive evolution. We here investigate natural selection on the X chromosome in all species of great apes. We find that diversity is more strongly reduced around genes on the X compared with autosomes, and that a higher proportion of substitutions results from positive selection. Strikingly, the X exhibits several megabase long regions where diversity is reduced more than five fold. These regions overlap significantly among species, and have a higher singleton proportion, population differentiation, and nonsynonymous to synonymous substitution ratio. We rule out background selection and soft selective sweeps as explanations for these observations, and conclude that several strong selective sweeps have occurred independently in similar regions in several species. Since these regions are strongly associated with ampliconic sequences we propose that intra-genomic conflict between the X and the Y chromosomes is a major driver of X chromosome evolution.

Submitted 5 March, 2014; v1 submitted 24 February, 2014; originally announced February 2014.

Comments: This is the resubmitted version, with supplementary

arXiv:0910.1683 [pdf, ps, other]

doi 10.1214/09-AOAS247

Simulation from endpoint-conditioned, continuous-time Markov chains on a finite state space, with applications to molecular evolution

Authors: Asger Hobolth, Eric A. Stone

Abstract: Analyses of serially-sampled data often begin with the assumption that the observations represent discrete samples from a latent continuous-time stochastic process. The continuous-time Markov chain (CTMC) is one such generative model whose popularity extends to a variety of disciplines ranging from computational finance to human genetics and genomics. A common theme among these diverse applications is the need to simulate sample paths of a CTMC conditional on realized data that is discretely observed. Here we present a general solution to this sampling problem when the CTMC is defined on a discrete and finite state space. Specifically, we consider the generation of sample paths, including intermediate states and times of transition, from a CTMC whose beginning and ending states are known across a time interval of length $T$. We first unify the literature through a discussion of the three predominant approaches: (1) modified rejection sampling, (2) direct sampling, and (3) uniformization. We then give analytical results for the complexity and efficiency of each method in terms of the instantaneous transition rate matrix $Q$ of the CTMC, its beginning and ending states, and the length of sampling time $T$. In doing so, we show that no method dominates the others across all model specifications, and we give explicit proof of which method prevails for any given $Q,T,$ and endpoints. Finally, we introduce and compare three applications of CTMCs to demonstrate the pitfalls of choosing an inefficient sampler.

Submitted 9 October, 2009; originally announced October 2009.

Comments: Published at http://dx.doi.org/10.1214/09-AOAS247 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS247

Journal ref: Annals of Applied Statistics 2009, Vol. 3, No. 3, 1204-1231
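
Note: the sampling problem stated in the abstract above can be illustrated with a minimal sketch of plain (unmodified) rejection sampling: simulate the chain forward from the starting state and keep the path only if it is in the required state at time T. This is the naive baseline, not the modified rejection sampler, direct sampler, or uniformization scheme analysed in the paper. The function names and the two-state rate matrix are hypothetical.

```python
import random


def simulate_path(Q, start, T, rng):
    """Forward-simulate a CTMC with rate matrix Q from `start` over [0, T).

    Returns a list of (jump_time, state) pairs beginning with (0.0, start).
    """
    path = [(0.0, start)]
    t, state = 0.0, start
    while True:
        rate = -Q[state][state]      # total rate of leaving the current state
        if rate <= 0.0:              # absorbing state: no further jumps
            return path
        t += rng.expovariate(rate)   # exponential holding time
        if t >= T:
            return path
        # Choose the next state with probability proportional to its rate.
        others = [j for j in range(len(Q)) if j != state]
        weights = [Q[state][j] for j in others]
        state = rng.choices(others, weights=weights)[0]
        path.append((t, state))


def sample_conditioned_path(Q, start, end, T, rng, max_tries=100_000):
    """Naive rejection sampler: accept a forward path only if it ends in `end`."""
    for _ in range(max_tries):
        path = simulate_path(Q, start, T, rng)
        if path[-1][1] == end:
            return path
    raise RuntimeError("no path accepted; the endpoint may be very unlikely")


# Hypothetical two-state example: leave state 0 at rate 1, leave state 1 at rate 2.
Q_example = [[-1.0, 1.0], [2.0, -2.0]]
example_path = sample_conditioned_path(Q_example, 0, 1, 1.0, random.Random(1))
```

The acceptance rate equals the transition probability from the start state to the end state over time T, so this baseline becomes slow when that probability is small, which is the kind of inefficiency the abstract warns about.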

arXiv:q-bio/0511034 [pdf, ps, other]

Maximum likelihood estimation of phylogenetic tree and substitution rates via generalized neighbor-joining and the EM algorithm

Authors: Asger Hobolth, Ruriko Yoshida

Abstract: A central task in the study of molecular sequence data from present-day species is the reconstruction of the ancestral relationships. The most established approach to tree reconstruction is the maximum likelihood (ML) method. In this method, evolution is described in terms of a discrete-state continuous-time Markov process on a phylogenetic tree. The substitution rate matrix, that determines the Markov process, can be estimated using the expectation maximization (EM) algorithm. Unfortunately, an exhaustive search for the ML phylogenetic tree is computationally prohibitive for large data sets. In such situations, the neighbor-joining (NJ) method is frequently used because of its computational speed. The NJ method reconstructs trees by clustering neighboring sequences recursively, based on pairwise comparisons between the sequences. The NJ method can be generalized such that reconstruction is based on comparisons of subtrees rather than pairwise distances. In this paper, we present an algorithm for simultaneous substitution rate estimation and phylogenetic tree reconstruction. The algorithm iterates between the EM algorithm for estimating substitution rates and the generalized NJ method for tree reconstruction. Preliminary results of the approach are encouraging.

Submitted 19 November, 2005; originally announced November 2005.

Comments: 12 pages. To appear in Algebraic Biology 2005

Showing 1–10 of 10 results for author: Hobolth, A