Search | arXiv e-print repository

Generative Modeling for Tabular Data via Penalized Optimal Transport Network

Authors: Wenhui Sophia Lu, Chenyang Zhong, Wing Hung Wong

Abstract: The task of precisely learning the probability distribution of rows within tabular data and producing authentic synthetic samples is both crucial and non-trivial. Wasserstein generative adversarial network (WGAN) marks a notable improvement in generative modeling, addressing the challenges faced by its predecessor, generative adversarial network. However, due to the mixed data types and multimodalities prevalent in tabular data, the delicate equilibrium between the generator and discriminator, as well as the inherent instability of Wasserstein distance in high dimensions, WGAN often fails to produce high-fidelity samples. To this end, we propose POTNet (Penalized Optimal Transport Network), a generative deep neural network based on a novel, robust, and interpretable marginally-penalized Wasserstein (MPW) loss. POTNet can effectively model tabular data containing both categorical and continuous features. Moreover, it offers the flexibility to condition on a subset of features. We provide theoretical justifications for the motivation behind the MPW loss. We also empirically demonstrate the effectiveness of our proposed method on four different benchmarks across a variety of real-world and simulated datasets. Our proposed model achieves orders of magnitude speedup during the sampling stage compared to state-of-the-art generative models for tabular data, thereby enabling efficient large-scale synthetic data generation.

Submitted 16 February, 2024; originally announced February 2024.

Comments: 37 pages, 23 figures

arXiv:2212.05925 [pdf, other]

CausalEGM: a general causal inference framework by encoding generative modeling

Authors: Qiao Liu, Zhongren Chen, Wing Hung Wong

Abstract: Although understanding and characterizing causal effects have become essential in observational studies, it is challenging when the confounders are high-dimensional. In this article, we develop a general framework $\textit{CausalEGM}$ for estimating causal effects by encoding generative modeling, which can be applied in both binary and continuous treatment settings. Under the potential outcome framework with unconfoundedness, we establish a bidirectional transformation between the high-dimensional confounders space and a low-dimensional latent space where the density is known (e.g., multivariate normal distribution). Through this, CausalEGM simultaneously decouples the dependencies of confounders on both treatment and outcome and maps the confounders to the low-dimensional latent space. By conditioning on the low-dimensional latent features, CausalEGM can estimate the causal effect for each individual or the average causal effect within a population. Our theoretical analysis shows that the excess risk for CausalEGM can be bounded through empirical process theory. Under an assumption on encoder-decoder networks, the consistency of the estimate can be guaranteed. In a series of experiments, CausalEGM demonstrates superior performance over existing methods for both binary and continuous treatments. Specifically, we find CausalEGM to be substantially more powerful than competing methods in the presence of large sample sizes and high dimensional confounders. The software of CausalEGM is freely available at https://github.com/SUwonglab/CausalEGM.

Submitted 16 March, 2023; v1 submitted 8 December, 2022; originally announced December 2022.

arXiv:2104.10633 [pdf]

A calculus for causal inference with instrumental variables

Authors: Wing Hung Wong

Abstract: Under a general structural equation framework for causal inference, we provide a definition of the causal effect of a variable X on another variable Y, and propose an approach to estimate this causal effect via the use of instrumental variables.

Submitted 23 April, 2021; v1 submitted 21 April, 2021; originally announced April 2021.

Comments: 10 pages

arXiv:2004.09017 [pdf, other]

doi 10.1073/pnas.2101344118

Roundtrip: A Deep Generative Neural Density Estimator

Authors: Qiao Liu, Jiaze Xu, Rui Jiang, Wing Hung Wong

Abstract: Density estimation is a fundamental problem in both statistics and machine learning. In this study, we proposed Roundtrip as a general-purpose neural density estimator based on deep generative models. Roundtrip retains the generative power of generative adversarial networks (GANs) but also provides estimates of density values. Unlike previous neural density estimators that put stringent conditions on the transformation from the latent space to the data space, Roundtrip enables the use of much more general mappings. In a series of experiments, Roundtrip achieves state-of-the-art performance in a diverse range of density estimation tasks.

Submitted 4 September, 2020; v1 submitted 19 April, 2020; originally announced April 2020.

Journal ref: Proceedings of the National Academy of Sciences, 2021, 118(15)

arXiv:1908.02910 [pdf, other]

Mini-batch Metropolis-Hastings MCMC with Reversible SGLD Proposal

Authors: Tung-Yu Wu, Y. X. Rachel Wang, Wing H. Wong

Abstract: Traditional MCMC algorithms are computationally intensive and do not scale well to large data. In particular, the Metropolis-Hastings (MH) algorithm requires passing over the entire dataset to evaluate the likelihood ratio in each iteration. We propose a general framework for performing MH-MCMC using mini-batches of the whole dataset and show that this gives rise to approximately a tempered stationary distribution. We prove that the algorithm preserves the modes of the original target distribution and derive an error bound on the approximation with mild assumptions on the likelihood. To further extend the utility of the algorithm to high dimensional settings, we construct a proposal with forward and reverse moves using stochastic gradient and show that the construction leads to reasonable acceptance probabilities. We demonstrate the performance of our algorithm in both low dimensional models and high dimensional neural network applications. Particularly in the latter case, compared to popular optimization methods, our method is more robust to the choice of learning rate and improves testing accuracy.

Submitted 28 August, 2019; v1 submitted 7 August, 2019; originally announced August 2019.

arXiv:1807.06776 [pdf, other]

Detecting strong signals in gene perturbation experiments: An adaptive approach with power guarantee and FDR control

Authors: Leying Guan, Xi Chen, Wing Hung Wong

Abstract: The perturbation of a transcription factor should affect the expression levels of its direct targets. However, not all genes showing changes in expression are direct targets. To increase the chance of detecting direct targets, we propose a modified two-group model where the null group corresponds to genes which are not direct targets, but can have small non-zero effects. We model the behaviour of genes from the null set by a Gaussian distribution with unknown variance $τ^2$, and we discuss and compare three methods which adaptively estimate $τ^2$ from the data: the iterated empirical Bayes estimator, the truncated MLE and the central moment matching estimator. We conduct a detailed analysis of the properties of the iterated EB estimate which has the best performance in the simulations. In particular, we provide theoretical guarantee of its good performance under mild conditions. We provide simulations comparing the new modeling approach with existing methods, and the new approach shows more stable and better performance under different situations. We also apply it to a real data set from gene knock-down experiments and obtained better results compared with the original two-group model testing for non-zero effects.

Submitted 18 July, 2018; originally announced July 2018.

arXiv:1707.09705 [pdf, other]

Mini-batch Tempered MCMC

Authors: Dangna Li, Wing H Wong

Abstract: In this paper we propose a general framework for performing MCMC with only a mini-batch of data. We show that by estimating the Metropolis-Hastings ratio with only a mini-batch of data, one is essentially sampling from the true posterior raised to a known temperature. We show by experiments that our method, Mini-batch Tempered MCMC (MINT-MCMC), can efficiently explore multiple modes of a posterior distribution. Building on the Equi-Energy sampler (Kou et al. 2006), we developed a new parallel MCMC algorithm, which enables efficient sampling from high-dimensional multi-modal posteriors with well separated modes.

Submitted 21 May, 2018; v1 submitted 30 July, 2017; originally announced July 2017.

arXiv:1610.07213 [pdf, other]

Stochastic Modeling and Statistical Inference of Intrinsic Noise in Gene Regulation System via Chemical Master Equation

Authors: Chao Du, Wing Hong Wong

Abstract: Intrinsic noise, the stochastic cell-to-cell fluctuations in mRNAs and proteins, has been observed and proved to play important roles in cellular systems. Due to recent developments in single-cell-level measurement technology, studies of intrinsic noise are becoming increasingly popular. The chemical master equation (CME) has been used to model the evolution of complex chemical and biological systems since 1940, and often serves as the standard tool for modeling intrinsic noise in gene regulation systems. A CME-based model can capture the discrete, stochastic, and dynamical nature of a gene regulation system, and may offer causal and physical explanations of the observed data at the single-cell level. Nonetheless, the complexity of the CME also poses serious challenges for researchers in proposing practical modeling and inference frameworks. In this article, we review the existing work on the modeling and inference of intrinsic noise in gene regulation systems within the framework of the CME model. We explore the principles of constructing a CME model for studying gene regulation systems and discuss the popular approximations of the CME. Then we study the simulation methods as well as the analytical and numerical approaches that can be used to obtain solutions to a CME model. Finally, we summarize the existing statistical methods that can be used to infer the unknown parameters or structures in a CME model using single-cell-level gene expression data.

Submitted 11 November, 2017; v1 submitted 23 October, 2016; originally announced October 2016.

Comments: 64 pages, 5 figures

arXiv:1605.06220 [pdf, other]

Convergence of Contrastive Divergence with Annealed Learning Rate in Exponential Family

Authors: Bai Jiang, Tung-yu Wu, Wing H. Wong

Abstract: In our recent paper, we showed that in exponential family, contrastive divergence (CD) with fixed learning rate will give asymptotically consistent estimates \cite{wu2016convergence}. In this paper, we establish consistency and convergence rate of CD with annealed learning rate $η_t$. Specifically, suppose CD-$m$ generates the sequence of parameters $\{θ_t\}_{t \ge 0}$ using an i.i.d. data sample $\mathbf{X}_1^n \sim p_{θ^*}$ of size $n$, then $δ_n(\mathbf{X}_1^n) = \limsup_{t \to \infty} \Vert \sum_{s=t_0}^t η_s θ_s / \sum_{s=t_0}^t η_s - θ^* \Vert$ converges in probability to 0 at a rate of $1/\sqrt[3]{n}$. The number ($m$) of MCMC transitions in CD only affects the coefficient factor of convergence rate. Our proof is not a simple extension of the one in \cite{wu2016convergence}, which depends critically on the fact that $\{θ_t\}_{t \ge 0}$ is a homogeneous Markov chain conditional on the observed sample $\mathbf{X}_1^n$. Under annealed learning rate, the homogeneous Markov property is not available and we have to develop an alternative approach based on super-martingales. Experiment results of CD on a fully-visible $2\times 2$ Boltzmann Machine are provided to demonstrate our theoretical results.

Submitted 20 May, 2016; originally announced May 2016.

arXiv:1603.05729 [pdf, other]

Convergence of Contrastive Divergence Algorithm in Exponential Family

Authors: Bai Jiang, Tung-Yu Wu, Yifan Jin, Wing H. Wong

Abstract: The Contrastive Divergence (CD) algorithm has achieved notable success in training energy-based models including Restricted Boltzmann Machines and played a key role in the emergence of deep learning. The idea of this algorithm is to approximate the intractable term in the exact gradient of the log-likelihood function by using short Markov chain Monte Carlo (MCMC) runs. The approximate gradient is computationally cheap but biased. Whether and why the CD algorithm provides an asymptotically consistent estimate are still open questions. This paper studies the asymptotic properties of the CD algorithm in canonical exponential families, which are special cases of the energy-based model. Suppose the CD algorithm runs $m$ MCMC transition steps at each iteration $t$ and iteratively generates a sequence of parameter estimates $\{θ_t\}_{t \ge 0}$ given an i.i.d. data sample $\{X_i\}_{i=1}^n \sim p_{θ_\star}$. Under conditions which are commonly obeyed by the CD algorithm in practice, we prove the existence of some bounded $m$ such that any limit point of the time average $\left. \sum_{s=0}^{t-1} θ_s \right/ t$ as $t \to \infty$ is a consistent estimate for the true parameter $θ_\star$. Our proof is based on the fact that $\{θ_t\}_{t \ge 0}$ is a homogeneous Markov chain conditional on the data sample $\{X_i\}_{i=1}^n$. This chain meets the Foster-Lyapunov drift criterion and converges to a random walk around the Maximum Likelihood Estimate. The range of the random walk shrinks to zero at rate $\mathcal{O}(1/\sqrt[3]{n})$ as the sample size $n \to \infty$.

Submitted 27 February, 2018; v1 submitted 17 March, 2016; originally announced March 2016.

MSC Class: 68W48; 60J20; 93E15

arXiv:1510.02175 [pdf, other]

doi 10.5705/ss.202015.0340

Learning Summary Statistic for Approximate Bayesian Computation via Deep Neural Network

Authors: Bai Jiang, Tung-yu Wu, Charles Zheng, Wing H. Wong

Abstract: Approximate Bayesian Computation (ABC) methods are used to approximate posterior distributions in models with unknown or computationally intractable likelihoods. Both the accuracy and computational efficiency of ABC depend on the choice of summary statistic, but outside of special cases where the optimal summary statistics are known, it is unclear which guiding principles can be used to construct effective summary statistics. In this paper we explore the possibility of automating the process of constructing summary statistics by training deep neural networks to predict the parameters from artificially generated data: the resulting summary statistics are approximately posterior means of the parameters. With minimal model-specific tuning, our method constructs summary statistics for the Ising model and the moving-average model, which match or exceed theoretically-motivated summary statistics in terms of the accuracies of the resulting posteriors.

Submitted 16 March, 2017; v1 submitted 7 October, 2015; originally announced October 2015.

Comments: 27 pages, 10 figures

arXiv:1410.0726 [pdf, other]

co-BPM: a Bayesian Model for Divergence Estimation

Authors: Kun Yang, Hao Su, Wing Hung Wong

Abstract: Divergence is not only an important mathematical concept in information theory, but is also applied to machine learning problems such as low-dimensional embedding, manifold learning, clustering, classification, and anomaly detection. We proposed a Bayesian model---co-BPM---to characterize the discrepancy of two sample sets, i.e., to estimate the divergence of their underlying distributions. In order to avoid the pitfalls of plug-in methods that estimate each density independently, our Bayesian model attempts to learn a coupled binary partition of the sample space that best captures the landscapes of both distributions, then makes direct inference on their divergences. The prior is constructed by leveraging the sequential buildup of the coupled binary partitions and the posterior is sampled via our specialized MCMC. Our model provides a unified way to estimate various types of divergences and enjoys convincing accuracy. We demonstrate its effectiveness through simulations, comparisons with the \emph{state-of-the-art} and a real data example.

Submitted 20 November, 2016; v1 submitted 2 October, 2014; originally announced October 2014.

Comments: Key Words: coupled binary partition, divergence, MCMC, clustering, classification

arXiv:1404.1425 [pdf, other]

Density Estimation via Discrepancy Based Adaptive Sequential Partition

Authors: Dangna Li, Kun Yang, Wing Hung Wong

Abstract: Given $iid$ observations from an unknown absolutely continuous distribution defined on some domain $Ω$, we propose a nonparametric method to learn a piecewise constant function to approximate the underlying probability density function. Our density estimate is a piecewise constant function defined on a binary partition of $Ω$. The key ingredient of the algorithm is to use discrepancy, a concept that originates from Quasi Monte Carlo analysis, to control the partition process. The resulting algorithm is simple, efficient, and has a provable convergence rate. We empirically demonstrate its efficiency as a density estimation method. We present its applications on a wide range of tasks, including finding good initializations for k-means.

Submitted 11 March, 2018; v1 submitted 4 April, 2014; originally announced April 2014.

Comments: Binary Partition, Star Discrepancy, Density Estimation, Mode Seeking, Level Set Tree

arXiv:1403.4370 [pdf, other]

Discovering and Visualizing Hierarchy in Multivariate Data

Authors: Kun Yang, Wing Hung Wong

Abstract: How to extract useful insights from data is always a challenge, especially if the data is multidimensional. Often, the data can be organized according to certain hierarchical structures that stem either from the data collection process or from the information and phenomena carried by the data itself. The current study attempts to discover and visualize these underlying hierarchies. By regarding each observation in the data as a draw from a (hypothetical) multidimensional joint density, our first goal is to approximate this unknown density with a piecewise constant function via binary partition; our non-parametric approach makes no assumptions on the form of the density. Given the piecewise constant density function and its corresponding binary partition, our second goal is to construct a connected graph and build up a tree representation of the data by level sets. To demonstrate that our method is a general data mining and visualization tool which can provide "multi-resolution" summaries and reveal different levels of information of the data, we apply it to two real data sets from Flow Cytometry and Social Network.

Submitted 20 April, 2016; v1 submitted 18 March, 2014; originally announced March 2014.

arXiv:1401.2597 [pdf, other]

Multivariate Density Estimation via Adaptive Partitioning (I): Sieve MLE

Authors: Linxi Liu, Wing Hung Wong

Abstract: We study a non-parametric approach to multivariate density estimation. The estimators are piecewise constant density functions supported by binary partitions. The partition of the sample space is learned by maximizing the likelihood of the corresponding histogram on that partition. We analyze the convergence rate of the sieve maximum likelihood estimator, and reach a conclusion that for a relatively rich class of density functions the rate does not directly depend on the dimension. This suggests that, under certain conditions, this method is immune to the curse of dimensionality, in the sense that it is possible to get close to the parametric rate even in high dimensions. We also apply this method to several special cases, and calculate the explicit convergence rates respectively.

Submitted 19 August, 2015; v1 submitted 12 January, 2014; originally announced January 2014.

arXiv:1309.5489 [pdf, other]

Computational Aspects of Optional Pólya Tree

Authors: Hui Jiang, John C. Mu, Kun Yang, Chao Du, Luo Lu, Wing Hung Wong

Abstract: Optional Pólya Tree (OPT) is a flexible non-parametric Bayesian model for density estimation. Despite its merits, the computation for OPT inference is challenging. In this paper we present time complexity analysis for OPT inference and propose two algorithmic improvements. The first improvement, named Limited-Lookahead Optional Pólya Tree (LL-OPT), aims to greatly accelerate the computation for OPT inference. The second improvement modifies the output of OPT or LL-OPT and produces a continuous piecewise linear density estimate. We demonstrate the performance of these two improvements using simulations.

Submitted 21 September, 2013; originally announced September 2013.

arXiv:1207.3137 [pdf, ps, other]

doi 10.1214/13-AOAS645

Learning a nonlinear dynamical system model of gene regulation: A perturbed steady-state approach

Authors: Arwen Vanice Bradley, Ye Henry Li, Bokyung Choi, Wing Hung Wong

Abstract: Biological structure and function depend on complex regulatory interactions between many genes. A wealth of gene expression data is available from high-throughput genome-wide measurement technologies, but effective gene regulatory network inference methods are still needed. Model-based methods founded on quantitative descriptions of gene regulation are among the most promising, but many such methods rely on simple, local models or on ad hoc inference approaches lacking experimental interpretability. We propose an experimental design and develop an associated statistical method for inferring a gene network by learning a standard quantitative, interpretable, predictive, biophysics-based ordinary differential equation model of gene regulation. We fit the model parameters using gene expression measurements from perturbed steady-states of the system, like those following overexpression or knockdown experiments. Although the original model is nonlinear, our design allows us to transform it into a convex optimization problem by restricting attention to steady-states and using the lasso for parameter selection. Here, we describe the model and inference algorithm and apply them to a synthetic six-gene system, demonstrating that the model is detailed and flexible enough to account for activation and repression as well as synergistic and self-regulation, and the algorithm can efficiently and accurately recover the parameters used to generate the data.

Submitted 25 March, 2016; v1 submitted 12 July, 2012; originally announced July 2012.

Comments: Published at http://dx.doi.org/10.1214/13-AOAS645 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS645

Journal ref: Annals of Applied Statistics 2013, Vol. 7, No. 3, 1311-1333

arXiv:1106.3211 [pdf, ps, other]

doi 10.1214/10-STS343

Statistical Modeling of RNA-Seq Data

Authors: Julia Salzman, Hui Jiang, Wing Hung Wong

Abstract: Recently, ultra high-throughput sequencing of RNA (RNA-Seq) has been developed as an approach for analysis of gene expression. By obtaining tens or even hundreds of millions of reads of transcribed sequences, an RNA-Seq experiment can offer a comprehensive survey of the population of genes (transcripts) in any sample of interest. This paper introduces a statistical model for estimating isoform abundance from RNA-Seq data and is flexible enough to accommodate both single end and paired end RNA-Seq data and sampling bias along the length of the transcript. Based on the derivation of minimal sufficient statistics for the model, a computationally feasible implementation of the maximum likelihood estimator of the model is provided. Further, it is shown that using paired end RNA-Seq provides more accurate isoform abundance estimates than single end sequencing at fixed sequencing depth. Simulation studies are also given.

Submitted 16 June, 2011; originally announced June 2011.

Comments: Published at http://dx.doi.org/10.1214/10-STS343 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-STS-STS343

Journal ref: Statistical Science 2011, Vol. 26, No. 1, 62-83

arXiv:1104.2210 [pdf, ps, other]

doi 10.1214/10-STS341

From EM to Data Augmentation: The Emergence of MCMC Bayesian Computation in the 1980s

Authors: Martin A. Tanner, Wing H. Wong

Abstract: It was known from Metropolis et al. [J. Chem. Phys. 21 (1953) 1087--1092] that one can sample from a distribution by performing Monte Carlo simulation from a Markov chain whose equilibrium distribution is equal to the target distribution. However, it took several decades before the statistical community embraced Markov chain Monte Carlo (MCMC) as a general computational tool in Bayesian inference. The usual reasons that are advanced to explain why statisticians were slow to catch on to the method include lack of computing power and unfamiliarity with the early dynamic Monte Carlo papers in the statistical physics literature. We argue that there was a deeper reason, namely, that the structure of problems in the statistical mechanics and those in the standard statistical literature are different. To make the methods usable in standard Bayesian problems, one had to exploit the power that comes from the introduction of judiciously chosen auxiliary variables and collective moves. This paper examines the development in the critical period 1980--1990, when the ideas of Markov chain simulation from the statistical physics literature and the latent variable formulation in maximum likelihood computation (i.e., EM algorithm) came together to spark the widespread application of MCMC methods in Bayesian computation.

Submitted 12 April, 2011; originally announced April 2011.

Comments: Published at http://dx.doi.org/10.1214/10-STS341 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-STS-STS341

Journal ref: Statistical Science 2010, Vol. 25, No. 4, 506-516

arXiv:1011.1253 [pdf, ps, other]

Coupling optional Pólya trees and the two sample problem

Authors: Li Ma, Wing H. Wong

Abstract: Testing and characterizing the difference between two data samples is of fundamental interest in statistics. Existing methods such as Kolmogorov-Smirnov and Cramer-von-Mises tests do not scale well as the dimensionality increases and provide no easy way to characterize the difference should it exist. In this work, we propose a theoretical framework for inference that addresses these challenges in the form of a prior for Bayesian nonparametric analysis. The new prior is constructed based on a random-partition-and-assignment procedure similar to the one that defines the standard optional Pólya tree distribution, but has the ability to generate multiple random distributions jointly. These random probability distributions are allowed to "couple", that is to have the same conditional distribution, on subsets of the sample space. We show that this "coupling optional Pólya tree" prior provides a convenient and effective way for both the testing of two sample difference and the learning of the underlying structure of the difference. In addition, we discuss some practical issues in the computational implementation of this prior and provide several numerical examples to demonstrate its work.

Submitted 22 March, 2011; v1 submitted 4 November, 2010; originally announced November 2010.

Comments: 44 pages, 6 figures

MSC Class: 62F15; 62G99

arXiv:0901.3999 [pdf, ps, other]

doi 10.1214/08-AOAS196

Reconstructing the energy landscape of a distribution from Monte Carlo samples

Authors: Qing Zhou, Wing Hung Wong

Abstract: Defining the energy function as the negative logarithm of the density, we explore the energy landscape of a distribution via the tree of sublevel sets of its energy. This tree represents the hierarchy among the connected components of the sublevel sets. We propose ways to annotate the tree so that it provides information on both topological and statistical aspects of the distribution, such as the local energy minima (local modes), their local domains and volumes, and the barriers between them. We develop a computational method to estimate the tree and reconstruct the energy landscape from Monte Carlo samples simulated at a wide energy range of a distribution. This method can be applied to any arbitrary distribution on a space with defined connectedness. We test the method on multimodal distributions and posterior distributions to show that our estimated trees are accurate compared to theoretical values. When used to perform Bayesian inference of DNA sequence segmentation, this approach reveals much more information than the standard approach based on marginal posterior distributions.

Submitted 26 January, 2009; originally announced January 2009.

Comments: Published at http://dx.doi.org/10.1214/08-AOAS196 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS196

Journal ref: Annals of Applied Statistics 2008, Vol. 2, No. 4, 1307-1331

arXiv:0708.4318 [pdf, ps, other]

doi 10.1214/07-AOAS103

Coupling hidden Markov models for the discovery of Cis-regulatory modules in multiple species

Authors: Qing Zhou, Wing Hung Wong

Abstract: Cis-regulatory modules (CRMs) composed of multiple transcription factor binding sites (TFBSs) control gene expression in eukaryotic genomes. Comparative genomic studies have shown that these regulatory elements are more conserved across species due to evolutionary constraints. We propose a statistical method to combine module structure and cross-species orthology in de novo motif discovery. We use a hidden Markov model (HMM) to capture the module structure in each species and couple these HMMs through multiple-species alignment. Evolutionary models are incorporated to consider correlated structures among aligned sequence positions across different species. Based on our model, we develop a Markov chain Monte Carlo approach, MultiModule, to discover CRMs and their component motifs simultaneously in groups of orthologous sequences from multiple species. Our method is tested on both simulated and biological data sets in mammals and Drosophila, where significant improvement over other motif and module discovery methods is observed.

Submitted 31 August, 2007; originally announced August 2007.

Comments: Published at http://dx.doi.org/10.1214/07-AOAS103 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS103

Journal ref: Annals of Applied Statistics 2007, Vol. 1, No. 1, 36-65

Showing 1–22 of 22 results for author: Wong, W H