-
A Variational Spike-and-Slab Approach for Group Variable Selection
Authors:
Buyu Lin,
Changhao Ge,
Jun S. Liu
Abstract:
We introduce a class of generic spike-and-slab priors for high-dimensional linear regression with grouped variables and present a Coordinate-ascent Variational Inference (CAVI) algorithm for obtaining an optimal variational Bayes approximation. Using parameter expansion for a specific, yet comprehensive, family of slab distributions, we obtain a further gain in computational efficiency. The method can be easily extended to fitting additive models. Theoretically, we present general conditions on the generic spike-and-slab priors that enable us to derive the contraction rates for both the true posterior and the VB posterior for linear regression and additive models, of which some previous theoretical results can be viewed as special cases. Our simulation studies and real data application demonstrate that the proposed method is superior to existing methods in both variable selection and parameter estimation. Our algorithm is implemented in the R package GVSSB.
Submitted 28 September, 2023;
originally announced September 2023.
-
On the Optimality of Functional Sliced Inverse Regression
Authors:
Rui Chen,
Songtao Tian,
Dongming Huang,
Qian Lin,
Jun S. Liu
Abstract:
In this paper, we prove that functional sliced inverse regression (FSIR) achieves the optimal (minimax) rate for estimating the central space in functional sufficient dimension reduction problems. First, we provide a concentration inequality for the FSIR estimator of the covariance of the conditional mean, i.e., $\var(\E[\boldsymbol{X}\mid Y])$. Based on this inequality, we establish the root-$n$ consistency of the FSIR estimator of the image of $\var(\E[\boldsymbol{X}\mid Y])$. Second, we apply the most widely used truncated scheme to estimate the inverse of the covariance operator and identify the truncation parameter which ensures that FSIR can achieve the optimal minimax convergence rate for estimating the central space. Finally, we conduct simulations to demonstrate the optimal choice of truncation parameter and the estimation efficiency of FSIR. To the best of our knowledge, this is the first paper to rigorously prove the minimax optimality of FSIR in estimating the central space for multiple-index models and general $Y$ (not necessarily discrete).
Submitted 6 July, 2023;
originally announced July 2023.
-
On Gibbs Sampling for Structured Bayesian Models Discussion of paper by Zanella and Roberts
Authors:
Xiaodong Yang,
Jun S. Liu
Abstract:
This article is a discussion of Zanella and Roberts' paper "Multilevel linear models, Gibbs samplers and multigrid decompositions". We consider several extensions in which the multigrid decomposition would bring interesting insights, including vector hierarchical models, linear mixed effects models and partial centering parametrizations.
Submitted 16 December, 2021;
originally announced December 2021.
-
Convergence Rate of Multiple-try Metropolis Independent sampler
Authors:
Xiaodong Yang,
Jun S. Liu
Abstract:
The Multiple-try Metropolis (MTM) method is an interesting extension of the classical Metropolis-Hastings algorithm. However, a theoretical understanding of its convergence behavior, and of whether and how it may help, is still lacking. This paper derives the exact convergence rate for the Multiple-try Metropolis Independent sampler (MTM-IS) via an explicit eigen analysis. As a by-product, we prove that MTM-IS is less efficient than the simpler repeated independent Metropolis-Hastings method at the same computational cost. We further explore more variations and find it possible to design more efficient MTM algorithms by creating correlated multiple trials.
Submitted 3 February, 2023; v1 submitted 29 November, 2021;
originally announced November 2021.
-
Power of Knockoff: The Impact of Ranking Algorithm, Augmented Design, and Symmetric Statistic
Authors:
Zheng Tracy Ke,
Jun S. Liu,
Yucong Ma
Abstract:
The knockoff filter is a recent false discovery rate (FDR) control method for high-dimensional linear models. We point out that knockoff has three key components: ranking algorithm, augmented design, and symmetric statistic, and each component admits multiple choices. By considering various combinations of the three components, we obtain a collection of variants of knockoff. All these variants guarantee finite-sample FDR control, and our goal is to compare their power. We assume a Rare and Weak signal model on regression coefficients and compare the power of different variants of knockoff by deriving explicit formulas for the false positive rate and false negative rate. Our results provide new insights on how to improve power when controlling FDR at a targeted level. We also compare the power of knockoff with its prototype, a method that uses the same ranking algorithm but has access to an ideal threshold. The comparison reveals the additional price one pays by finding a data-driven threshold to control FDR.
Submitted 13 February, 2024; v1 submitted 15 October, 2020;
originally announced October 2020.
-
Stratification and Optimal Resampling for Sequential Monte Carlo
Authors:
Yichao Li,
Wenshuo Wang,
Ke Deng,
Jun S. Liu
Abstract:
Sequential Monte Carlo (SMC), also known as particle filters, has been widely accepted as a powerful computational tool for making inference with dynamical systems. A key step in SMC is resampling, which plays the role of steering the algorithm towards the future dynamics. Several strategies have been proposed and used in practice, including multinomial resampling, residual resampling (Liu and Chen 1998), optimal resampling (Fearnhead and Clifford 2003), stratified resampling (Kitagawa 1996), and optimal transport resampling (Reich 2013). We show that, in the one dimensional case, optimal transport resampling is equivalent to stratified resampling on the sorted particles, and they both minimize the resampling variance as well as the expected squared energy distance between the original and resampled empirical distributions; in the multidimensional case, the variance of stratified resampling after sorting particles using Hilbert curve (Gerber et al. 2019) in $\mathbb{R}^d$ is $O(m^{-(1+2/d)})$, an improved rate compared to the original $O(m^{-(1+1/d)})$, where $m$ is the number of resampled particles. This improved rate is the lowest for ordered stratified resampling schemes, as conjectured in Gerber et al. (2019). We also present an almost sure bound on the Wasserstein distance between the original and Hilbert-curve-resampled empirical distributions. In light of these theoretical results, we propose the stratified multiple-descendant growth (SMG) algorithm, which allows us to explore the sample space more efficiently compared to the standard i.i.d. multiple-descendant sampling-resampling approach as measured by the Wasserstein metric. Numerical evidence is provided to demonstrate the effectiveness of our proposed method.
Submitted 7 December, 2020; v1 submitted 4 April, 2020;
originally announced April 2020.
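The one-dimensional scheme the abstract refers to, stratified resampling applied to sorted particles, can be sketched as follows. This is only an illustrative sketch of the generic textbook procedure, not the authors' SMG algorithm or their code; the function name `stratified_resample_sorted` and its arguments are hypothetical.

```python
import numpy as np

def stratified_resample_sorted(particles, weights, m, rng):
    """Draw m particles by stratified resampling after sorting (1-D case).

    Hypothetical helper for illustration only.
    """
    particles = np.asarray(particles, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Sort particles and carry their normalized weights along.
    order = np.argsort(particles)
    x = particles[order]
    w = weights[order] / weights.sum()

    # Cumulative weights; force the last value to 1 to guard against round-off.
    cdf = np.cumsum(w)
    cdf[-1] = 1.0

    # One uniform draw inside each stratum [k/m, (k+1)/m), k = 0, ..., m-1.
    u = (np.arange(m) + rng.random(m)) / m

    # Invert the weighted empirical CDF at each stratified uniform.
    idx = np.searchsorted(cdf, u, side="right")
    return x[idx]
```

A call such as `stratified_resample_sorted(x, w, 100, np.random.default_rng(0))` returns 100 resampled values in nondecreasing order, because both the stratified uniforms and the sorted particles are increasing.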
-
Minimax Nonparametric Two-sample Test under Smoothing
Authors:
Xin Xing,
Zuofeng Shang,
Pang Du,
** Ma,
Wenxuan Zhong,
Jun S. Liu
Abstract:
We consider the problem of comparing probability densities between two groups. A new probabilistic tensor product smoothing spline framework is developed to model the joint density of two variables. Under such a framework, the probability density comparison is equivalent to testing the presence/absence of interactions. We propose a penalized likelihood ratio test for such interaction testing and show that the test statistic is asymptotically chi-square distributed under the null hypothesis. Furthermore, we derive a sharp minimax testing rate based on the Bernstein width for nonparametric two-sample tests and show that our proposed test statistic is minimax optimal. In addition, a data-adaptive tuning criterion is developed to choose the penalty parameter. Simulations and real applications demonstrate that the proposed test outperforms the conventional approaches under various scenarios.
Submitted 11 January, 2021; v1 submitted 5 November, 2019;
originally announced November 2019.
-
The Wang-Landau Algorithm as Stochastic Optimization and Its Acceleration
Authors:
Chenguang Dai,
Jun S. Liu
Abstract:
We show that the Wang-Landau algorithm can be formulated as a stochastic gradient descent algorithm minimizing a smooth and convex objective function, of which the gradient is estimated using Markov chain Monte Carlo iterations. The optimization formulation provides us a new way to establish the convergence rate of the Wang-Landau algorithm, by exploiting the fact that almost surely, the density estimates (on the logarithmic scale) remain in a compact set, upon which the objective function is strongly convex. The optimization viewpoint motivates us to improve the efficiency of the Wang-Landau algorithm using popular tools including the momentum method and the adaptive learning rate method. We demonstrate the accelerated Wang-Landau algorithm on a two-dimensional Ising model and a two-dimensional ten-state Potts model.
Submitted 2 February, 2020; v1 submitted 27 July, 2019;
originally announced July 2019.
-
Global testing under the sparse alternatives for single index models
Authors:
Qian Lin,
Zhigen Zhao,
Jun S. Liu
Abstract:
For the single index model $y=f(β^τx,ε)$ with Gaussian design, where $f$ is unknown and $β$ is a sparse $p$-dimensional unit vector with at most $s$ nonzero entries, we are interested in testing the null hypothesis that $β$, when viewed as a whole vector, is zero against the alternative that some entries of $β$ are nonzero. Assuming that $var(\mathbb{E}[x \mid y])$ is non-vanishing, we define the generalized signal-to-noise ratio (gSNR) $λ$ of the model as the unique non-zero eigenvalue of $var(\mathbb{E}[x \mid y])$. We show that if $s^{2}\log^2(p)\wedge p$ is of smaller order than $n$, denoted as $s^{2}\log^2(p)\wedge p\prec n$, where $n$ is the sample size, one can detect the existence of signals if and only if gSNR$\succ\frac{p^{1/2}}{n}\wedge \frac{s\log(p)}{n}$. Furthermore, if the noise is additive (i.e., $y=f(β^τx)+ε$), one can detect the existence of the signal if and only if gSNR$\succ\frac{p^{1/2}}{n}\wedge \frac{s\log(p)}{n} \wedge \frac{1}{\sqrt{n}}$. It is rather surprising that the detection boundary for the single index model with additive noise matches that for linear regression models. These results pave the road for thorough theoretical analysis of single/multiple index models in high dimensions.
Submitted 4 May, 2018;
originally announced May 2018.
-
On the optimality of sliced inverse regression in high dimensions
Authors:
Qian Lin,
Xinran Li,
Dongming Huang,
Jun S. Liu
Abstract:
The central subspace of a pair of random variables $(y,x) \in \mathbb{R}^{p+1}$ is the minimal subspace $\mathcal{S}$ such that $y \perp \hspace{-2mm} \perp x\mid P_{\mathcal{S}}x$. In this paper, we consider the minimax rate of estimating the central space of the multiple index models $y=f(β_{1}^τx,β_{2}^τx,...,β_{d}^τx,ε)$ with at most $s$ active predictors where $x \sim N(0,I_{p})$. We first introduce a large class of models depending on the smallest non-zero eigenvalue $λ$ of $var(\mathbb{E}[x|y])$, over which we show that an aggregated estimator based on the SIR procedure converges at rate $d\wedge((sd+s\log(ep/s))/(nλ))$. We then show that this rate is optimal in two scenarios: the single index models; and the multiple index models with fixed central dimension $d$ and fixed $λ$. By assuming a technical conjecture, we can show that this rate is also optimal for multiple index models with bounded dimension of the central space. We believe that these (conditional) optimal rate results bring us meaningful insights into general SDR problems in high dimensions.
Submitted 23 January, 2017; v1 submitted 21 January, 2017;
originally announced January 2017.
-
Sparse Sliced Inverse Regression Via Lasso
Authors:
Qian Lin,
Zhigen Zhao,
Jun S. Liu
Abstract:
For multiple index models, it has recently been shown that the sliced inverse regression (SIR) is consistent for estimating the sufficient dimension reduction (SDR) space if and only if $ρ=\lim\frac{p}{n}=0$, where $p$ is the dimension and $n$ is the sample size. Thus, when $p$ is of the same or a higher order of $n$, additional assumptions such as sparsity must be imposed in order to ensure consistency for SIR. By constructing artificial response variables made up from top eigenvectors of the estimated conditional covariance matrix, we introduce a simple Lasso regression method to obtain an estimate of the SDR space. The resulting algorithm, Lasso-SIR, is shown to be consistent and achieve the optimal convergence rate under certain sparsity conditions when $p$ is of order $o(n^2λ^2)$, where $λ$ is the generalized signal-to-noise ratio. We also demonstrate the superior performance of Lasso-SIR compared with existing approaches via extensive numerical studies and several real data examples.
Submitted 17 June, 2018; v1 submitted 21 November, 2016;
originally announced November 2016.
-
L1-Regularized Least Squares for Support Recovery of High Dimensional Single Index Models with Gaussian Designs
Authors:
Matey Neykov,
Jun S. Liu,
Tianxi Cai
Abstract:
It is known that for a certain class of single index models (SIMs) $Y = f(\boldsymbol{X}_{p \times 1}^\intercal\boldsymbolβ_0, \varepsilon)$, support recovery is impossible when $\boldsymbol{X} \sim \mathcal{N}(0, \mathbb{I}_{p \times p})$ and a model complexity adjusted sample size is below a critical threshold. Recently, optimal algorithms based on Sliced Inverse Regression (SIR) were suggested. These algorithms work provably under the assumption that the design $\boldsymbol{X}$ comes from an i.i.d. Gaussian distribution. In the present paper we analyze algorithms based on covariance screening and least squares with $L_1$ penalization (i.e. LASSO) and demonstrate that they can also enjoy optimal (up to a scalar) rescaled sample size in terms of support recovery, albeit under slightly different assumptions on $f$ and $\varepsilon$ compared to the SIR based algorithms. Furthermore, we show more generally, that LASSO succeeds in recovering the signed support of $\boldsymbolβ_0$ if $\boldsymbol{X} \sim \mathcal{N}(0, \boldsymbolΣ)$, and the covariance $\boldsymbolΣ$ satisfies the irrepresentable condition. Our work extends existing results on the support recovery of LASSO for the linear model, to a more general class of SIMs.
Submitted 22 June, 2016; v1 submitted 25 November, 2015;
originally announced November 2015.
-
Signed Support Recovery for Single Index Models in High-Dimensions
Authors:
Matey Neykov,
Qian Lin,
Jun S. Liu
Abstract:
In this paper we study the support recovery problem for single index models $Y=f(\boldsymbol{X}^{\intercal} \boldsymbolβ,\varepsilon)$, where $f$ is an unknown link function, $\boldsymbol{X}\sim N_p(0,\mathbb{I}_{p})$ and $\boldsymbolβ$ is an $s$-sparse unit vector such that $\boldsymbolβ_{i}\in \{\pm\frac{1}{\sqrt{s}},0\}$. In particular, we look into the performance of two computationally inexpensive algorithms: (a) the diagonal thresholding sliced inverse regression (DT-SIR) introduced by Lin et al. (2015); and (b) a semi-definite programming (SDP) approach inspired by Amini & Wainwright (2008). When $s=O(p^{1-δ})$ for some $δ>0$, we demonstrate that both procedures can succeed in recovering the support of $\boldsymbolβ$ as long as the rescaled sample size $κ=\frac{n}{s\log(p-s)}$ is larger than a certain critical threshold. On the other hand, when $κ$ is smaller than a critical value, any algorithm fails to recover the support with probability at least $\frac{1}{2}$ asymptotically. In other words, we demonstrate that both DT-SIR and the SDP approach are optimal (up to a scalar) for recovering the support of $\boldsymbolβ$ in terms of sample size. We provide extensive simulations, as well as a real dataset application to help verify our theoretical observations.
Submitted 22 June, 2016; v1 submitted 6 November, 2015;
originally announced November 2015.
-
A Unified Theory of Confidence Regions and Testing for High Dimensional Estimating Equations
Authors:
Matey Neykov,
Yang Ning,
Jun S. Liu,
Han Liu
Abstract:
We propose a new inferential framework for constructing confidence regions and testing hypotheses in statistical models specified by a system of high dimensional estimating equations. We construct an influence function by projecting the fitted estimating equations to a sparse direction obtained by solving a large-scale linear program. Our main theoretical contribution is to establish a unified Z-estimation theory of confidence regions for high dimensional problems. Different from existing methods, all of which require the specification of the likelihood or pseudo-likelihood, our framework is likelihood-free. As a result, our approach provides valid inference for a broad class of high dimensional constrained estimating equation problems, which are not covered by existing methods. Such examples include noisy compressed sensing, instrumental variable regression, undirected graphical models, discriminant analysis and vector autoregressive models. We present detailed theoretical results for all these examples. Finally, we conduct thorough numerical simulations and a real dataset analysis to back up the developed theoretical results.
Submitted 22 June, 2016; v1 submitted 30 October, 2015;
originally announced October 2015.
-
On consistency and sparsity for sliced inverse regression in high dimensions
Authors:
Qian Lin,
Zhigen Zhao,
Jun S. Liu
Abstract:
We provide here a framework to analyze the phase transition phenomenon of sliced inverse regression (SIR), a supervised dimension reduction technique introduced by Li (1991). Under mild conditions, the asymptotic ratio $ρ= \lim p/n$ is the phase transition parameter and the SIR estimator is consistent if and only if $ρ= 0$. When dimension $p$ is greater than $n$, we propose a diagonal thresholding screening SIR (DT-SIR) algorithm. This method provides us with an estimate of the eigen-space of the covariance matrix of the conditional expectation $var(\mathbf{E}[\boldsymbol{x}|y])$. The desired dimension reduction space is then obtained by multiplying the inverse of the covariance matrix on the eigen-space. Under certain sparsity assumptions on both the covariance matrix of predictors and the loadings of the directions, we prove the consistency of DT-SIR in estimating the dimension reduction space in high dimensional data analysis. Extensive numerical experiments demonstrate superior performance of the proposed method in comparison to its competitors.
Submitted 21 November, 2016; v1 submitted 14 July, 2015;
originally announced July 2015.
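The two steps named in the abstract above (estimate the eigen-space of $var(\mathbf{E}[\boldsymbol{x}|y])$, then multiply by the inverse covariance matrix) can be sketched for the classical low-dimensional SIR estimator as follows. This is a minimal sketch under stated assumptions: it omits the diagonal thresholding screening step that defines DT-SIR, it is not the authors' implementation, and the name `sir_directions` and its parameters are hypothetical.

```python
import numpy as np

def sir_directions(x, y, n_slices=10, d=1):
    """Classical sliced inverse regression (no thresholding step).

    Hypothetical helper: returns a (p, d) array of unit-norm direction
    estimates for predictors x of shape (n, p) and response y of shape (n,).
    """
    n, p = x.shape
    xc = x - x.mean(axis=0)           # center the predictors
    sigma = xc.T @ xc / n             # sample covariance of x

    # Slice the sample by the order of y into roughly equal-sized groups.
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)

    # Weighted covariance of slice means: an estimate of var(E[x | y]).
    lam = np.zeros((p, p))
    for idx in slices:
        mean_h = xc[idx].mean(axis=0)
        lam += (len(idx) / n) * np.outer(mean_h, mean_h)

    # Top-d eigenvectors (eigh returns eigenvalues in ascending order).
    _, evecs = np.linalg.eigh(lam)
    top = evecs[:, ::-1][:, :d]

    # Multiply by the inverse covariance to get the reduction directions.
    b = np.linalg.solve(sigma, top)
    return b / np.linalg.norm(b, axis=0)
```

Because `np.linalg.solve` needs an invertible sample covariance, this sketch only makes sense when $p$ is well below $n$; the regime $p > n$ discussed in the abstract is exactly where a screening step becomes necessary.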
-
Model Selection Principles in Misspecified Models
Authors:
**chi Lv,
Jun S. Liu
Abstract:
Model selection is of fundamental importance to high dimensional modeling featured in many contemporary applications. Classical principles of model selection include the Kullback-Leibler divergence principle and the Bayesian principle, which lead to the Akaike information criterion and Bayesian information criterion when models are correctly specified. Yet model misspecification is unavoidable when we have no knowledge of the true model or when we have the correct family of distributions but miss some true predictor. In this paper, we propose a family of semi-Bayesian principles for model selection in misspecified models, which combine the strengths of the two well-known principles. We derive asymptotic expansions of the semi-Bayesian principles in misspecified generalized linear models, which give the new semi-Bayesian information criteria (SIC). A specific form of SIC admits a natural decomposition into the negative maximum quasi-log-likelihood, a penalty on model dimensionality, and a penalty on model misspecification directly. Numerical studies demonstrate the advantage of the newly proposed SIC methodology for model selection in both correctly specified and misspecified models.
Submitted 11 May, 2016; v1 submitted 29 May, 2010;
originally announced May 2010.
-
Discussion of "Equi-energy sampler" by Kou, Zhou and Wong
Authors:
Yves F. Atchadé,
Jun S. Liu
Abstract:
We congratulate Samuel Kou, Qing Zhou and Wing Wong [math.ST/0507080] (referred to subsequently as KZW) for this beautifully written paper, which opens a new direction in Monte Carlo computation. This discussion has two parts. First, we describe a very closely related method, multicanonical sampling (MCS), and report a simulation example that compares the equi-energy (EE) sampler with MCS. Overall, we found the two algorithms to be of comparable efficiency for the simulation problem considered. In the second part, we develop some additional convergence results for the EE sampler.
Submitted 8 November, 2006;
originally announced November 2006.
-
Bayesian Clustering of Transcription Factor Binding Motifs
Authors:
Shane T. Jensen,
Jun S. Liu
Abstract:
Genes are often regulated in living cells by proteins called transcription factors (TFs) that bind directly to short segments of DNA in close proximity to specific genes. These binding sites have a conserved nucleotide appearance, which is called a motif. Several recent studies of transcriptional regulation require the reduction of a large collection of motifs into clusters based on the similarity of their nucleotide composition. We present a principled approach to this clustering problem based upon a Bayesian hierarchical model that accounts for both within- and between-motif variability. We use a Dirichlet process prior distribution that allows the number of clusters to vary and we also present a novel generalization that allows the core width of each motif to vary. This clustering model is implemented, using a Gibbs sampling strategy, on several collections of transcription factor motif matrices. Our clusters provide a means by which to organize transcription factors based on binding motif similarities, which can be used to reduce motif redundancy within large databases such as JASPAR and TRANSFAC. Finally, our clustering procedure has been used in combination with discovery of evolutionarily-conserved motifs to predict co-regulated genes. An alternative to our Dirichlet process prior distribution is explored but shows no substantive difference in the clustering results for our datasets. Our Bayesian clustering model based on the Dirichlet process has several advantages over traditional clustering methods that could make our procedure appropriate and useful for many clustering applications.
Submitted 21 October, 2006;
originally announced October 2006.