
Modifying Gibbs sampling to avoid self transitions

Abstract: Gibbs sampling repeatedly samples from the conditional distribution of one variable, x_i, given other variables, either choosing i randomly, or updating sequentially using some systematic or random order. When x_i is discrete, a Gibbs sampling update may choose a new value that is the same as the old value. A theorem of Peskun indicates that, when i is chosen randomly, a reversible method that reduces the probability of such self transitions, while increasing the probabilities of transitioning to each of the other values, will decrease the asymptotic variance of estimates. This has inspired two modified Gibbs sampling methods, originally due to Frigessi, et al and to Liu, though these do not always reduce self transitions to the minimum possible. Methods that do reduce the probability of self transitions to the minimum, but do not satisfy the conditions of Peskun's theorem, have also been devised, by Suwa and Todo. I review past methods, and introduce a broader class of reversible methods, based on what I call "antithetic modification", which also reduce asymptotic variance compared to Gibbs sampling, even when not satisfying the conditions of Peskun's theorem. A modification of one method in this class reduces self transitions to the minimum possible, while still always reducing asymptotic variance compared to Gibbs sampling. I introduce another new class of non-reversible methods based on slice sampling that can also minimize self transition probabilities.
I provide explicit, efficient implementations of all these methods, and compare their performance in simulations of a 2D Potts model, a Bayesian mixture model, and a belief network with unobserved variables. The non-reversibility produced by sequential updating can be beneficial, but no consistent benefit is seen from the individual updates being done by a non-reversible method.

Submitted 26 March, 2024; originally announced March 2024.
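
To make the idea of reducing self transitions concrete, here is a minimal sketch of the modified update usually attributed to Liu (often called Metropolized Gibbs), compared with ordinary Gibbs sampling for a single discrete variable. It is an illustration under stated assumptions, not the implementation from the paper: the function names and the example conditional distribution p are hypothetical, and the update is assumed to propose a value j different from the current value i with probability p[j]/(1-p[i]) and accept it with probability min(1, (1-p[i])/(1-p[j])).

```python
def gibbs_row(p, i):
    """Ordinary Gibbs update: the new value is drawn from p, whatever i is."""
    return list(p)


def metropolized_gibbs_row(p, i):
    """Transition probabilities out of value i for the Metropolized Gibbs update.

    Assumes every p[j] < 1, so that the proposal p[j] / (1 - p[i]) is defined.
    """
    row = [0.0] * len(p)
    for j in range(len(p)):
        if j != i:
            propose = p[j] / (1.0 - p[i])                   # never propose i itself
            accept = min(1.0, (1.0 - p[i]) / (1.0 - p[j]))  # Metropolis-Hastings ratio
            row[j] = propose * accept
    row[i] = 1.0 - sum(row)  # remaining mass is the self transition (a rejection)
    return row


# Hypothetical conditional distribution for one variable with three values.
p = [0.5, 0.3, 0.2]
G = [gibbs_row(p, i) for i in range(len(p))]
M = [metropolized_gibbs_row(p, i) for i in range(len(p))]

for i in range(len(p)):
    print(i, "Gibbs self:", G[i][i], "modified self:", round(M[i][i], 4))
```

In this example the self-transition probability for the most probable value stays well above zero even though no value has probability above 1/2, which is consistent with the abstract's remark that these earlier methods do not always reach the minimum possible.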

arXiv:2305.18268 [pdf, ps, other]

Efficiency of reversible MCMC methods: elementary derivations and applications to composite methods

Authors: Radford M. Neal, Jeffrey S. Rosenthal

Abstract: We review criteria for comparing the efficiency of Markov chain Monte Carlo (MCMC) methods with respect to the asymptotic variance of estimates of expectations of functions of state, and show how such criteria can justify ways of combining improvements to MCMC methods. We say that a chain on a finite state space with transition matrix $P$ efficiency-dominates one with transition matrix $Q$ if for every function of state it has lower (or equal) asymptotic variance. We give elementary proofs of some previous results regarding efficiency dominance, leading to a self-contained demonstration that a reversible chain with transition matrix $P$ efficiency-dominates a reversible chain with transition matrix $Q$ if and only if none of the eigenvalues of $Q-P$ are negative. This allows us to conclude that modifying a reversible MCMC method to improve its efficiency will also improve the efficiency of a method that randomly chooses either this or some other reversible method, and to conclude that improving the efficiency of a reversible update for one component of state (as in Gibbs sampling) will improve the overall efficiency of a reversible method that combines this and other updates. It also explains how antithetic MCMC can be more efficient than i.i.d. sampling. We also establish conditions that can guarantee that a method is not efficiency-dominated by any other method.

Submitted 27 March, 2024; v1 submitted 29 May, 2023; originally announced May 2023.

Comments: 24 pages

arXiv:2001.11950 [pdf, ps, other]

Non-reversibly updating a uniform [0,1] value for Metropolis accept/reject decisions

Authors: Radford M. Neal

Abstract: I show how it can be beneficial to express Metropolis accept/reject decisions in terms of comparison with a uniform [0,1] value, u, and to then update u non-reversibly, as part of the Markov chain state, rather than sampling it independently each iteration. This provides a small improvement for random walk Metropolis and Langevin updates in high dimensions. It produces a larger improvement when using Langevin updates with persistent momentum, giving performance comparable to that of Hamiltonian Monte Carlo (HMC) with long trajectories. This is of significance when some variables are updated by other methods, since if HMC is used, these updates can be done only between trajectories, whereas they can be done more often with Langevin updates. I demonstrate that for a problem with some continuous variables, updated by HMC or Langevin updates, and also discrete variables, updated by Gibbs sampling between updates of the continuous variables, Langevin with persistent momentum and non-reversible updates to u samples nearly a factor of two more efficiently than HMC. Benefits are also seen for a Bayesian neural network model in which hyperparameters are updated by Gibbs sampling.

Submitted 31 January, 2020; originally announced January 2020.

arXiv:1711.04399 [pdf, ps, other]

Circularly-Coupled Markov Chain Sampling

Authors: Radford M. Neal

Abstract: I show how to run an N-time-step Markov chain simulation in a circular fashion, so that the state at time 0 follows the state at time N-1 in the same way as states at times t follow those at times t-1 for 0<t<N. This wrap-around of the chain is achieved using a coupling procedure, and produces states that all have close to the equilibrium distribution of the Markov chain, under the assumption that coupled chains are likely to coalesce in less than N/2 iterations. This procedure therefore automatically eliminates the initial portion of the chain that would otherwise need to be discarded to get good estimates of equilibrium averages. The assumption of rapid coalescence can be tested using auxiliary chains started at times spaced between 0 and N. When multiple processors are available, such auxiliary chains can be simulated in parallel, and pieced together to give the circularly-coupled chain, in less time than a sequential simulation would have taken, provided that coalescence is indeed rapid. The practical utility of these procedures is dependent on the development of good coupling schemes. I show how a specialized random-grid Metropolis algorithm can be used to produce the required exact coalescence. On its own, this method is not efficient in high dimensions, but it can be used to produce exact coalescence once other methods have brought the coupled chains close together. I investigate how well this combined scheme works with standard Metropolis, Langevin, and Gibbs sampling updates.
Using such strategies, I show that circular coupling can work effectively in a Bayesian logistic regression problem.

Submitted 12 November, 2017; originally announced November 2017.

arXiv:1602.06030 [pdf, other]

Sampling latent states for high-dimensional non-linear state space models with the embedded HMM method

Authors: Alexander Y. Shestopaloff, Radford M. Neal

Abstract: We propose a new scheme for selecting pool states for the embedded Hidden Markov Model (HMM) Markov Chain Monte Carlo (MCMC) method. This new scheme allows the embedded HMM method to be used for efficient sampling in state space models where the state can be high-dimensional. Previously, embedded HMM methods were only applied to models with a one-dimensional state space. We demonstrate that using our proposed pool state selection scheme, an embedded HMM sampler can have similar performance to a well-tuned sampler that uses a combination of Particle Gibbs with Backward Sampling (PGBS) and Metropolis updates. The scaling to higher dimensions is made possible by selecting pool states locally near the current value of the state sequence. The proposed pool state selection scheme also allows each iteration of the embedded HMM sampler to take time linear in the number of the pool states, as opposed to quadratic as in the original embedded HMM sampler. We also consider a model with a multimodal posterior, and show how a technique we term "mirroring" can be used to efficiently move between the modes.

Submitted 11 July, 2016; v1 submitted 18 February, 2016; originally announced February 2016.

Comments: Revision has some changes to the paper, and now includes the program used as ancillary information

arXiv:1505.05571 [pdf, other]

Fast exact summation using small and large superaccumulators

Authors: Radford M. Neal

Abstract: I present two new methods for exactly summing a set of floating-point numbers, and then correctly rounding to the nearest floating-point number. Higher accuracy than simple summation (rounding after each addition) is important in many applications, such as finding the sample mean of data. Exact summation also guarantees identical results with parallel and serial implementations, since the exact sum is independent of order. The new methods use variations on the concept of a "superaccumulator" - a large fixed-point number that can exactly represent the sum of any reasonable number of floating-point values. One method uses a "small" superaccumulator with sixty-seven 64-bit chunks, each with 32-bit overlap with the next chunk, allowing carry propagation to be done infrequently. The small superaccumulator is used alone when summing a small number of terms. For big summations, a "large" superaccumulator is used as well. It consists of 4096 64-bit chunks, one for every possible combination of exponent bits and sign bit, plus counts of when each chunk needs to be transferred to the small superaccumulator. To add a term to the large superaccumulator, only a single chunk and its associated count need to be updated, which takes very few instructions if carefully implemented. On modern 64-bit processors, exactly summing a large array using this combination of large and small superaccumulators takes less than twice the time of simple, inexact, ordered summation, with a serial implementation.
A parallel implementation using a small number of processor cores can be expected to perform exact summation of large arrays at a speed that reaches the limit imposed by memory bandwidth. Some common methods that attempt to improve accuracy without being exact may therefore be pointless, at least for large summations, since they are slower than computing the sum exactly.

Submitted 20 May, 2015; originally announced May 2015.

ACM Class: G.1.0
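
The order dependence of simple summation, and the order independence of an exactly computed sum, can be seen in a few lines. The sketch below does not implement the superaccumulator methods of the paper. It stands in an arbitrary-precision rational accumulator (Python's fractions.Fraction) for the large fixed-point number, which is far slower but gives the same correctly rounded result for finite inputs. The helper names naive_sum and exact_sum and the test data are hypothetical; math.fsum is the standard library's accurate summation routine, shown for comparison.

```python
import math
from fractions import Fraction


def naive_sum(values):
    """Simple summation: the running total is rounded after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


def exact_sum(values):
    """Accumulate exactly, then round once to the nearest float.

    Fraction(v) represents a finite float exactly, and float() of a Fraction
    performs a single correctly rounded division. Infinities and NaNs are not
    handled in this sketch.
    """
    acc = Fraction(0)
    for v in values:
        acc += Fraction(v)
    return float(acc)


a = [1e16, 1.0, -1e16]   # the 1.0 is lost when added to 1e16
b = [1e16, -1e16, 1.0]   # same numbers, different order

print(naive_sum(a), naive_sum(b))   # results differ with order
print(exact_sum(a), exact_sum(b))   # results agree
print(math.fsum(a), math.fsum(b))   # standard-library accurate sum also agrees
```

The rational accumulator is chosen here only because it is short and obviously exact; the point of the paper is that a carefully laid-out fixed-point accumulator achieves the same result at close to the speed of simple summation.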

arXiv:1504.02914 [pdf, other]

Representing numeric data in 32 bits while preserving 64-bit precision

Authors: Radford M. Neal

Abstract: Data files often consist of numbers having only a few significant decimal digits, whose information content would allow storage in only 32 bits. However, we may require that arithmetic operations involving these numbers be done with 64-bit floating-point precision, which precludes simply representing the data as 32-bit floating-point values. Decimal floating point gives a compact and exact representation, but requires conversion with a slow division operation before it can be used. Here, I show that interesting subsets of 64-bit floating-point values can be compactly and exactly represented by the 32 bits consisting of the sign, exponent, and high-order part of the mantissa, with the lower-order 32 bits of the mantissa filled in by table lookup, indexed by bits from the part of the mantissa retained, and possibly from the exponent. For example, decimal data with 4 or fewer digits to the left of the decimal point and 2 or fewer digits to the right of the decimal point can be represented in this way using the lower-order 5 bits of the retained part of the mantissa as the index. Data consisting of 6 decimal digits with the decimal point in any of the 7 positions before or after one of the digits can also be represented this way, and decoded using 19 bits from the mantissa and exponent as the index. Encoding with such a scheme is a simple copy of half the 64-bit value, followed if necessary by verification that the value can be represented, by checking that it decodes correctly. Decoding requires only extraction of index bits and a table lookup.
Lookup in a small table will usually reference cache; even with larger tables, decoding is still faster than conversion from decimal floating point with a division operation. I discuss how such schemes perform on recent computer systems, and how they might be used to automatically compress large arrays in interpretive languages such as R.

Submitted 11 April, 2015; originally announced April 2015.

arXiv:1412.3013 [pdf, other]

Efficient Bayesian inference for stochastic volatility models with ensemble MCMC methods

Authors: Alexander Y. Shestopaloff, Radford M. Neal

Abstract: In this paper, we introduce efficient ensemble Markov Chain Monte Carlo (MCMC) sampling methods for Bayesian computations in the univariate stochastic volatility model. We compare the performance of our ensemble MCMC methods with an improved version of a recent sampler of Kastner and Fruwirth-Schnatter (2014). We show that ensemble samplers are more efficient than this state of the art sampler by a factor of about 3.1, on a data set simulated from the stochastic volatility model. This performance gain is achieved without the ensemble MCMC sampler relying on the assumption that the latent process is linear and Gaussian, unlike the sampler of Kastner and Fruwirth-Schnatter.

Submitted 9 December, 2014; originally announced December 2014.

arXiv:1401.5548 [pdf, ps, other]

On Bayesian inference for the M/G/1 queue with efficient MCMC sampling

Authors: Alexander Y. Shestopaloff, Radford M. Neal

Abstract: We introduce an efficient MCMC sampling scheme to perform Bayesian inference in the M/G/1 queueing model given only observations of interdeparture times. Our MCMC scheme uses a combination of Gibbs sampling and simple Metropolis updates together with three novel "shift" and "scale" updates. We show that our novel updates improve the speed of sampling considerably, by factors of about 60 to about 180 on a variety of simulated data sets.

Submitted 21 January, 2014; originally announced January 2014.

arXiv:1305.2235 [pdf, other]

MCMC methods for Gaussian process models using fast approximations for the likelihood

Authors: Chunyi Wang, Radford M. Neal

Abstract: Gaussian Process (GP) models are a powerful and flexible tool for non-parametric regression and classification. Computation for GP models is intensive, since computing the posterior density, $π$, for covariance function parameters requires computation of the covariance matrix, C, a $pn^2$ operation, where p is the number of covariates and n is the number of training cases, and then inversion of C, an $n^3$ operation. We introduce MCMC methods based on the "temporary mapping and caching" framework, using a fast approximation, $π^*$, as the distribution needed to construct the temporary space. We propose two implementations under this scheme: "mapping to a discretizing chain", and "mapping with tempered transitions", both of which are exactly correct MCMC methods for sampling $π$, even though their transitions are constructed using an approximation. These methods are equivalent when their tuning parameters are set at the simplest values, but differ in general. We compare how well these methods work when using several approximations, finding on synthetic datasets that a $π^*$ based on the "Subset of Data" (SOD) method is almost always more efficient than standard MCMC using only $π$. On some datasets, a more sophisticated $π^*$ based on the "Nyström-Cholesky" method works better than SOD.

Submitted 9 May, 2013; originally announced May 2013.

arXiv:1305.0320 [pdf, other]

MCMC for non-linear state space models using ensembles of latent sequences

Authors: Alexander Y. Shestopaloff, Radford M. Neal

Abstract: Non-linear state space models are a widely-used class of models for biological, economic, and physical processes. Fitting these models to observed data is a difficult inference problem that has no straightforward solution. We take a Bayesian approach to the inference of unknown parameters of a non-linear state model; this, in turn, requires the availability of efficient Markov Chain Monte Carlo (MCMC) sampling methods for the latent (hidden) variables and model parameters. Using the ensemble technique of Neal (2010) and the embedded HMM technique of Neal (2003), we introduce a new Markov Chain Monte Carlo method for non-linear state space models. The key idea is to perform parameter updates conditional on an enormously large ensemble of latent sequences, as opposed to a single sequence, as with existing methods. We look at the performance of this ensemble method when doing Bayesian inference in the Ricker model of population dynamics. We show that for this problem, the ensemble method is vastly more efficient than a simple Metropolis method, as well as 1.9 to 12.0 times more efficient than a single-sequence embedded HMM method, when all methods are tuned appropriately. We also introduce a way of speeding up the ensemble method by performing partial backward passes to discard poor proposals at low computational cost, resulting in a final efficiency gain of 3.4 to 20.4 times over the single-sequence method.

Submitted 1 May, 2013; originally announced May 2013.

arXiv:1301.3861 [pdf]

Inference for Belief Networks Using Coupling From the Past

Authors: Michael Harvey, Radford M. Neal

Abstract: Inference for belief networks using Gibbs sampling produces a distribution for unobserved variables that differs from the correct distribution by a (usually) unknown error, since convergence to the right distribution occurs only asymptotically. The method of "coupling from the past" samples from exactly the correct distribution by (conceptually) running dependent Gibbs sampling simulations from every possible starting state from a time far enough in the past that all runs reach the same state at time t=0. Explicitly considering every possible state is intractable for large networks, however. We propose a method for layered noisy-or networks that uses a compact, but often imprecise, summary of a set of states. This method samples from exactly the correct distribution, and requires only about twice the time per step as ordinary Gibbs sampling, but it may require more simulation steps than would be needed if chains were tracked exactly.

Submitted 16 January, 2013; originally announced January 2013.

Comments: Appears in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI2000)

Report number: UAI-P-2000-PG-256-263

arXiv:1212.6246 [pdf, other]

Gaussian Process Regression with Heteroscedastic or Non-Gaussian Residuals

Authors: Chunyi Wang, Radford M. Neal

Abstract: Gaussian Process (GP) regression models typically assume that residuals are Gaussian and have the same variance for all observations. However, applications with input-dependent noise (heteroscedastic residuals) frequently arise in practice, as do applications in which the residuals do not have a Gaussian distribution. In this paper, we propose a GP Regression model with a latent variable that serves as an additional unobserved covariate for the regression. This model (which we call GPLC) allows for heteroscedasticity since it allows the function to have a changing partial derivative with respect to this unobserved covariate. With a suitable covariance function, our GPLC model can handle (a) Gaussian residuals with input-dependent variance, or (b) non-Gaussian residuals with input-dependent variance, or (c) Gaussian residuals with constant variance. We compare our model, using synthetic datasets, with a model proposed by Goldberg, Williams and Bishop (1998), which we refer to as GPLV, which only deals with case (a), as well as a standard GP model which can handle only case (c). Markov Chain Monte Carlo methods are developed for both models. Experiments show that when the data is heteroscedastic, both GPLC and GPLV give better results (smaller mean squared error and negative log-probability density) than standard GP regression. In addition, when the residuals are Gaussian, our GPLC model is generally nearly as good as GPLV, while when the residuals are non-Gaussian, our GPLC model is better than GPLV.

Submitted 26 December, 2012; originally announced December 2012.

arXiv:1206.1901 [pdf, ps, other]

doi 10.1201/b10905

MCMC using Hamiltonian dynamics

Authors: Radford M. Neal

Abstract: Hamiltonian dynamics can be used to produce distant proposals for the Metropolis algorithm, thereby avoiding the slow exploration of the state space that results from the diffusive behaviour of simple random-walk proposals. Though originating in physics, Hamiltonian dynamics can be applied to most problems with continuous state spaces by simply introducing fictitious "momentum" variables. A key to its usefulness is that Hamiltonian dynamics preserves volume, and its trajectories can thus be used to define complex mappings without the need to account for a hard-to-compute Jacobian factor - a property that can be exactly maintained even when the dynamics is approximated by discretizing time. In this review, I discuss theoretical and practical aspects of Hamiltonian Monte Carlo, and present some of its variations, including using windows of states for deciding on acceptance or rejection, computing trajectories using fast approximations, tempering during the course of a trajectory to handle isolated modes, and short-cut methods that prevent useless trajectories from taking much computation time.

Submitted 8 June, 2012; originally announced June 2012.
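
The basic scheme described in this abstract (introduce a momentum variable, follow discretized Hamiltonian dynamics, then accept or reject) can be sketched for a one-dimensional target as follows. This is a minimal illustration, not code from the review: it assumes the standard leapfrog discretization with a unit-mass Gaussian momentum, and the function names, the standard normal target (U and grad_U), the stepsize, and the trajectory length are all hypothetical choices.

```python
import math
import random


def leapfrog(q, p, grad_U, eps, n_steps):
    """Follow a leapfrog trajectory of n_steps steps with stepsize eps."""
    p = p - 0.5 * eps * grad_U(q)          # initial half step for momentum
    for step in range(n_steps):
        q = q + eps * p                    # full step for position
        if step != n_steps - 1:
            p = p - eps * grad_U(q)        # full step for momentum
    p = p - 0.5 * eps * grad_U(q)          # final half step for momentum
    return q, p


def hmc_update(q, U, grad_U, eps, n_steps, rng):
    """One HMC update: fresh momentum, one trajectory, Metropolis accept/reject."""
    p = rng.gauss(0.0, 1.0)
    h_old = U(q) + 0.5 * p * p
    q_new, p_new = leapfrog(q, p, grad_U, eps, n_steps)
    h_new = U(q_new) + 0.5 * p_new * p_new
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new
    return q


def U(q):
    """Potential energy: minus the log density of a standard normal, up to a constant."""
    return 0.5 * q * q


def grad_U(q):
    return q


rng = random.Random(0)
q = 3.0
samples = []
for _ in range(2000):
    q = hmc_update(q, U, grad_U, eps=0.1, n_steps=20, rng=rng)
    samples.append(q)
print(sum(samples) / len(samples))
```

Running a trajectory, negating the momentum, and running it again returns to the starting point up to rounding error, which is a quick check of the reversibility the method relies on.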

arXiv:1205.0070 [pdf, ps, other]

How to view an MCMC simulation as a permutation, with applications to parallel simulation and improved importance sampling

Authors: Radford M. Neal

Abstract: Consider a Markov chain defined on a finite state space, X, that leaves invariant the uniform distribution on X, and whose transition probabilities are integer multiples of 1/Q, for some integer Q. I show how a simulation of n transitions of this chain starting at x_0 can be viewed as applying a random permutation on the space XxU, where U={0,1,...,Q-1}, to the start state (x_0,u_0), with u_0 drawn uniformly from U. This result can be applied to a non-uniform distribution with probabilities that are integer multiples of 1/P, for some integer P, by representing it as the marginal distribution for X from the uniform distribution on a suitably-defined subset of XxY, where Y={0,1,...,P-1}. By letting Q, P, and the cardinality of X go to infinity, this result can be generalized to non-rational probabilities and to continuous state spaces, with permutations on a finite space replaced by volume-preserving one-to-one maps from a continuous space to itself. These constructions can be efficiently implemented for chains commonly used in Markov chain Monte Carlo (MCMC) simulations. I present two applications in this context - simulation of K realizations of a chain from K initial states, but with transitions defined by a single stream of random numbers, as may be efficient with a vector processor or multiple processors, and use of MCMC to improve an importance sampling distribution that already has substantial overlap with the distribution of interest.
I also discuss the implications of this "permutation MCMC" method regarding the role of randomness in MCMC simulation, and the potential use of non-random and quasi-random numbers.

Submitted 30 April, 2012; originally announced May 2012.

Report number: Tech. Rep. No. 1201, Dept. of Statistics, University of Toronto

arXiv:1106.5941 [pdf, other]

Split Hamiltonian Monte Carlo

Authors: Babak Shahbaba, Shiwei Lan, Wesley O. Johnson, Radford M. Neal

Abstract: We show how the Hamiltonian Monte Carlo algorithm can sometimes be speeded up by "splitting" the Hamiltonian in a way that allows much of the movement around the state space to be done at low computational cost. One context where this is possible is when the log density of the distribution of interest (the potential energy function) can be written as the log of a Gaussian density, which is a quadratic function, plus a slowly varying function. Hamiltonian dynamics for quadratic energy functions can be analytically solved. With the splitting technique, only the slowly-varying part of the energy needs to be handled numerically, and this can be done with a larger stepsize (and hence fewer steps) than would be necessary with a direct simulation of the dynamics. Another context where splitting helps is when the most important terms of the potential energy function and its gradient can be evaluated quickly, with only a slowly-varying part requiring costly computations. With splitting, the quick portion can be handled with a small stepsize, while the costly portion uses a larger stepsize. We show that both of these splitting approaches can reduce the computational cost of sampling from the posterior distribution for a logistic regression model, using either a Gaussian approximation centered on the posterior mode, or a Hamiltonian split into a term that depends on only a small number of critical cases, and another term that involves the larger number of cases whose influence on the posterior distribution is small.
Supplemental materials for this paper are available online.

Submitted 14 July, 2012; v1 submitted 29 June, 2011; originally announced June 2011.

arXiv:1106.0237 [pdf, ps]

doi 10.1613/jair.689

On Deducing Conditional Independence from d-Separation in Causal Graphs with Feedback (Research Note)

Authors: R. M. Neal

Abstract: Pearl and Dechter (1996) claimed that the d-separation criterion for conditional independence in acyclic causal networks also applies to networks of discrete variables that have feedback cycles, provided that the variables of the system are uniquely determined by the random disturbances. I show by example that this is not true in general. Some condition stronger than uniqueness is needed, such as the existence of a causal dynamics guaranteed to lead to the unique solution.

Submitted 1 June, 2011; originally announced June 2011.

Journal ref: Journal Of Artificial Intelligence Research, Volume 12, pages 87-91, 2000

arXiv:1101.0387 [pdf, ps, other]

MCMC Using Ensembles of States for Problems with Fast and Slow Variables such as Gaussian Process Regression

Authors: Radford M. Neal

Abstract: I introduce a Markov chain Monte Carlo (MCMC) scheme in which sampling from a distribution with density pi(x) is done using updates operating on an "ensemble" of states. The current state x is first stochastically mapped to an ensemble, x^{(1)},...,x^{(K)}. This ensemble is then updated using MCMC updates that leave invariant a suitable ensemble density, rho(x^{(1)},...,x^{(K)}), defined in terms of pi(x^{(i)}) for i=1,...,K. Finally a single state is stochastically selected from the ensemble after these updates. Such ensemble MCMC updates can be useful when characteristics of pi and the ensemble permit pi(x^{(i)}) for all i in {1,...,K}, to be computed in less than K times the amount of computation time needed to compute pi(x) for a single x. One common situation of this type is when changes to some "fast" variables allow for quick re-computation of the density, whereas changes to other "slow" variables do not. Gaussian process regression models are an example of this sort of problem, with an overall scaling factor for covariances and the noise variance being fast variables. I show that ensemble MCMC for Gaussian process regression models can indeed substantially improve sampling performance. Finally, I discuss other possible applications of ensemble MCMC, and its relationship to the "multiple-try Metropolis" method of Liu, Liang, and Wong and the "multiset sampler" of Leman, Chen, and Lavine.

Submitted 2 January, 2011; originally announced January 2011.

arXiv:1011.4722 [pdf, other]

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method

Authors: Madeleine B. Thompson, Radford M. Neal

Abstract: The shrinking rank method is a variation of slice sampling that is efficient at sampling from multivariate distributions with highly correlated parameters. It requires that the gradient of the log-density be computable. At each individual step, it approximates the current slice with a Gaussian occupying a shrinking-dimension subspace. The dimension of the approximation is shrunk orthogonally to the gradient at rejected proposals, since the gradients at points outside the current slice tend to point towards the slice. This causes the proposal distribution to converge rapidly to an estimate of the longest axis of the slice, resulting in states that are less correlated than those generated by related methods. After describing the method, we compare it to two other methods on several distributions and obtain favorable results.

Submitted 21 November, 2010; originally announced November 2010.

ACM Class: G.3

Journal ref: Proceedings of the 2010 Joint Statistical Meetings, Section on Statistical Computing, pages 3890-3896

arXiv:1003.3201 [pdf, other]

Covariance-Adaptive Slice Sampling

Authors: Madeleine Thompson, Radford M. Neal

Abstract: We describe two slice sampling methods for taking multivariate steps using the crumb framework. These methods use the gradients at rejected proposals to adapt to the local curvature of the log-density surface, a technique that can produce much better proposals when parameters are highly correlated. We evaluate our methods on four distributions and compare their performance to that of a non-adaptive slice sampling method and a Metropolis method. The adaptive methods perform favorably on low-dimensional target distributions with highly-correlated parameters.

Submitted 16 March, 2010; originally announced March 2010.

Report number: Tech. Rep. 1002, Dept. of Statistics, Univ. of Toronto

MSC Class: 65C05

arXiv:0711.4983 [pdf, ps, other]

doi 10.1214/08-BA330

A Method for Compressing Parameters in Bayesian Models with Application to Logistic Sequence Prediction Models

Authors: Longhai Li, Radford M. Neal

Abstract: Bayesian classification and regression with high order interactions is largely infeasible because Markov chain Monte Carlo (MCMC) would need to be applied with a great many parameters, whose number increases rapidly with the order. In this paper we show how to make it feasible by effectively reducing the number of parameters, exploiting the fact that many interactions have the same values for all training cases. Our method uses a single ``compressed'' parameter to represent the sum of all parameters associated with a set of patterns that have the same value for all training cases. Using symmetric stable distributions as the priors of the original parameters, we can easily find the priors of these compressed parameters. We therefore need to deal only with a much smaller number of compressed parameters when training the model with MCMC. The number of compressed parameters may have converged before considering the highest possible order. After training the model, we can split these compressed parameters into the original ones as needed to make predictions for test cases. We show in detail how to compress parameters for logistic sequence prediction models. Experiments on both simulated and real data demonstrate that a huge number of parameters can indeed be reduced by our compression method.

Submitted 30 November, 2007; originally announced November 2007.

Comments: 29 pages

Journal ref: Bayesian Analysis, 2008, 3(4), 793-822

arXiv:math/0703292 [pdf, ps, other]

Nonlinear Models Using Dirichlet Process Mixtures

Authors: Babak Shahbaba, Radford M. Neal

Abstract: We introduce a new nonlinear model for classification, in which we model the joint distribution of response variable, y, and covariates, x, non-parametrically using Dirichlet process mixtures. We keep the relationship between y and x linear within each component of the mixture. The overall relationship becomes nonlinear if the mixture contains more than one component. We use simulated data to compare the performance of this new approach to a simple multinomial logit (MNL) model, an MNL model with quadratic terms, and a decision tree model. We also evaluate our approach on a protein fold classification problem, and find that our model provides substantial improvement over previous methods, which were based on Neural Networks (NN) and Support Vector Machines (SVM). Folding classes of protein have a hierarchical structure. We extend our method to classification problems where a class hierarchy is available. We find that using the prior information regarding the hierarchical structure of protein folds can result in higher predictive accuracy.

Submitted 10 March, 2007; originally announced March 2007.

MSC Class: 62H30

arXiv:math/0702591 [pdf, ps, other]

A Method for Avoiding Bias from Feature Selection with Application to Naive Bayes Classification Models

Authors: Longhai Li, Jianguo Zhang, Radford M. Neal

Abstract: For many classification and regression problems, a large number of features are available for possible use - this is typical of DNA microarray data on gene expression, for example. Often, for computational or other reasons, only a small subset of these features are selected for use in a model, based on some simple measure such as correlation with the response variable. This procedure may introduce an optimistic bias, however, in which the response variable appears to be more predictable than it actually is, because the high correlation of the selected features with the response may be partly or wholly due to chance. We show how this bias can be avoided when using a Bayesian model for the joint distribution of features and response. The crucial insight is that even if we forget the exact values of the unselected features, we should retain, and condition on, the knowledge that their correlation with the response was too small for them to be selected. In this paper we describe how this idea can be implemented for ``naive Bayes'' models of binary data. Experiments with simulated data confirm that this method avoids bias due to feature selection. We also apply the naive Bayes model to subsets of data relating gene expression to colon cancer, and find that correcting for bias from feature selection does improve predictive performance.

Submitted 20 February, 2007; originally announced February 2007.

MSC Class: 62H30

arXiv:math/0608592 [pdf, ps, other]

Puzzles of Anthropic Reasoning Resolved Using Full Non-indexical Conditioning

Authors: Radford M. Neal

Abstract: I consider the puzzles arising from four interrelated problems involving `anthropic' reasoning, and in particular the `Self-Sampling Assumption' (SSA) - that one should reason as if one were randomly chosen from the set of all observers in a suitable reference class. The problem of Freak Observers might appear to force acceptance of SSA if any empirical evidence is to be credited. The Sleeping Beauty problem arguably shows that one should also accept the `Self-Indication Assumption' (SIA) - that one should take one's own existence as evidence that the number of observers is more likely to be large than small. But this assumption produces apparently absurd results in the Presumptuous Philosopher problem. Without SIA, however, a definitive refutation of the counterintuitive Doomsday Argument seems difficult. I show that these problems are satisfyingly resolved by applying the principle that one should always condition on all evidence - not just on the fact that you are an intelligent observer, or that you are human, but on the fact that you are a human with a specific set of memories. This `Full Non-indexical Conditioning' (FNC) approach usually produces the same results as assuming both SSA and SIA, with a sufficiently broad reference class, while avoiding their ad hoc aspects. I argue that the results of FNC are correct using the device of hypothetical ``companion'' observers, whose existence clarifies what principles of reasoning are valid. I conclude by discussing how one can use FNC to infer how densely we should expect intelligent species to occur, and by examining recent anthropic arguments in inflationary and string theory cosmology.

Submitted 23 August, 2006; originally announced August 2006.

arXiv:q-bio/0605015 [pdf, ps, other]

Gene Function Classification Using Bayesian Models with Hierarchy-Based Priors

Authors: Babak Shahbaba, Radford M. Neal

Abstract: We investigate the application of hierarchical classification schemes to the annotation of gene function based on several characteristics of protein sequences including phylogenic descriptors, sequence based attributes, and predicted secondary structure. We discuss three Bayesian models and compare their performance in terms of predictive accuracy. These models are the ordinary multinomial logit (MNL) model, a hierarchical model based on a set of nested MNL models, and an MNL model with a prior that introduces correlations between the parameters for classes that are nearby in the hierarchy. We also provide a new scheme for combining different sources of information. We use these models to predict the functional class of Open Reading Frames (ORFs) from the E. coli genome. The results from all three models show substantial improvement over previous methods, which were based on the C5 algorithm. The MNL model using a prior based on the hierarchy outperforms both the non-hierarchical MNL model and the nested MNL model. In contrast to previous attempts at combining these sources of information, our approach results in a higher accuracy rate when compared to models that use each data source alone. Together, these results show that gene function can be predicted with higher accuracy than previously achieved, using Bayesian models that incorporate suitable prior information.

Submitted 10 May, 2006; originally announced May 2006.

arXiv:math/0511216 [pdf, ps, other]

Estimating Ratios of Normalizing Constants Using Linked Importance Sampling

Authors: Radford M. Neal

Abstract: Ratios of normalizing constants for two distributions are needed in both Bayesian statistics, where they are used to compare models, and in statistical physics, where they correspond to differences in free energy. Two approaches have long been used to estimate ratios of normalizing constants. The `simple importance sampling' (SIS) or `free energy perturbation' method uses a sample drawn from just one of the two distributions. The `bridge sampling' or `acceptance ratio' estimate can be viewed as the ratio of two SIS estimates involving a bridge distribution. For both methods, difficult problems must be handled by introducing a sequence of intermediate distributions linking the two distributions of interest, with the final ratio of normalizing constants being estimated by the product of estimates of ratios for adjacent distributions in this sequence. Recently, work by Jarzynski, and independently by Neal, has shown how one can view such a product of estimates, each based on simple importance sampling using a single point, as an SIS estimate on an extended state space. This `Annealed Importance Sampling' (AIS) method produces an exactly unbiased estimate for the ratio of normalizing constants even when the Markov transitions used do not reach equilibrium. In this paper, I show how a corresponding `Linked Importance Sampling' (LIS) method can be constructed in which the estimates for individual ratios are similar to bridge sampling estimates. I show empirically that for some problems, LIS estimates are much more accurate than AIS estimates found using the same computation time, although for other problems the two methods have similar performance. Linked sampling methods similar to LIS are useful for other purposes as well.

Submitted 8 November, 2005; originally announced November 2005.
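
The two classical estimators named in the abstract above rest on simple identities, written out here as a reading aid. The notation is an assumption of this note, not taken from the listing: f_0 and f_1 are unnormalized densities with normalizing constants Z_0 and Z_1, p_i = f_i / Z_i, and f_* is an unnormalized bridge density whose support lies within that of both f_0 and f_1.

```latex
% Simple importance sampling (SIS): average f_1/f_0 over draws from p_0
% (requires f_0 > 0 wherever f_1 > 0).
\frac{Z_1}{Z_0} \;=\; \mathbb{E}_{p_0}\!\left[\frac{f_1(x)}{f_0(x)}\right]

% Bridge sampling: a ratio of two SIS-style expectations through f_*.
\frac{Z_1}{Z_0}
  \;=\;
  \frac{\mathbb{E}_{p_0}\!\left[\, f_*(x) / f_0(x) \,\right]}
       {\mathbb{E}_{p_1}\!\left[\, f_*(x) / f_1(x) \,\right]}
```

Both follow from writing each expectation as an integral: the numerator of the bridge identity equals Z_*/Z_0 and the denominator equals Z_*/Z_1, so Z_* cancels.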

arXiv:math/0510449 [pdf, ps, other]

Improving Classification When a Class Hierarchy is Available Using a Hierarchy-Based Prior

Authors: Babak Shahbaba, Radford M. Neal

Abstract: We introduce a new method for building classification models when we have prior knowledge of how the classes can be arranged in a hierarchy, based on how easily they can be distinguished. The new method uses a Bayesian form of the multinomial logit (MNL, a.k.a. ``softmax'') model, with a prior that introduces correlations between the parameters for classes that are nearby in the tree. We compare the performance on simulated data of the new method, the ordinary MNL model, and a model that uses the hierarchy in a different way. We also test the new method on a document labelling problem, and find that it performs better than the other methods, particularly when the amount of training data is small.

Submitted 20 October, 2005; originally announced October 2005.

arXiv:math/0508060 [pdf, ps, other]

The Short-Cut Metropolis Method

Authors: Radford M. Neal

Abstract: I show how one can modify the random-walk Metropolis MCMC method in such a way that a sequence of modified Metropolis updates takes little computation time when the rejection rate is outside a desired interval. This allows one to effectively adapt the scale of the Metropolis proposal distribution, by performing several such "short-cut" Metropolis sequences with varying proposal stepsizes. Unlike other adaptive Metropolis schemes, this method converges to the correct distribution in the same fashion as the standard Metropolis method.

Submitted 2 August, 2005; originally announced August 2005.

arXiv:math/0502099 [pdf, ps, other]

Taking Bigger Metropolis Steps by Dragging Fast Variables

Authors: Radford M. Neal

Abstract: I show how Markov chain sampling with the Metropolis-Hastings algorithm can be modified so as to take bigger steps when the distribution being sampled from has the characteristic that its density can be quickly recomputed for a new point if this point differs from a previous point only with respect to a subset of 'fast' variables. I show empirically that when using this method, the efficiency of sampling for the remaining 'slow' variables can approach what would be possible using Metropolis updates based on the marginal distribution for the slow variables.

Submitted 6 February, 2005; originally announced February 2005.

MSC Class: 65C05; 65C60

arXiv:math/0407281 [pdf, ps, other]

Improving Asymptotic Variance of MCMC Estimators: Non-reversible Chains are Better

Authors: Radford M. Neal

Abstract: I show how any reversible Markov chain on a finite state space that is irreducible, and hence suitable for estimating expectations with respect to its invariant distribution, can be used to construct a non-reversible Markov chain on a related state space that can also be used to estimate these expectations, with asymptotic variance at least as small as that using the reversible chain (typically smaller). The non-reversible chain achieves this improvement by avoiding (to the extent possible) transitions that backtrack to the state from which the chain just came. The proof that this modification cannot increase the asymptotic variance of an MCMC estimator uses a new technique that can also be used to prove Peskun's (1973) theorem that modifying a reversible chain to reduce the probability of staying in the same state cannot increase asymptotic variance. A non-reversible chain that avoids backtracking will often take little or no more computation time per transition than the original reversible chain, and can sometimes produce a large reduction in asymptotic variance, though for other chains the improvement is slight. In addition to being of some practical interest, this construction demonstrates that non-reversible chains have a fundamental advantage over reversible chains for MCMC estimation. Research into better MCMC methods may therefore best be focused on non-reversible chains.

Submitted 15 July, 2004; originally announced July 2004.

arXiv:math/0305039 [pdf, ps, other]

Markov Chain Sampling for Non-linear State Space Models Using Embedded Hidden Markov Models

Authors: Radford M. Neal

Abstract: I describe a new Markov chain method for sampling from the distribution of the state sequences in a non-linear state space model, given the observation sequence. This method updates all states in the sequence simultaneously using an embedded Hidden Markov model (HMM). An update begins with the creation of a ``pool'' of K states at each time, by applying some Markov chain update to the current state. These pools define an embedded HMM whose states are indexes within this pool. Using the forward-backward dynamic programming algorithm, we can then efficiently choose a state sequence at random with the appropriate probabilities from the exponentially large number of state sequences that pass through states in these pools. I show empirically that when states at nearby times are strongly dependent, embedded HMM sampling can perform better than Metropolis methods that update one state at a time.

Submitted 1 May, 2003; originally announced May 2003.

arXiv:physics/0009028 [pdf, ps, other]

Slice Sampling

Authors: Radford M. Neal

Abstract: Markov chain sampling methods that automatically adapt to characteristics of the distribution being sampled can be constructed by exploiting the principle that one can sample from a distribution by sampling uniformly from the region under the plot of its density function. A Markov chain that converges to this uniform distribution can be constructed by alternating uniform sampling in the vertical direction with uniform sampling from the horizontal `slice' defined by the current vertical position, or more generally, with some update that leaves the uniform distribution over this slice invariant. Variations on such `slice sampling' methods are easily implemented for univariate distributions, and can be used to sample from a multivariate distribution by updating each variable in turn. This approach is often easier to implement than Gibbs sampling, and more efficient than simple Metropolis updates, due to the ability of slice sampling to adaptively choose the magnitude of changes made. It is therefore attractive for routine and automated use. Slice sampling methods that update all variables simultaneously are also possible. These methods can adaptively choose the magnitudes of changes made to each variable, based on the local properties of the density function. More ambitiously, such methods could potentially allow the sampling to adapt to dependencies between variables by constructing local quadratic approximations. Another approach is to improve sampling efficiency by suppressing random walks. This can be done using `overrelaxed' versions of univariate slice sampling procedures, or by using `reflective' multivariate slice sampling methods, which bounce off the edges of the slice.

Submitted 7 September, 2000; originally announced September 2000.

Comments: 40 pages. Written for statisticians, but of interest to physicists who use Monte Carlo methods
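
As an illustration of the univariate scheme the abstract above describes (a uniform vertical draw followed by an update that leaves the uniform distribution over the horizontal slice invariant), here is a minimal Python sketch. The abstract does not say how the slice is located; the "stepping out" and "shrinkage" procedures used here are one common choice and are an assumption of this sketch, as are the function name `slice_sample_1d` and its parameters `width` and `max_steps`.

```python
import math
import random


def slice_sample_1d(log_f, x0, n_samples, width=1.0, max_steps=20, rng=None):
    """Univariate slice sampling sketch; log_f is an unnormalized log density."""
    rng = rng or random.Random()
    x, samples = x0, []
    for _ in range(n_samples):
        # Vertical move: uniform height under f(x), handled in log space.
        log_y = log_f(x) + math.log(1.0 - rng.random())

        # Stepping out: place an interval of size `width` at random around x,
        # then expand each end until it leaves the slice. The step budget is
        # split at random between the two sides.
        left = x - width * rng.random()
        right = left + width
        j = int(max_steps * rng.random())
        k = max_steps - 1 - j
        while j > 0 and log_f(left) > log_y:
            left -= width
            j -= 1
        while k > 0 and log_f(right) > log_y:
            right += width
            k -= 1

        # Shrinkage: draw uniformly from the interval; on rejection, shrink
        # the interval towards the current point and try again.
        while True:
            x_new = left + (right - left) * rng.random()
            if log_f(x_new) > log_y:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        samples.append(x)
    return samples


# Example: draws from a standard normal, given only its unnormalized log density.
draws = slice_sample_1d(lambda x: -0.5 * x * x, x0=0.0, n_samples=2000,
                        rng=random.Random(0))
```

Only the unnormalized log density is needed, and `width` is a rough guess at the scale of the distribution rather than a tuned stepsize, which reflects the adaptivity the abstract refers to.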

arXiv:physics/9803008 [pdf, ps, other]

Annealed Importance Sampling

Authors: Radford M. Neal

Abstract: Simulated annealing - moving from a tractable distribution to a distribution of interest via a sequence of intermediate distributions - has traditionally been used as an inexact method of handling isolated modes in Markov chain samplers. Here, it is shown how one can use the Markov chain transitions for such an annealing sequence to define an importance sampler. The Markov chain aspect allows this method to perform acceptably even for high-dimensional problems, where finding good importance sampling distributions would otherwise be very difficult, while the use of importance weights ensures that the estimates found converge to the correct values as the number of annealing runs increases. This annealed importance sampling procedure resembles the second half of the previously-studied tempered transitions, and can be seen as a generalization of a recently-proposed variant of sequential importance sampling. It is also related to thermodynamic integration methods for estimating ratios of normalizing constants. Annealed importance sampling is most attractive when isolated modes are present, or when estimates of normalizing constants are required, but it may also be more generally useful, since its independent sampling allows one to bypass some of the problems of assessing convergence and autocorrelation in Markov chain samplers.

Submitted 4 September, 1998; v1 submitted 8 March, 1998; originally announced March 1998.

Report number: TR 9805, Dept. of Statistics, Toronto
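
The procedure summarized in the abstract above can be sketched on a toy one-dimensional problem. Everything specific here is an assumption chosen for illustration and does not come from the listing: the geometric path between the two densities, the single random-walk Metropolis update per temperature, the two Gaussian densities, and all function names. Each run starts from an exact draw of the tractable distribution, accumulates an importance weight along the annealing sequence, and the average weight estimates the ratio of normalizing constants.

```python
import math
import random


def log_f_base(x):
    # Unnormalized N(0, 1); normalizing constant sqrt(2*pi).
    return -0.5 * x * x


def log_f_target(x):
    # Unnormalized N(1, 0.5^2); normalizing constant 0.5 * sqrt(2*pi).
    return -0.5 * ((x - 1.0) / 0.5) ** 2


def ais_run(betas, rng, step=0.5):
    """One annealing run; returns the final state and its log importance weight."""
    x = rng.gauss(0.0, 1.0)  # exact draw from the tractable distribution
    log_w = 0.0
    for b_prev, b in zip(betas[:-1], betas[1:]):
        # Weight update: ratio of adjacent intermediate densities at x.
        log_w += (b - b_prev) * (log_f_target(x) - log_f_base(x))

        # One Metropolis update leaving f_base^(1-b) * f_target^b invariant.
        def log_f(z, b=b):
            return (1.0 - b) * log_f_base(z) + b * log_f_target(z)

        proposal = x + rng.gauss(0.0, step)
        if math.log(1.0 - rng.random()) < log_f(proposal) - log_f(x):
            x = proposal
    return x, log_w


def estimate_ratio(n_runs=1000, n_temps=50, seed=1):
    """Average importance weight, estimating Z_target / Z_base."""
    rng = random.Random(seed)
    betas = [j / n_temps for j in range(n_temps + 1)]
    weights = [math.exp(ais_run(betas, rng)[1]) for _ in range(n_runs)]
    return sum(weights) / n_runs
```

For these two densities the true ratio is 0.5, so `estimate_ratio()` should land close to that value. The runs are independent of one another, which is the property the abstract points to when it mentions bypassing convergence and autocorrelation assessment.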

arXiv:physics/9701026 [pdf, ps, other]

Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification

Authors: Radford M. Neal

Abstract: Gaussian processes are a natural way of defining prior distributions over functions of one or more input variables. In a simple nonparametric regression problem, where such a function gives the mean of a Gaussian distribution for an observed response, a Gaussian process model can easily be implemented using matrix computations that are feasible for datasets of up to about a thousand cases. Hyperparameters that define the covariance function of the Gaussian process can be sampled using Markov chain methods. Regression models where the noise has a t distribution and logistic or probit models for classification applications can be implemented by sampling as well for latent values underlying the observations. Software is now available that implements these methods using covariance functions with hierarchical parameterizations. Models defined in this way can discover high-level properties of the data, such as which inputs are relevant to predicting the response.

Submitted 27 January, 1997; v1 submitted 27 January, 1997; originally announced January 1997.

Report number: 9702

arXiv:bayes-an/9506004 [pdf, ps]

Suppressing Random Walks in Markov Chain Monte Carlo Using Ordered Overrelaxation

Authors: R. M. Neal

Abstract: Markov chain Monte Carlo methods such as Gibbs sampling and simple forms of the Metropolis algorithm typically move about the distribution being sampled via a random walk. For the complex, high-dimensional distributions commonly encountered in Bayesian inference and statistical physics, the distance moved in each iteration of these algorithms will usually be small, because it is difficult or impossible to transform the problem to eliminate dependencies between variables. The inefficiency inherent in taking such small steps is greatly exacerbated when the algorithm operates via a random walk, as in such a case moving to a point n steps away will typically take around n^2 iterations. Such random walks can sometimes be suppressed using "overrelaxed" variants of Gibbs sampling (a.k.a. the heatbath algorithm), but such methods have hitherto been largely restricted to problems where all the full conditional distributions are Gaussian. I present an overrelaxed Markov chain Monte Carlo algorithm based on order statistics that is more widely applicable. In particular, the algorithm can be applied whenever the full conditional distributions are such that their cumulative distribution functions and inverse cumulative distribution functions can be efficiently computed. The method is demonstrated on an inference problem for a simple hierarchical Bayesian model.

Submitted 22 June, 1995; originally announced June 1995.

Comments: uuencoded compressed postscript (with instructions on decoding)

Report number: Technical Report 9508
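
The abstract above does not spell out the update itself, so the following is only a sketch of one way an order-statistics overrelaxation step can be written, under stated assumptions: `draws` holds K independent samples from the full conditional of the variable being updated, the conditional is continuous (so ties do not occur), and the new value is the one whose rank mirrors the rank of the current value. The function name `ordered_overrelax` is hypothetical.

```python
import random

def ordered_overrelax(x, draws):
    """Return the mirrored order statistic for current value x.

    x     : current value of the variable being updated
    draws : K independent samples from its full conditional (assumed supplied)
    """
    k = len(draws)
    ordered = sorted(list(draws) + [x])  # K + 1 values, ranks 0..K
    r = ordered.index(x)                 # rank of the current value
    return ordered[k - r]                # value at the opposite rank

# Illustrative use with a standard normal standing in for the conditional.
rng = random.Random(0)
current = 1.5
samples = [rng.gauss(0.0, 1.0) for _ in range(15)]
proposal = ordered_overrelax(current, samples)
```

A current value near the top of the ordering is replaced by one near the bottom, and vice versa, which is what pushes successive updates in a consistent direction instead of a random walk. This sketch draws the K samples explicitly and does not use the cumulative distribution function shortcut mentioned in the abstract.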

arXiv:hep-lat/9208011 [pdf, ps, other]

An Improved Acceptance Procedure for the Hybrid Monte Carlo Algorithm

Authors: R. M. Neal

Abstract: The probability of accepting a candidate move in the hybrid Monte Carlo algorithm can be increased by considering a transition to be between windows of several states at the beginning and end of the trajectory, with a state within the selected window being chosen according to the Boltzmann probabilities. The detailed balance condition used to justify the algorithm still holds with this procedure, provided the start state is randomly positioned within its window. The new procedure is shown empirically to significantly improve performance for a test system of uncoupled oscillators.

Submitted 20 August, 1992; v1 submitted 12 August, 1992; originally announced August 1992.

Comments: 15 pages, 4 figures (only one of which is present), new version with corrected LaTeX, submitted to J. of Comp. Physics

Showing 1–36 of 36 results for author: Neal, R M