Variational inference based on a subclass of closed skew normals

Abstract: Gaussian distributions are widely used in Bayesian variational inference to approximate intractable posterior densities, but the ability to accommodate skewness can improve approximation accuracy significantly, especially when data or prior information is scarce. We study the properties of a subclass of closed skew normals constructed using affine transformation of independent standardized univariate skew normals as the variational density, and illustrate how this subclass provides increased flexibility and accuracy in approximating the joint posterior density in a variety of applications by overcoming limitations in existing skew normal variational approximations. The evidence lower bound is optimized using stochastic gradient ascent, where analytic natural gradient updates are derived. We also demonstrate how problems in maximum likelihood estimation of skew normal parameters occur similarly in stochastic variational inference and can be resolved using the centered parametrization.

Submitted 5 June, 2023; originally announced June 2023.

Comments: keywords: Closed skew normal; Gaussian variational approximation; natural gradient; centered parametrization; LU decomposition

arXiv:2305.05529 [pdf, other]

Accelerate Langevin Sampling with Birth-Death process and Exploration Component

Authors: Lezhi Tan, Jianfeng Lu

Abstract: Sampling a probability distribution with known likelihood is a fundamental task in computational science and engineering. Aiming at multimodality, we propose a new sampling method that takes advantage of both birth-death process and exploration component. The main idea of this method is \textit{look before you leap}. We keep two sets of samplers, one at warmer temperature and one at original temperature. The former serves as a pioneer in exploring new modes and passing useful information to the other, while the latter samples the target distribution after receiving the information. We derive a mean-field limit and show how the exploration process determines sampling efficiency. Moreover, we prove exponential asymptotic convergence under mild assumptions. Finally, we test on experiments from previous literature and compare our methodology to previous ones.

Submitted 6 May, 2023; originally announced May 2023.

Comments: 23 pages, 10 figures

arXiv:2303.16208 [pdf, ps, other]

Lifting uniform learners via distributional decomposition

Authors: Guy Blanc, Jane Lange, Ali Malik, Li-Yang Tan

Abstract: We show how any PAC learning algorithm that works under the uniform distribution can be transformed, in a blackbox fashion, into one that works under an arbitrary and unknown distribution $\mathcal{D}$. The efficiency of our transformation scales with the inherent complexity of $\mathcal{D}$, running in $\mathrm{poly}(n, (md)^d)$ time for distributions over $\{\pm 1\}^n$ whose pmfs are computed by depth-$d$ decision trees, where $m$ is the sample complexity of the original algorithm. For monotone distributions our transformation uses only samples from $\mathcal{D}$, and for general ones it uses subcube conditioning samples. A key technical ingredient is an algorithm which, given the aforementioned access to $\mathcal{D}$, produces an optimal decision tree decomposition of $\mathcal{D}$: an approximation of $\mathcal{D}$ as a mixture of uniform distributions over disjoint subcubes. With this decomposition in hand, we run the uniform-distribution learner on each subcube and combine the hypotheses using the decision tree. This algorithmic decomposition lemma also yields new algorithms for learning decision tree distributions with runtimes that exponentially improve on the prior state of the art -- results of independent interest in distribution learning.

Submitted 29 March, 2023; v1 submitted 27 March, 2023; originally announced March 2023.

Comments: To appear in STOC 2023

arXiv:2302.10175 [pdf, other]

doi 10.3905/jfds.2023.1.130

Spatio-Temporal Momentum: Jointly Learning Time-Series and Cross-Sectional Strategies

Authors: Wee Ling Tan, Stephen Roberts, Stefan Zohren

Abstract: We introduce Spatio-Temporal Momentum strategies, a class of models that unify both time-series and cross-sectional momentum strategies by trading assets based on their cross-sectional momentum features over time. While both time-series and cross-sectional momentum strategies are designed to systematically capture momentum risk premia, these strategies are regarded as distinct implementations and do not consider the concurrent relationship and predictability between temporal and cross-sectional momentum features of different assets. We model spatio-temporal momentum with neural networks of varying complexities and demonstrate that a simple neural network with only a single fully connected layer learns to simultaneously generate trading signals for all assets in a portfolio by incorporating both their time-series and cross-sectional momentum features. Backtesting on portfolios of 46 actively-traded US equities and 12 equity index futures contracts, we demonstrate that the model is able to retain its performance over benchmarks in the presence of high transaction costs of up to 5-10 basis points. In particular, we find that the model when coupled with least absolute shrinkage and turnover regularization results in the best performance over various transaction cost scenarios.

Submitted 20 February, 2023; originally announced February 2023.

Journal ref: The Journal of Financial Data Science, Summer 2023

arXiv:2210.10566 [pdf, other]

Second order stochastic gradient update for Cholesky factor in Gaussian variational approximation from Stein's Lemma

Authors: Linda S. L. Tan

Abstract: In stochastic variational inference, use of the reparametrization trick for the multivariate Gaussian gives rise to efficient updates for the mean and Cholesky factor of the covariance matrix, which depend on the first order derivative of the log joint model density. In this article, we show that an alternative unbiased gradient estimate for the Cholesky factor which depends on the second order derivative of the log joint model density can be derived using Stein's Lemma. This leads to a second order stochastic gradient update for the Cholesky factor which is able to improve convergence, as it has variance lower than the first order update (almost negligible) when close to the mode. We also derive a second order update for the Cholesky factor of the precision matrix, which is useful when the precision matrix has a sparse structure reflecting conditional independence in the true posterior distribution. Our results can be used to obtain second order natural gradient updates for the Cholesky factor as well, which are more robust compared to updates based on Euclidean gradients.

Submitted 19 October, 2022; originally announced October 2022.

Comments: 15 pages, 2 figures

arXiv:2206.14431 [pdf, other]

Open Problem: Properly learning decision trees in polynomial time?

Authors: Guy Blanc, Jane Lange, Mingda Qiao, Li-Yang Tan

Abstract: The authors recently gave an $n^{O(\log\log n)}$ time membership query algorithm for properly learning decision trees under the uniform distribution (Blanc et al., 2021). The previous fastest algorithm for this problem ran in $n^{O(\log n)}$ time, a consequence of Ehrenfeucht and Haussler (1989)'s classic algorithm for the distribution-free setting. In this article we highlight the natural open problem of obtaining a polynomial-time algorithm, discuss possible avenues towards obtaining it, and state intermediate milestones that we believe are of independent interest.

Submitted 29 June, 2022; originally announced June 2022.

Comments: 5 pages, to appear at the Open Problem sessions at COLT 2022

arXiv:2109.00375 [pdf, other]

Analytic natural gradient updates for Cholesky factor in Gaussian variational approximation

Authors: Linda S. L. Tan

Abstract: Natural gradients can improve convergence in stochastic variational inference significantly but inverting the Fisher information matrix is daunting in high dimensions. Moreover, in Gaussian variational approximation, natural gradient updates of the precision matrix do not ensure positive definiteness. To tackle this issue, we derive analytic natural gradient updates of the Cholesky factor of the covariance or precision matrix, and consider sparsity constraints representing different posterior correlation structures. Stochastic normalized natural gradient ascent with momentum is proposed for implementation in generalized linear mixed models and deep neural networks.

Submitted 19 May, 2024; v1 submitted 1 September, 2021; originally announced September 2021.

Comments: 47 pages, 10 figures

arXiv:2107.00819 [pdf, other]

Decision tree heuristics can fail, even in the smoothed setting

Authors: Guy Blanc, Jane Lange, Mingda Qiao, Li-Yang Tan

Abstract: Greedy decision tree learning heuristics are mainstays of machine learning practice, but theoretical justification for their empirical success remains elusive. In fact, it has long been known that there are simple target functions for which they fail badly (Kearns and Mansour, STOC 1996). Recent work of Brutzkus, Daniely, and Malach (COLT 2020) considered the smoothed analysis model as a possible avenue towards resolving this disconnect. Within the smoothed setting and for targets $f$ that are $k$-juntas, they showed that these heuristics successfully learn $f$ with depth-$k$ decision tree hypotheses. They conjectured that the same guarantee holds more generally for targets that are depth-$k$ decision trees. We provide a counterexample to this conjecture: we construct targets that are depth-$k$ decision trees and show that even in the smoothed setting, these heuristics build trees of depth $2^{Ω(k)}$ before achieving high accuracy. We also show that the guarantees of Brutzkus et al. cannot extend to the agnostic setting: there are targets that are very close to $k$-juntas, for which these heuristics build trees of depth $2^{Ω(k)}$ before achieving high accuracy.

Submitted 2 July, 2021; originally announced July 2021.

Comments: To appear in RANDOM 2021

arXiv:2105.03594 [pdf, ps, other]

Learning stochastic decision trees

Authors: Guy Blanc, Jane Lange, Li-Yang Tan

Abstract: We give a quasipolynomial-time algorithm for learning stochastic decision trees that is optimally resilient to adversarial noise. Given an $η$-corrupted set of uniform random samples labeled by a size-$s$ stochastic decision tree, our algorithm runs in time $n^{O(\log(s/\varepsilon)/\varepsilon^2)}$ and returns a hypothesis with error within an additive $2η+ \varepsilon$ of the Bayes optimal. An additive $2η$ is the information-theoretic minimum. Previously no non-trivial algorithm with a guarantee of $O(η) + \varepsilon$ was known, even for weaker noise models. Our algorithm is furthermore proper, returning a hypothesis that is itself a decision tree; previously no such algorithm was known even in the noiseless setting.

Submitted 8 May, 2021; originally announced May 2021.

Comments: To appear in ICALP 2021

arXiv:2101.07392 [pdf]

Powering population health research: Considerations for plausible and actionable effect sizes

Authors: Ellicott C. Matthay, Erin Hagan, Laura M. Gottlieb, May Lynn Tan, David Vlahov, Nancy Adler, M. Maria Glymour

Abstract: Evidence for Action (E4A), a signature program of the Robert Wood Johnson Foundation, funds investigator-initiated research on the impacts of social programs and policies on population health and health inequities. Across thousands of letters of intent and full proposals E4A has received since 2015, one of the most common methodological challenges faced by applicants is selecting realistic effect sizes to inform power and sample size calculations. E4A prioritizes health studies that are both (1) adequately powered to detect effect sizes that may reasonably be expected for the given intervention and (2) likely to achieve intervention effect sizes that, if demonstrated, correspond to actionable evidence for population health stakeholders. However, little guidance exists to inform the selection of effect sizes for population health research proposals. We draw on examples of five rigorously evaluated population health interventions. These examples illustrate considerations for selecting realistic and actionable effect sizes as inputs to power and sample size calculations for research proposals to study population health interventions. We show that plausible effect sizes for population health interventions may be smaller than commonly cited guidelines suggest. Effect sizes achieved with population health interventions depend on the characteristics of the intervention, the target population, and the outcomes studied. Population health impact depends on the proportion of the population receiving the intervention. When adequately powered, even studies of interventions with small effect sizes can offer valuable evidence to inform population health if such interventions can be implemented broadly. Demonstrating the effectiveness of such interventions, however, requires large sample sizes.

Submitted 18 January, 2021; originally announced January 2021.

Comments: 24 pages, 1 figure

arXiv:2010.08633 [pdf, ps, other]

Universal guarantees for decision tree induction via a higher-order splitting criterion

Authors: Guy Blanc, Neha Gupta, Jane Lange, Li-Yang Tan

Abstract: We propose a simple extension of top-down decision tree learning heuristics such as ID3, C4.5, and CART. Our algorithm achieves provable guarantees for all target functions $f: \{-1,1\}^n \to \{-1,1\}$ with respect to the uniform distribution, circumventing impossibility results showing that existing heuristics fare poorly even for simple target functions. The crux of our extension is a new splitting criterion that takes into account the correlations between $f$ and small subsets of its attributes. The splitting criteria of existing heuristics (e.g. Gini impurity and information gain), in contrast, are based solely on the correlations between $f$ and its individual attributes. Our algorithm satisfies the following guarantee: for all target functions $f : \{-1,1\}^n \to \{-1,1\}$, sizes $s\in \mathbb{N}$, and error parameters $ε$, it constructs a decision tree of size $s^{\tilde{O}((\log s)^2/ε^2)}$ that achieves error $\le O(\mathsf{opt}_s) + ε$, where $\mathsf{opt}_s$ denotes the error of the optimal size $s$ decision tree. A key technical notion that drives our analysis is the noise stability of $f$, a well-studied smoothness measure.

Submitted 16 October, 2020; originally announced October 2020.

arXiv:2007.07176 [pdf, other]

Robustifying Reinforcement Learning Agents via Action Space Adversarial Training

Authors: Kai Liang Tan, Yasaman Esfandiari, Xian Yeow Lee, Aakanksha, Soumik Sarkar

Abstract: Adoption of machine learning (ML)-enabled cyber-physical systems (CPS) is becoming prevalent in various sectors of modern society such as transportation, industry, and power grids. Recent studies in deep reinforcement learning (DRL) have demonstrated its benefits in a large variety of data-driven decisions and control applications. As reliance on ML-enabled systems grows, it is imperative to study the performance of these systems under malicious state and actuator attacks. Traditional control systems employ resilient/fault-tolerant controllers that counter these attacks by correcting the system via error observations. However, in some applications, a resilient controller may not be sufficient to avoid a catastrophic failure. Ideally, a robust approach is more useful in these scenarios where a system is inherently robust (by design) to adversarial attacks. While robust control has a long history of development, robust ML is an emerging research area that has already demonstrated its relevance and urgency. However, the majority of robust ML research has focused on perception tasks and not on decision and control tasks, although the ML (specifically RL) models used for control applications are equally vulnerable to adversarial attacks. In this paper, we show that a well-performing DRL agent that is initially susceptible to action space perturbations (e.g. actuator attacks) can be robustified against similar perturbations through adversarial training.

Submitted 14 July, 2020; originally announced July 2020.

Comments: Accepted for publication in American Control Conference 2020, 6 Pages

arXiv:1912.01203 [pdf]

Music Style Classification with Compared Methods in XGB and BPNN

Authors: Lifeng Tan, Cong Jin, Zhiyuan Cheng, Xin Lv, Leiyu Song

Abstract: Scientists have used many different classification methods to solve the problem of music classification, but the efficiency of each method is different. In this paper, we compare two methods on the task of music style classification. More specifically, feature extraction for representing timbral texture, rhythmic content and pitch content is proposed. Comparative evaluations of the performance of the two classifiers were conducted for music classification with different styles. The results show that XGB is better suited for small datasets than BPNN.

Submitted 3 December, 2019; originally announced December 2019.

Comments: 5 pages, 1 figure

arXiv:1909.02583 [pdf, other]

Spatiotemporally Constrained Action Space Attacks on Deep Reinforcement Learning Agents

Authors: Xian Yeow Lee, Sambit Ghadai, Kai Liang Tan, Chinmay Hegde, Soumik Sarkar

Abstract: Robustness of Deep Reinforcement Learning (DRL) algorithms towards adversarial attacks in real world applications such as those deployed in cyber-physical systems (CPS) is of increasing concern. Numerous studies have investigated the mechanisms of attacks on the RL agent's state space. Nonetheless, attacks on the RL agent's action space (AS) (corresponding to actuators in engineering systems) are equally perverse; such attacks are relatively less studied in the ML literature. In this work, we first frame the problem as an optimization problem of minimizing the cumulative reward of an RL agent with decoupled constraints as the budget of attack. We propose a white-box Myopic Action Space (MAS) attack algorithm that distributes the attacks across the action space dimensions. Next, we reformulate the optimization problem above with the same objective function, but with a temporally coupled constraint on the attack budget to take into account the approximated dynamics of the agent. This leads to the white-box Look-ahead Action Space (LAS) attack algorithm that distributes the attacks across the action and temporal dimensions. Our results show that using the same amount of resources, the LAS attack deteriorates the agent's performance significantly more than the MAS attack. This reveals the possibility that with limited resources, an adversary can utilize the agent's dynamics to malevolently craft attacks that cause the agent to fail. Additionally, we leverage these attack strategies as a possible tool to gain insights on the potential vulnerabilities of DRL agents.

Submitted 18 November, 2019; v1 submitted 5 September, 2019; originally announced September 2019.

Comments: Version 2 with supplementary materials

arXiv:1905.13409 [pdf, other]

Bypassing Backdoor Detection Algorithms in Deep Learning

Authors: Te Juin Lester Tan, Reza Shokri

Abstract: Deep learning models are vulnerable to various adversarial manipulations of their training data, parameters, and input sample. In particular, an adversary can modify the training data and model parameters to embed backdoors into the model, so the model behaves according to the adversary's objective if the input contains the backdoor features, referred to as the backdoor trigger (e.g., a stamp on an image). The poisoned model's behavior on clean data, however, remains unchanged. Many detection algorithms are designed to detect backdoors on input samples or model parameters, through the statistical difference between the latent representations of adversarial and clean input samples in the poisoned model. In this paper, we design an adversarial backdoor embedding algorithm that can bypass the existing detection algorithms including the state-of-the-art techniques. We design an adaptive adversarial training algorithm that optimizes the original loss function of the model, and also maximizes the indistinguishability of the hidden representations of poisoned data and clean data. This work calls for designing adversary-aware defense mechanisms for backdoor detection.

Submitted 6 June, 2020; v1 submitted 31 May, 2019; originally announced May 2019.

Comments: IEEE European Symposium on Security and Privacy 2020

arXiv:1904.09591 [pdf, other]

Conditionally structured variational Gaussian approximation with importance weights

Authors: Linda S. L. Tan, Aishwarya Bhaskaran, David J. Nott

Abstract: We develop flexible methods of deriving variational inference for models with complex latent variable structure. By splitting the variables in these models into "global" parameters and "local" latent variables, we define a class of variational approximations that exploit this partitioning and go beyond Gaussian variational approximation. This approximation is motivated by the fact that in many hierarchical models, there are global variance parameters which determine the scale of local latent variables in their posterior conditional on the global parameters. We also consider parsimonious parametrizations by using conditional independence structure, and improved estimation of the log marginal likelihood and variational density using importance weights. These methods are shown to improve significantly on Gaussian variational approximation methods for a similar computational cost. Application of the methodology is illustrated using generalized linear mixed models and state space models.

Submitted 21 April, 2019; originally announced April 2019.

Comments: 18 pages, 7 figures

arXiv:1811.07886 [pdf, other]

Chemical Structure Elucidation from Mass Spectrometry by Matching Substructures

Authors: Jing Lim, Joshua Wong, Minn Xuan Wong, Lee Han Eric Tan, Hai Leong Chieu, Davin Choo, Neng Kai Nigel Neo

Abstract: Chemical structure elucidation is a serious bottleneck in analytical chemistry today. We address the problem of identifying an unknown chemical threat given its mass spectrum and its chemical formula, a task which might take well trained chemists several days to complete. Given a chemical formula, there could be over a million possible candidate structures. We take a data driven approach to rank these structures by using neural networks to predict the presence of substructures given the mass spectrum, and matching these substructures to the candidate structures. Empirically, we evaluate our approach on a data set of chemical agents built for unknown chemical threat identification. We show that our substructure classifiers can attain over 90% micro F1-score, and we can find the correct structure among the top 20 candidates in 88% and 71% of test cases for two compound classes.

Submitted 17 November, 2018; originally announced November 2018.

arXiv:1811.06100 [pdf, ps, other]

Newton Methods for Convolutional Neural Networks

Authors: Chien-Chih Wang, Kent Loong Tan, Chih-Jen Lin

Abstract: Deep learning involves a difficult non-convex optimization problem, which is often solved by stochastic gradient (SG) methods. While SG is usually effective, it may not be robust in some situations. Recently, Newton methods have been investigated as an alternative optimization technique, but nearly all existing studies consider only fully-connected feedforward neural networks. They do not investigate other types of networks such as Convolutional Neural Networks (CNN), which are more commonly used in deep-learning applications. One reason is that Newton methods for CNN involve complicated operations, and so far no works have conducted a thorough investigation. In this work, we give details of all building blocks including function, gradient, and Jacobian evaluation, and Gauss-Newton matrix-vector products. These basic components are very important because with them further developments of Newton methods for CNN become possible. We show that an efficient MATLAB implementation can be done in just several hundred lines of code and demonstrate that the Newton method gives competitive test accuracy.

Submitted 14 November, 2018; originally announced November 2018.

Comments: Supplementary materials, experimental code and an efficient MATLAB implementation are available at https://www.csie.ntu.edu.tw/~cjlin/cnn/

arXiv:1811.04249 [pdf, other]

Bayesian variational inference for exponential random graph models

Authors: Linda S. L. Tan, Nial Friel

Abstract: Deriving Bayesian inference for exponential random graph models (ERGMs) is a challenging "doubly intractable" problem as the normalizing constants of the likelihood and posterior density are both intractable. Markov chain Monte Carlo (MCMC) methods which yield Bayesian inference for ERGMs, such as the exchange algorithm, are asymptotically exact but computationally intensive, as a network has to be drawn from the likelihood at every step using, for instance, a "tie no tie" sampler. In this article, we develop a variety of variational methods for Gaussian approximation of the posterior density and model selection. These include nonconjugate variational message passing based on an adjusted pseudolikelihood and stochastic variational inference. To overcome the computational hurdle of drawing a network from the likelihood at each iteration, we propose stochastic gradient ascent with biased but consistent gradient estimates computed using adaptive self-normalized importance sampling. These methods provide attractive fast alternatives to MCMC for posterior approximation. We illustrate the variational methods using real networks and compare their accuracy with results obtained via MCMC and Laplace approximation.

Submitted 23 November, 2019; v1 submitted 10 November, 2018; originally announced November 2018.

Comments: 45 pages

arXiv:1805.07267 [pdf, ps, other]

Use of model reparametrization to improve variational Bayes

Authors: Linda S. L. Tan

Abstract: We propose using model reparametrization to improve variational Bayes inference for hierarchical models whose variables can be classified as global (shared across observations) or local (observation specific). Posterior dependence between local and global variables is minimized by applying an invertible affine transformation on the local variables. The functional form of this transformation is deduced by approximating the posterior distribution of each local variable conditional on the global variables by a Gaussian density via a second order Taylor expansion. Variational Bayes inference for the reparametrized model is then obtained using stochastic approximation. Our approach can be readily extended to large datasets via a divide and recombine strategy. Using generalized linear mixed models, we demonstrate that reparametrized variational Bayes (RVB) provides improvements in both accuracy and convergence rate compared to state of the art Gaussian variational approximation methods.

Submitted 7 March, 2020; v1 submitted 18 May, 2018; originally announced May 2018.

Journal ref: Journal of the Royal Statistical Society Series B (Statistical Methodology), 2020

arXiv:1802.00130 [pdf, other]

Distributed Newton Methods for Deep Neural Networks

Authors: Chien-Chih Wang, Kent Loong Tan, Chun-Ting Chen, Yu-Hsiang Lin, S. Sathiya Keerthi, Dhruv Mahajan, S. Sundararajan, Chih-Jen Lin

Abstract: Deep learning involves a difficult non-convex optimization problem with a large number of weights between any two adjacent layers of a deep structure. To handle large data sets or complicated networks, distributed training is needed, but the calculation of function, gradient, and Hessian is expensive. In particular, the communication and the synchronization cost may become a bottleneck. In this paper, we focus on situations where the model is stored in a distributed manner, and propose a novel distributed Newton method for training deep neural networks. By variable and feature-wise data partitions, and some careful designs, we are able to explicitly use the Jacobian matrix for matrix-vector products in the Newton method. Some techniques are incorporated to reduce the running time as well as the memory consumption. First, to reduce the communication cost, we propose a diagonalization method such that an approximate Newton direction can be obtained without communication between machines. Second, we consider subsampled Gauss-Newton matrices for reducing the running time as well as the communication cost. Third, to reduce the synchronization cost, we terminate the process of finding an approximate Newton direction even though some nodes have not finished their tasks. Details of some implementation issues in distributed environments are thoroughly investigated. Experiments demonstrate that the proposed method is effective for the distributed training of deep neural networks. Compared with stochastic gradient methods, it is more robust and may give better test accuracy.

Submitted 31 January, 2018; originally announced February 2018.

Comments: Supplementary materials and experimental code are available at https://www.csie.ntu.edu.tw/~cjlin/papers/dnn

arXiv:1712.08887 [pdf, other]

Efficient data augmentation techniques for some classes of state space models

Authors: Linda S. L. Tan

Abstract: Data augmentation improves the convergence of iterative algorithms, such as the EM algorithm and Gibbs sampler, by introducing carefully designed latent variables. In this article, we first propose a data augmentation scheme for the first-order autoregression plus noise model, where optimal values of working parameters introduced for recentering and rescaling of the latent states can be derived analytically by minimizing the fraction of missing information in the EM algorithm. The proposed data augmentation scheme is then utilized to design efficient Markov chain Monte Carlo (MCMC) algorithms for Bayesian inference of some non-Gaussian and nonlinear state space models, via a mixture of normals approximation coupled with a block-specific reparametrization strategy. Applications on simulated and benchmark real datasets indicate that the proposed MCMC sampler can yield improvements in simulation efficiency compared with centering, noncentering and even the ancillarity-sufficiency interweaving strategy.

Submitted 4 July, 2022; v1 submitted 24 December, 2017; originally announced December 2017.

Comments: Keywords: Data augmentation, State space model, Stochastic volatility model, EM algorithm, Reparametrization, Markov chain Monte Carlo, Ancillarity-sufficiency interweaving strategy

arXiv:1705.09088 [pdf, other]

doi 10.1177/1471082X18770760

Dynamic degree-corrected blockmodels for social networks: a nonparametric approach

Authors: Linda S. L. Tan, Maria De Iorio

Abstract: A nonparametric approach to the modeling of social networks using degree-corrected stochastic blockmodels is proposed. The model for static network consists of a stochastic blockmodel using a probit regression formulation and popularity parameters are incorporated to account for degree heterogeneity. Dirichlet processes are used to detect community structure as well as induce clustering in the popularity parameters. This approach is flexible yet parsimonious as it allows the appropriate number of communities and popularity clusters to be determined automatically by the data. We further discuss some ways of extending the static model to dynamic networks. We consider a Bayesian approach and derive Gibbs samplers for posterior inference. The models are illustrated using several real-world benchmark social networks.

Submitted 25 May, 2017; originally announced May 2017.

Journal ref: Statistical Modelling (2019), 19, 386-411

arXiv:1606.04995 [pdf, other]

Joint Data Compression and MAC Protocol Design for Smartgrids with Renewable Energy

Authors: Le Thanh Tan, Long Bao Le

Abstract: In this paper, we consider the joint design of data compression and 802.15.4-based medium access control (MAC) protocol for smartgrids with renewable energy. We study the setting where a number of nodes, each of which comprises electricity load and/or renewable sources, periodically report their injected powers to a data concentrator. Our design exploits the correlation of the reported data in both time and space to efficiently design the data compression using the compressed sensing (CS) technique and the MAC protocol so that the reported data can be recovered reliably within minimum reporting time. Specifically, we perform the following design tasks: i) we employ the two-dimensional (2D) CS technique to compress the reported data in a distributed manner; ii) we propose to adapt the 802.15.4 MAC protocol frame structure to enable efficient data transmission and reliable data reconstruction; and iii) we develop an analytical model based on which we can obtain efficient MAC parameter configuration to minimize the reporting delay. Finally, numerical results are presented to demonstrate the effectiveness of our proposed framework compared to existing solutions.

Submitted 15 June, 2016; originally announced June 2016.

Comments: https://arxiv.longhoe.net/admin/q/1589135, Wireless Communications and Mobile Computing, 2016. arXiv admin note: substantial text overlap with arXiv:1506.08318

arXiv:1605.05622 [pdf, other]

doi 10.1007/s11222-017-9729-7

Gaussian variational approximation with sparse precision matrices

Authors: Linda S. L. Tan, David J. Nott

Abstract: We consider the problem of learning a Gaussian variational approximation to the posterior distribution for a high-dimensional parameter, where we impose sparsity in the precision matrix to reflect appropriate conditional independence structure in the model. Incorporating sparsity in the precision matrix allows the Gaussian variational distribution to be both flexible and parsimonious, and the sparsity is achieved through parameterization in terms of the Cholesky factor. Efficient stochastic gradient methods which make appropriate use of gradient information for the target distribution are developed for the optimization. We consider alternative estimators of the stochastic gradients which have lower variation and are more stable. Our approach is illustrated using generalized linear mixed models and state space models for time series.

Submitted 12 April, 2017; v1 submitted 18 May, 2016; originally announced May 2016.

Comments: 18 pages, 9 figures

Journal ref: Statistics and Computing 28 (2018) 259-275

arXiv:1604.07087 [pdf, other]

Optimal Estimation of Slope Vector in High-dimensional Linear Transformation Model

Authors: Xin Lu Tan

Abstract: In a linear transformation model, there exists an unknown monotone nonlinear transformation function such that the transformed response variable and the predictor variables satisfy a linear regression model. In this paper, we present CENet, a new method for estimating the slope vector and simultaneously performing variable selection in the high-dimensional sparse linear transformation model. CENet is the solution to a convex optimization problem and can be computed efficiently from an algorithm with guaranteed convergence to the global optimum. We show that under a pairwise elliptical distribution assumption on each predictor-transformed-response pair and some regularity conditions, CENet attains the same optimal rate of convergence as the best regression method in the high-dimensional sparse linear regression model. To the best of our limited knowledge, this is the first such result in the literature. We demonstrate the empirical performance of CENet on both simulated and real datasets. We also discuss the connection of CENet with some nonlinear regression/multivariate methods proposed in the literature.

Submitted 24 April, 2016; originally announced April 2016.

Comments: 25 pages, 7 figures, 1 table

arXiv:1603.06358 [pdf, other]

doi 10.1214/17-AOAS1076

Bayesian inference for multiple Gaussian graphical models with application to metabolic association networks

Authors: Linda S. L. Tan, Ajay Jasra, Maria De Iorio, Timothy M. D. Ebbels

Abstract: We investigate the effect of cadmium (a toxic environmental pollutant) on the correlation structure of a number of urinary metabolites using Gaussian graphical models (GGMs). The inferred metabolic associations can provide important information on the physiological state of a metabolic system and insights on complex metabolic relationships. Using the fitted GGMs, we construct differential networks, which highlight significant changes in metabolite interactions under different experimental conditions. The analysis of such metabolic association networks can reveal differences in the underlying biological reactions caused by cadmium exposure. We consider Bayesian inference and propose using the multiplicative (or Chung-Lu random graph) model as a prior on the graphical space. In the multiplicative model, each edge is chosen independently with probability equal to the product of the connectivities of the end nodes. This class of prior is parsimonious yet highly flexible; it can be used to encourage sparsity or graphs with a pre-specified degree distribution when such prior knowledge is available. We extend the multiplicative model to multiple GGMs linking the probability of edge inclusion through logistic regression and demonstrate how this leads to joint inference for multiple GGMs. A sequential Monte Carlo (SMC) algorithm is developed for estimating the posterior distribution of the graphs.

Submitted 13 April, 2017; v1 submitted 21 March, 2016; originally announced March 2016.

Journal ref: Ann. Appl. Stat. 11 (2017) 2222-2251
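
The multiplicative (Chung-Lu) model described in the abstract above includes each edge independently with probability equal to the product of the connectivities of its end nodes. A minimal sketch of drawing one graph from such a model is given below; the function names, the requirement that connectivities lie in [0, 1], and the example values are illustrative assumptions, not the authors' implementation.

```python
import random


def sample_multiplicative_graph(connectivity, seed=None):
    """Draw an undirected graph: edge (i, j) is included independently
    with probability connectivity[i] * connectivity[j]."""
    if any(not 0.0 <= c <= 1.0 for c in connectivity):
        raise ValueError("each connectivity must lie in [0, 1]")
    rng = random.Random(seed)
    n = len(connectivity)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < connectivity[i] * connectivity[j]:
                edges.add((i, j))
    return edges


def expected_degrees(connectivity):
    """Expected degree of node i is c_i * (sum of the other connectivities)."""
    total = sum(connectivity)
    return [c * (total - c) for c in connectivity]


# Hypothetical connectivities for a 4-node graph.
graph = sample_multiplicative_graph([0.9, 0.5, 0.5, 0.1], seed=1)
```

Small connectivities make every incident edge unlikely, which is one way such a prior can encourage sparsity, while the vector of connectivities as a whole steers the expected degree of each node.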

arXiv:1511.06821 [pdf, other]

Kernel Additive Principal Components

Authors: Xin Lu Tan, Andreas Buja, Zongming Ma

Abstract: Additive principal components (APCs for short) are a nonlinear generalization of linear principal components. We focus on smallest APCs to describe additive nonlinear constraints that are approximately satisfied by the data. Thus APCs fit data with implicit equations that treat the variables symmetrically, as opposed to regression analyses which fit data with explicit equations that treat the data asymmetrically by singling out a response variable. We propose a regularized data-analytic procedure for APC estimation using kernel methods. In contrast to existing approaches to APCs that are based on regularization through subspace restriction, kernel methods achieve regularization through shrinkage and therefore grant distinctive flexibility in APC estimation by allowing the use of infinite-dimensional function spaces for searching APC transformations while retaining computational feasibility. To connect population APCs and kernelized finite-sample APCs, we study kernelized population APCs and their associated eigenproblems, which eventually lead to the establishment of consistency of the estimated APCs. Lastly, we discuss an iterative algorithm for computing kernelized finite-sample APCs.

Submitted 20 November, 2015; originally announced November 2015.

Comments: 54 pages including appendices

arXiv:1502.07190 [pdf, other]

doi 10.1214/15-AOAS887

Topic-adjusted visibility metric for scientific articles

Authors: Linda S. L. Tan, Aik Hui Chan, Tian Zheng

Abstract: Measuring the impact of scientific articles is important for evaluating the research output of individual scientists, academic institutions and journals. While citations are raw data for constructing impact measures, there exist biases and potential issues if factors affecting citation patterns are not properly accounted for. In this work, we address the problem of field variation and introduce an article level metric useful for evaluating individual articles' visibility. This measure derives from joint probabilistic modeling of the content in the articles and the citations amongst them using latent Dirichlet allocation (LDA) and the mixed membership stochastic blockmodel (MMSB). Our proposed model provides a visibility metric for individual articles adjusted for field variation in citation rates, a structural understanding of citation behavior in different fields, and article recommendations which take into account article visibility and citation patterns. We develop an efficient algorithm for model fitting using variational methods. To scale up to large networks, we develop an online variant using stochastic gradient methods and case-control likelihood approximation. We apply our methods to the benchmark KDD Cup 2003 dataset with approximately 30,000 high energy physics papers.

Submitted 16 October, 2015; v1 submitted 25 February, 2015; originally announced February 2015.

Journal ref: Annals of Applied Statistics, Volume 10, Number 1 (2016), 1-31

arXiv:1405.5623 [pdf, ps, other]

doi 10.1007/s11222-015-9618-x

Stochastic variational inference for large-scale discrete choice models using adaptive batch sizes

Authors: Linda S. L. Tan

Abstract: Discrete choice models describe the choices made by decision makers among alternatives and play an important role in transportation planning, marketing research and other applications. The mixed multinomial logit (MMNL) model is a popular discrete choice model that captures heterogeneity in the preferences of decision makers through random coefficients. While Markov chain Monte Carlo methods provide the Bayesian analogue to classical procedures for estimating MMNL models, computations can be prohibitively expensive for large datasets. Approximate inference can be obtained using variational methods at a lower computational cost with competitive accuracy. In this paper, we develop variational methods for estimating MMNL models that allow random coefficients to be correlated in the posterior and can be extended easily to large-scale datasets. We explore three alternatives: (1) Laplace variational inference, (2) nonconjugate variational message passing and (3) stochastic linear regression. Their performances are compared using real and simulated data. To accelerate convergence for large datasets, we develop stochastic variational inference for MMNL models using each of the above alternatives. Stochastic variational inference allows data to be processed in minibatches by optimizing global variational parameters using stochastic gradient approximation. A novel strategy for increasing minibatch sizes adaptively within stochastic variational inference is proposed.

Submitted 8 October, 2015; v1 submitted 21 May, 2014; originally announced May 2014.

Journal ref: Statistics and Computing (2017) 27 pp 237-257

arXiv:1306.1999 [pdf, ps, other]

doi 10.1007/s11222-015-9600-7

Variational inference for sparse spectrum Gaussian process regression

Authors: Linda S. L. Tan, Victor M. H. Ong, David J. Nott, Ajay Jasra

Abstract: We develop a fast variational approximation scheme for Gaussian process (GP) regression, where the spectrum of the covariance function is subjected to a sparse approximation. Our approach enables uncertainty in covariance function hyperparameters to be treated without using Monte Carlo methods and is robust to overfitting. Our article makes three contributions. First, we present a variational Bayes algorithm for fitting sparse spectrum GP regression models that uses nonconjugate variational message passing to derive fast and efficient updates. Second, we propose a novel adaptive neighbourhood technique for obtaining predictive inference that is effective in dealing with nonstationarity. Regression is performed locally at each point to be predicted and the neighbourhood is determined using a measure defined based on lengthscales estimated from an initial fit. Weighting dimensions according to lengthscales, this downweights variables of little relevance, leading to automatic variable selection and improved prediction. Third, we introduce a technique for accelerating convergence in nonconjugate variational message passing by adapting step sizes in the direction of the natural gradient of the lower bound. Our adaptive strategy can be easily implemented and empirical results indicate significant speedups.

Submitted 26 January, 2015; v1 submitted 9 June, 2013; originally announced June 2013.

Comments: 20 pages, 11 figures, 1 table

Journal ref: Statistics and Computing (2016) 26 pp 1243-1261

arXiv:1208.4949 [pdf, other]

doi 10.1214/14-BA885

A stochastic variational framework for fitting and diagnosing generalized linear mixed models

Authors: Linda S. L. Tan, David J. Nott

Abstract: In stochastic variational inference, the variational Bayes objective function is optimized using stochastic gradient approximation, where gradients computed on small random subsets of data are used to approximate the true gradient over the whole data set. This enables complex models to be fit to large data sets as data can be processed in mini-batches. In this article, we extend stochastic variational inference for conjugate-exponential models to nonconjugate models and present a stochastic nonconjugate variational message passing algorithm for fitting generalized linear mixed models that is scalable to large data sets. In addition, we show that diagnostics for prior-likelihood conflict, which are useful for Bayesian model criticism, can be obtained from nonconjugate variational message passing automatically, as an alternative to simulation-based Markov chain Monte Carlo methods. Finally, we demonstrate that for moderate-sized data sets, convergence can be accelerated by using the stochastic version of nonconjugate variational message passing in the initial stage of optimization before switching to the standard version.

Submitted 28 March, 2014; v1 submitted 24 August, 2012; originally announced August 2012.

Comments: 42 pages, 13 figures, 9 tables

Journal ref: Bayesian Analysis (2014), 9, 963-1004
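
The abstract above rests on the idea that a gradient computed on a small random subset of the data, rescaled by N/B, approximates the gradient over the whole data set. The sketch below illustrates only that rescaling on a toy quadratic objective; the objective, the function names and the data are assumptions for illustration and do not reproduce the paper's variational message passing algorithm.

```python
import random


def full_gradient(theta, data):
    """Gradient of sum_i 0.5 * (theta - x_i)**2 with respect to theta."""
    return sum(theta - x for x in data)


def minibatch_gradient(theta, data, batch_size, rng):
    """Estimate of the full gradient from a random mini-batch.

    The batch sum is rescaled by N / B so that the estimate has the
    same expectation as the gradient over the whole data set.
    """
    batch = rng.sample(data, batch_size)  # sampling without replacement
    scale = len(data) / batch_size
    return scale * sum(theta - x for x in batch)


rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(1000)]
theta = 0.0

exact = full_gradient(theta, data)
# Averaging many mini-batch estimates should land close to the exact value.
estimates = [minibatch_gradient(theta, data, 50, rng) for _ in range(2000)]
average = sum(estimates) / len(estimates)
```

Each individual estimate is noisy, which is why stochastic gradient schemes pair such estimates with a decreasing step size.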

arXiv:1207.4155 [pdf]

Similarity-Driven Cluster Merging Method for Unsupervised Fuzzy Clustering

Authors: Xuejian Xiong, Kap Chan, Kian Lee Tan

Abstract: In this paper, a similarity-driven cluster merging method is proposed for unsupervised fuzzy clustering. The cluster merging method is used to resolve the problem of cluster validation. Starting with an overspecified number of clusters in the data, pairs of similar clusters are merged based on the proposed similarity-driven cluster merging criterion. The similarity between clusters is calculated by a fuzzy cluster similarity matrix, while an adaptive threshold is used for merging. In addition, a modified generalized objective function is used for prototype-based fuzzy clustering. The function includes the p-norm distance measure as well as principal components of the clusters. The number of the principal components is determined automatically from the data being clustered. The properties of this unsupervised fuzzy clustering algorithm are illustrated by several experiments.

Submitted 11 July, 2012; originally announced July 2012.

Comments: Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004)

Report number: UAI-P-2004-PG-611-618

arXiv:1205.3906 [pdf, ps, other]

doi 10.1214/13-STS418

Variational Inference for Generalized Linear Mixed Models Using Partially Noncentered Parametrizations

Authors: Linda S. L. Tan, David J. Nott

Abstract: The effects of different parametrizations on the convergence of Bayesian computational algorithms for hierarchical models are well explored. Techniques such as centering, noncentering and partial noncentering can be used to accelerate convergence in MCMC and EM algorithms but are still not well studied for variational Bayes (VB) methods. As a fast deterministic approach to posterior approximation, VB is attracting increasing interest due to its suitability for large high-dimensional data. Use of different parametrizations for VB has not only computational but also statistical implications, as different parametrizations are associated with different factorized posterior approximations. We examine the use of partially noncentered parametrizations in VB for generalized linear mixed models (GLMMs). Our paper makes four contributions. First, we show how to implement an algorithm called nonconjugate variational message passing for GLMMs. Second, we show that the partially noncentered parametrization can adapt to the quantity of information in the data and determine a parametrization close to optimal. Third, we show that partial noncentering can accelerate convergence and produce more accurate posterior approximations than centering or noncentering. Finally, we demonstrate how the variational lower bound, produced as part of the computation, can be useful for model selection.

Submitted 11 June, 2013; v1 submitted 17 May, 2012; originally announced May 2012.

Comments: Published at http://dx.doi.org/10.1214/13-STS418 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-STS-STS418

Journal ref: Statistical Science 2013, Vol. 28, No. 2, 168-188

arXiv:1112.4675 [pdf, other]

doi 10.1080/10618600.2012.761138

Variational approximation for mixtures of linear mixed models

Authors: Siew Li Tan, David J. Nott

Abstract: Mixtures of linear mixed models (MLMMs) are useful for clustering grouped data and can be estimated by likelihood maximization through the EM algorithm. The conventional approach to determining a suitable number of components is to compare different mixture models using penalized log-likelihood criteria such as BIC. We propose fitting MLMMs with variational methods which can perform parameter estimation and model selection simultaneously. A variational approximation is described where the variational lower bound and parameter updates are in closed form, allowing fast evaluation. A new variational greedy algorithm is developed for model selection and learning of the mixture components. This approach allows an automatic initialization of the algorithm and returns a plausible number of mixture components automatically. In cases of weak identifiability of certain model parameters, we use hierarchical centering to reparametrize the model and show empirically that there is a gain in efficiency by variational algorithms similar to that in MCMC algorithms. Related to this, we prove that the approximate rate of convergence of variational algorithms by Gaussian approximation is equal to that of the corresponding Gibbs sampler which suggests that reparametrizations can lead to improved convergence in variational algorithms as well.

Submitted 29 August, 2012; v1 submitted 20 December, 2011; originally announced December 2011.

Comments: 36 pages, 5 figures, 2 tables, submitted to JCGS

Journal ref: Journal of Computational and Graphical Statistics. Volume 23, Issue 2, 2014, pages 564-585

Showing 1–35 of 35 results for author: Tan, L