Search | arXiv e-print repository

Standard Gaussian Process Can Be Excellent for High-Dimensional Bayesian Optimization

Abstract: There has been a long-standing and widespread belief that Bayesian Optimization (BO) with standard Gaussian process (GP), referred to as standard BO, is ineffective in high-dimensional optimization problems. While this belief sounds reasonable, strong empirical evidence is lacking. In this paper, we systematically investigated BO with standard GP regression across a variety of synthetic and real-w… ▽ More There has been a long-standing and widespread belief that Bayesian Optimization (BO) with standard Gaussian process (GP), referred to as standard BO, is ineffective in high-dimensional optimization problems. While this belief sounds reasonable, strong empirical evidence is lacking. In this paper, we systematically investigated BO with standard GP regression across a variety of synthetic and real-world benchmark problems for high-dimensional optimization. We found that, surprisingly, when using Matérn kernels and Upper Confidence Bound (UCB), standard BO consistently achieves top-tier performance, often outperforming other BO methods specifically designed for high-dimensional optimization. Contrary to the stereotype, we found that standard GP equipped with Matérn kernels can serve as a capable surrogate for learning high-dimensional functions. Without strong structural assumptions, BO with standard GP not only excels in high-dimensional optimization but also is robust in accommodating various structures within target functions. Furthermore, with standard GP, achieving promising optimization performance is possible via maximum a posterior (MAP) estimation with diffuse priors or merely maximum likelihood estimation, eliminating the need for expensive Markov-Chain Monte Carlo (MCMC) sampling that might be required by more complex surrogate models. In parallel, we also investigated and analyzed alternative popular settings in running standard BO, which, however, often fail in high-dimensional optimization. This might link to the a few failure cases reported in literature. We thus advocate for a re-evaluation and in-depth study of the potential of standard BO in addressing high-dimensional problems. △ Less

Submitted 15 May, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

arXiv:2311.04829 [pdf, other]

Functional Bayesian Tucker Decomposition for Continuous-indexed Tensor Data

Authors: Shikai Fang, Xin Yu, Zheng Wang, Shibo Li, Mike Kirby, Shandian Zhe

Abstract: Tucker decomposition is a powerful tensor model to handle multi-aspect data. It demonstrates the low-rank property by decomposing the grid-structured data as interactions between a core tensor and a set of object representations (factors). A fundamental assumption of such decomposition is that there are finite objects in each aspect or mode, corresponding to discrete indexes of data entries. Howev… ▽ More Tucker decomposition is a powerful tensor model to handle multi-aspect data. It demonstrates the low-rank property by decomposing the grid-structured data as interactions between a core tensor and a set of object representations (factors). A fundamental assumption of such decomposition is that there are finite objects in each aspect or mode, corresponding to discrete indexes of data entries. However, real-world data is often not naturally posed in this setting. For example, geographic data is represented as continuous indexes of latitude and longitude coordinates, and cannot fit tensor models directly. To generalize Tucker decomposition to such scenarios, we propose Functional Bayesian Tucker Decomposition (FunBaT). We treat the continuous-indexed data as the interaction between the Tucker core and a group of latent functions. We use Gaussian processes (GP) as functional priors to model the latent functions. Then, we convert each GP into a state-space prior by constructing an equivalent stochastic differential equation (SDE) to reduce computational cost. An efficient inference algorithm is developed for scalable posterior approximation based on advanced message-passing techniques. The advantage of our method is shown in both synthetic data and several real-world applications. We release the code of FunBaT at \url{https://github.com/xuangu-fang/Functional-Bayesian-Tucker-Decomposition}. △ Less

Submitted 18 March, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

Journal ref: The Twelfth International Conference on Learning Representations (ICLR 2024)

arXiv:2310.19666 [pdf, other]

Dynamic Tensor Decomposition via Neural Diffusion-Reaction Processes

Authors: Zheng Wang, Shikai Fang, Shibo Li, Shandian Zhe

Abstract: Tensor decomposition is an important tool for multiway data analysis. In practice, the data is often sparse yet associated with rich temporal information. Existing methods, however, often under-use the time information and ignore the structural knowledge within the sparsely observed tensor entries. To overcome these limitations and to better capture the underlying temporal structure, we propose Dy… ▽ More Tensor decomposition is an important tool for multiway data analysis. In practice, the data is often sparse yet associated with rich temporal information. Existing methods, however, often under-use the time information and ignore the structural knowledge within the sparsely observed tensor entries. To overcome these limitations and to better capture the underlying temporal structure, we propose Dynamic EMbedIngs fOr dynamic Tensor dEcomposition (DEMOTE). We develop a neural diffusion-reaction process to estimate dynamic embeddings for the entities in each tensor mode. Specifically, based on the observed tensor entries, we build a multi-partite graph to encode the correlation between the entities. We construct a graph diffusion process to co-evolve the embedding trajectories of the correlated entities and use a neural network to construct a reaction process for each individual entity. In this way, our model can capture both the commonalities and personalities during the evolution of the embeddings for different entities. We then use a neural network to model the entry value as a nonlinear function of the embedding trajectories. For model estimation, we combine ODE solvers to develop a stochastic mini-batch learning algorithm. We propose a stratified sampling method to balance the cost of processing each mini-batch so as to improve the overall efficiency. We show the advantage of our approach in both simulation study and real-world applications. The code is available at https://github.com/wzhut/Dynamic-Tensor-Decomposition-via-Neural-Diffusion-Reaction-Processes. △ Less

Submitted 30 October, 2023; originally announced October 2023.

arXiv:2310.05387 [pdf, other]

Equation Discovery with Bayesian Spike-and-Slab Priors and Efficient Kernels

Authors: Da Long, Wei W. Xing, Aditi S. Krishnapriyan, Robert M. Kirby, Shandian Zhe, Michael W. Mahoney

Abstract: Discovering governing equations from data is important to many scientific and engineering applications. Despite promising successes, existing methods are still challenged by data sparsity and noise issues, both of which are ubiquitous in practice. Moreover, state-of-the-art methods lack uncertainty quantification and/or are costly in training. To overcome these limitations, we propose a novel equa… ▽ More Discovering governing equations from data is important to many scientific and engineering applications. Despite promising successes, existing methods are still challenged by data sparsity and noise issues, both of which are ubiquitous in practice. Moreover, state-of-the-art methods lack uncertainty quantification and/or are costly in training. To overcome these limitations, we propose a novel equation discovery method based on Kernel learning and BAyesian Spike-and-Slab priors (KBASS). We use kernel regression to estimate the target function, which is flexible, expressive, and more robust to data sparsity and noises. We combine it with a Bayesian spike-and-slab prior -- an ideal Bayesian sparse distribution -- for effective operator selection and uncertainty quantification. We develop an expectation-propagation expectation-maximization (EP-EM) algorithm for efficient posterior inference and function estimation. To overcome the computational challenge of kernel regression, we place the function values on a mesh and induce a Kronecker product construction, and we use tensor algebra to enable efficient computation and optimization. We show the advantages of KBASS on a list of benchmark ODE and PDE discovery tasks. △ Less

Submitted 21 April, 2024; v1 submitted 8 October, 2023; originally announced October 2023.

arXiv:2308.14906 [pdf, other]

BayOTIDE: Bayesian Online Multivariate Time series Imputation with functional decomposition

Authors: Shikai Fang, Qingsong Wen, Yingtao Luo, Shandian Zhe, Liang Sun

Abstract: In real-world scenarios like traffic and energy, massive time-series data with missing values and noises are widely observed, even sampled irregularly. While many imputation methods have been proposed, most of them work with a local horizon, which means models are trained by splitting the long sequence into batches of fit-sized patches. This local horizon can make models ignore global trends or pe… ▽ More In real-world scenarios like traffic and energy, massive time-series data with missing values and noises are widely observed, even sampled irregularly. While many imputation methods have been proposed, most of them work with a local horizon, which means models are trained by splitting the long sequence into batches of fit-sized patches. This local horizon can make models ignore global trends or periodic patterns. More importantly, almost all methods assume the observations are sampled at regular time stamps, and fail to handle complex irregular sampled time series arising from different applications. Thirdly, most existing methods are learned in an offline manner. Thus, it is not suitable for many applications with fast-arriving streaming data. To overcome these limitations, we propose BayOTIDE: Bayesian Online Multivariate Time series Imputation with functional decomposition. We treat the multivariate time series as the weighted combination of groups of low-rank temporal factors with different patterns. We apply a group of Gaussian Processes (GPs) with different kernels as functional priors to fit the factors. For computational efficiency, we further convert the GPs into a state-space prior by constructing an equivalent stochastic differential equation (SDE), and develo** a scalable algorithm for online inference. The proposed method can not only handle imputation over arbitrary time stamps, but also offer uncertainty quantification and interpretability for the downstream application. We evaluate our method on both synthetic and real-world datasets.We release the code at {https://github.com/xuangu-fang/BayOTIDE} △ Less

Submitted 30 May, 2024; v1 submitted 28 August, 2023; originally announced August 2023.

Comments: Accepted by The 41st International Conference on Machine Learning (ICML 2024)

arXiv:2210.08140 [pdf, other]

A Kernel Approach for PDE Discovery and Operator Learning

Authors: Da Long, Nicole Mrvaljevic, Shandian Zhe, Bamdad Hosseini

Abstract: This article presents a three-step framework for learning and solving partial differential equations (PDEs) using kernel methods. Given a training set consisting of pairs of noisy PDE solutions and source/boundary terms on a mesh, kernel smoothing is utilized to denoise the data and approximate derivatives of the solution. This information is then used in a kernel regression model to learn the alg… ▽ More This article presents a three-step framework for learning and solving partial differential equations (PDEs) using kernel methods. Given a training set consisting of pairs of noisy PDE solutions and source/boundary terms on a mesh, kernel smoothing is utilized to denoise the data and approximate derivatives of the solution. This information is then used in a kernel regression model to learn the algebraic form of the PDE. The learned PDE is then used within a kernel based solver to approximate the solution of the PDE with a new source/boundary term, thereby constituting an operator learning framework. Numerical experiments compare the method to state-of-the-art algorithms and demonstrate its competitive performance. △ Less

Submitted 30 March, 2023; v1 submitted 14 October, 2022; originally announced October 2022.

arXiv:2207.03639 [pdf, other]

Nonparametric Embeddings of Sparse High-Order Interaction Events

Authors: Zheng Wang, Yiming Xu, Conor Tillinghast, Shibo Li, Akil Narayan, Shandian Zhe

Abstract: High-order interaction events are common in real-world applications. Learning embeddings that encode the complex relationships of the participants from these events is of great importance in knowledge mining and predictive tasks. Despite the success of existing approaches, e.g. Poisson tensor factorization, they ignore the sparse structure underlying the data, namely the occurred interactions are… ▽ More High-order interaction events are common in real-world applications. Learning embeddings that encode the complex relationships of the participants from these events is of great importance in knowledge mining and predictive tasks. Despite the success of existing approaches, e.g. Poisson tensor factorization, they ignore the sparse structure underlying the data, namely the occurred interactions are far less than the possible interactions among all the participants. In this paper, we propose Nonparametric Embeddings of Sparse High-order interaction events (NESH). We hybridize a sparse hypergraph (tensor) process and a matrix Gaussian process to capture both the asymptotic structural sparsity within the interactions and nonlinear temporal relationships between the participants. We prove strong asymptotic bounds (including both a lower and an upper bound) of the sparsity ratio, which reveals the asymptotic properties of the sampled structure. We use batch-normalization, stick-breaking construction, and sparse variational GP approximations to develop an efficient, scalable model inference algorithm. We demonstrate the advantage of our approach in several real-world applications. △ Less

Submitted 7 July, 2022; originally announced July 2022.

Comments: 9 pages, ICML 2022

arXiv:2110.10082 [pdf, other]

Nonparametric Sparse Tensor Factorization with Hierarchical Gamma Processes

Authors: Conor Tillinghast, Zheng Wang, Shandian Zhe

Abstract: We propose a nonparametric factorization approach for sparsely observed tensors. The sparsity does not mean zero-valued entries are massive or dominated. Rather, it implies the observed entries are very few, and even fewer with the growth of the tensor; this is ubiquitous in practice. Compared with the existent works, our model not only leverages the structural information underlying the observed… ▽ More We propose a nonparametric factorization approach for sparsely observed tensors. The sparsity does not mean zero-valued entries are massive or dominated. Rather, it implies the observed entries are very few, and even fewer with the growth of the tensor; this is ubiquitous in practice. Compared with the existent works, our model not only leverages the structural information underlying the observed entry indices, but also provides extra interpretability and flexibility -- it can simultaneously estimate a set of location factors about the intrinsic properties of the tensor nodes, and another set of sociability factors reflecting their extrovert activity in interacting with others; users are free to choose a trade-off between the two types of factors. Specifically, we use hierarchical Gamma processes and Poisson random measures to construct a tensor-valued process, which can freely sample the two types of factors to generate tensors and always guarantees an asymptotic sparsity. We then normalize the tensor process to obtain hierarchical Dirichlet processes to sample each observed entry index, and use a Gaussian process to sample the entry value as a nonlinear function of the factors, so as to capture both the sparse structure properties and complex node relationships. For efficient inference, we use Dirichlet process properties over finite sample partitions, density transformations, and random features to develop a stochastic variational estimation algorithm. We demonstrate the advantage of our method in several benchmark datasets. △ Less

Submitted 3 November, 2021; v1 submitted 19 October, 2021; originally announced October 2021.

Comments: 15 pages, 4 figures

arXiv:2106.09884 [pdf, other]

Batch Multi-Fidelity Bayesian Optimization with Deep Auto-Regressive Networks

Authors: Shibo Li, Robert M. Kirby, Shandian Zhe

Abstract: Bayesian optimization (BO) is a powerful approach for optimizing black-box, expensive-to-evaluate functions. To enable a flexible trade-off between the cost and accuracy, many applications allow the function to be evaluated at different fidelities. In order to reduce the optimization cost while maximizing the benefit-cost ratio, in this paper, we propose Batch Multi-fidelity Bayesian Optimization… ▽ More Bayesian optimization (BO) is a powerful approach for optimizing black-box, expensive-to-evaluate functions. To enable a flexible trade-off between the cost and accuracy, many applications allow the function to be evaluated at different fidelities. In order to reduce the optimization cost while maximizing the benefit-cost ratio, in this paper, we propose Batch Multi-fidelity Bayesian Optimization with Deep Auto-Regressive Networks (BMBO-DARN). We use a set of Bayesian neural networks to construct a fully auto-regressive model, which is expressive enough to capture strong yet complex relationships across all the fidelities, so as to improve the surrogate learning and optimization performance. Furthermore, to enhance the quality and diversity of queries, we develop a simple yet efficient batch querying method, without any combinatorial search over the fidelities. We propose a batch acquisition function based on Max-value Entropy Search (MES) principle, which penalizes highly correlated queries and encourages diversity. We use posterior samples and moment matching to fulfill efficient computation of the acquisition function and conduct alternating optimization over every fidelity-input pair, which guarantees an improvement at each step. We demonstrate the advantage of our approach on four real-world hyperparameter optimization applications. △ Less

Submitted 25 October, 2021; v1 submitted 17 June, 2021; originally announced June 2021.

arXiv:2007.07367 [pdf, other]

Streaming Probabilistic Deep Tensor Factorization

Authors: Shikai Fang, Zheng Wang, Zhimeng Pan, Ji Liu, Shandian Zhe

Abstract: Despite the success of existing tensor factorization methods, most of them conduct a multilinear decomposition, and rarely exploit powerful modeling frameworks, like deep neural networks, to capture a variety of complicated interactions in data. More important, for highly expressive, deep factorization, we lack an effective approach to handle streaming data, which are ubiquitous in real-world appl… ▽ More Despite the success of existing tensor factorization methods, most of them conduct a multilinear decomposition, and rarely exploit powerful modeling frameworks, like deep neural networks, to capture a variety of complicated interactions in data. More important, for highly expressive, deep factorization, we lack an effective approach to handle streaming data, which are ubiquitous in real-world applications. To address these issues, we propose SPIDER, a Streaming ProbabilistIc Deep tEnsoR factorization method. We first use Bayesian neural networks (NNs) to construct a deep tensor factorization model. We assign a spike-and-slab prior over the NN weights to encourage sparsity and prevent overfitting. We then use Taylor expansions and moment matching to approximate the posterior of the NN output and calculate the running model evidence, based on which we develop an efficient streaming posterior inference algorithm in the assumed-density-filtering and expectation propagation framework. Our algorithm provides responsive incremental updates for the posterior of the latent factors and NN weights upon receiving new tensor entries, and meanwhile select and inhibit redundant/useless weights. We show the advantages of our approach in four real-world applications. △ Less

Submitted 14 July, 2020; originally announced July 2020.

arXiv:2007.03117 [pdf, ps, other]

Multi-Fidelity Bayesian Optimization via Deep Neural Networks

Authors: Shibo Li, Wei Xing, Mike Kirby, Shandian Zhe

Abstract: Bayesian optimization (BO) is a popular framework to optimize black-box functions. In many applications, the objective function can be evaluated at multiple fidelities to enable a trade-off between the cost and accuracy. To reduce the optimization cost, many multi-fidelity BO methods have been proposed. Despite their success, these methods either ignore or over-simplify the strong, complex correla… ▽ More Bayesian optimization (BO) is a popular framework to optimize black-box functions. In many applications, the objective function can be evaluated at multiple fidelities to enable a trade-off between the cost and accuracy. To reduce the optimization cost, many multi-fidelity BO methods have been proposed. Despite their success, these methods either ignore or over-simplify the strong, complex correlations across the fidelities, and hence can be inefficient in estimating the objective function. To address this issue, we propose Deep Neural Network Multi-Fidelity Bayesian Optimization (DNN-MFBO) that can flexibly capture all kinds of complicated relationships between the fidelities to improve the objective function estimation and hence the optimization performance. We use sequential, fidelity-wise Gauss-Hermite quadrature and moment-matching to fulfill a mutual information-based acquisition function, which is computationally tractable and efficient. We show the advantages of our method in both synthetic benchmark datasets and real-world applications in engineering design. △ Less

Submitted 10 December, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

arXiv:2006.04976 [pdf, other]

Physics Informed Deep Kernel Learning

Authors: Zheng Wang, Wei Xing, Robert Kirby, Shandian Zhe

Abstract: Deep kernel learning is a promising combination of deep neural networks and nonparametric function learning. However, as a data driven approach, the performance of deep kernel learning can still be restricted by scarce or insufficient data, especially in extrapolation tasks. To address these limitations, we propose Physics Informed Deep Kernel Learning (PI-DKL) that exploits physics knowledge repr… ▽ More Deep kernel learning is a promising combination of deep neural networks and nonparametric function learning. However, as a data driven approach, the performance of deep kernel learning can still be restricted by scarce or insufficient data, especially in extrapolation tasks. To address these limitations, we propose Physics Informed Deep Kernel Learning (PI-DKL) that exploits physics knowledge represented by differential equations with latent sources. Specifically, we use the posterior function sample of the Gaussian process as the surrogate for the solution of the differential equation, and construct a generative component to integrate the equation in a principled Bayesian hybrid framework. For efficient and effective inference, we marginalize out the latent variables in the joint probability and derive a collapsed model evidence lower bound (ELBO), based on which we develop a stochastic model estimation algorithm. Our ELBO can be viewed as a nice, interpretable posterior regularization objective. On synthetic datasets and real-world applications, we show the advantage of our approach in both prediction accuracy and uncertainty quantification. △ Less

Submitted 18 January, 2022; v1 submitted 8 June, 2020; originally announced June 2020.

Comments: 8 pages, 5 figures, AISTATS

arXiv:2006.04972 [pdf, other]

Multi-Fidelity High-Order Gaussian Processes for Physical Simulation

Authors: Zheng Wang, Wei Xing, Robert Kirby, Shandian Zhe

Abstract: The key task of physical simulation is to solve partial differential equations (PDEs) on discretized domains, which is known to be costly. In particular, high-fidelity solutions are much more expensive than low-fidelity ones. To reduce the cost, we consider novel Gaussian process (GP) models that leverage simulation examples of different fidelities to predict high-dimensional PDE solution outputs.… ▽ More The key task of physical simulation is to solve partial differential equations (PDEs) on discretized domains, which is known to be costly. In particular, high-fidelity solutions are much more expensive than low-fidelity ones. To reduce the cost, we consider novel Gaussian process (GP) models that leverage simulation examples of different fidelities to predict high-dimensional PDE solution outputs. Existing GP methods are either not scalable to high-dimensional outputs or lack effective strategies to integrate multi-fidelity examples. To address these issues, we propose Multi-Fidelity High-Order Gaussian Process (MFHoGP) that can capture complex correlations both between the outputs and between the fidelities to enhance solution estimation, and scale to large numbers of outputs. Based on a novel nonlinear coregionalization model, MFHoGP propagates bases throughout fidelities to fuse information, and places a deep matrix GP prior over the basis weights to capture the (nonlinear) relationships across the fidelities. To improve inference efficiency and quality, we use bases decomposition to largely reduce the model parameters, and layer-wise matrix Gaussian posteriors to capture the posterior dependency and to simplify the computation. Our stochastic variational learning algorithm successfully handles millions of outputs without extra sparse approximations. We show the advantages of our method in several typical applications. △ Less

Submitted 8 June, 2020; originally announced June 2020.

arXiv:2003.11489 [pdf, ps, other]

Scalable Variational Gaussian Process Regression Networks

Authors: Shibo Li, Wei Xing, Mike Kirby, Shandian Zhe

Abstract: Gaussian process regression networks (GPRN) are powerful Bayesian models for multi-output regression, but their inference is intractable. To address this issue, existing methods use a fully factorized structure (or a mixture of such structures) over all the outputs and latent functions for posterior approximation, which, however, can miss the strong posterior dependencies among the latent variable… ▽ More Gaussian process regression networks (GPRN) are powerful Bayesian models for multi-output regression, but their inference is intractable. To address this issue, existing methods use a fully factorized structure (or a mixture of such structures) over all the outputs and latent functions for posterior approximation, which, however, can miss the strong posterior dependencies among the latent variables and hurt the inference quality. In addition, the updates of the variational parameters are inefficient and can be prohibitively expensive for a large number of outputs. To overcome these limitations, we propose a scalable variational inference algorithm for GPRN, which not only captures the abundant posterior dependencies but also is much more efficient for massive outputs. We tensorize the output space and introduce tensor/matrix-normal variational posteriors to capture the posterior correlations and to reduce the parameters. We jointly optimize all the parameters and exploit the inherent Kronecker product structure in the variational model evidence lower bound to accelerate the computation. We demonstrate the advantages of our method in several real-world applications. △ Less

Submitted 18 May, 2020; v1 submitted 25 March, 2020; originally announced March 2020.

arXiv:2002.02374 [pdf, other]

doi 10.1016/j.trb.2021.02.007

Macroscopic Traffic Flow Modeling with Physics Regularized Gaussian Process: A New Insight into Machine Learning Applications

Authors: Yun Yuan, Xianfeng Terry Yang, Zhao Zhang, Shandian Zhe

Abstract: Despite the wide implementation of machine learning (ML) techniques in traffic flow modeling recently, those data-driven approaches often fall short of accuracy in the cases with a small or noisy dataset. To address this issue, this study presents a new modeling framework, named physics regularized machine learning (PRML), to encode classical traffic flow models (referred as physical models) into… ▽ More Despite the wide implementation of machine learning (ML) techniques in traffic flow modeling recently, those data-driven approaches often fall short of accuracy in the cases with a small or noisy dataset. To address this issue, this study presents a new modeling framework, named physics regularized machine learning (PRML), to encode classical traffic flow models (referred as physical models) into the ML architecture and to regularize the ML training process. More specifically, a stochastic physics regularized Gaussian process (PRGP) model is developed and a Bayesian inference algorithm is used to estimate the mean and kernel of the PRGP. A physical regularizer based on macroscopic traffic flow models is also developed to augment the estimation via a shadow GP and an enhanced latent force model is used to encode physical knowledge into stochastic processes. Based on the posterior regularization inference framework, an efficient stochastic optimization algorithm is also developed to maximize the evidence lowerbound of the system likelihood. To prove the effectiveness of the proposed model, this paper conducts empirical studies on a real-world dataset which is collected from a stretch of I-15 freeway, Utah. Results show the new PRGP model can outperform the previous compatible methods, such as calibrated pure physical models and pure machine learning methods, in estimation precision and input robustness. △ Less

Submitted 6 February, 2020; originally announced February 2020.

Comments: 30 pages, 13 figures

Journal ref: Transp Res B: Methodol, 146, 88-110 (2021)

arXiv:1910.12360 [pdf, ps, other]

Conditional Expectation Propagation

Authors: Zheng Wang, Shandian Zhe

Abstract: Expectation propagation (EP) is a powerful approximate inference algorithm. However, a critical barrier in applying EP is that the moment matching in message updates can be intractable. Handcrafting approximations is usually tricky, and lacks generalizability. Importance sampling is very expensive. While Laplace propagation provides a good solution, it has to run numerical optimizations to find La… ▽ More Expectation propagation (EP) is a powerful approximate inference algorithm. However, a critical barrier in applying EP is that the moment matching in message updates can be intractable. Handcrafting approximations is usually tricky, and lacks generalizability. Importance sampling is very expensive. While Laplace propagation provides a good solution, it has to run numerical optimizations to find Laplace approximations in every update, which is still quite inefficient. To overcome these practical barriers, we propose conditional expectation propagation (CEP) that performs conditional moment matching given the variables outside each message, and then takes expectation w.r.t the approximate posterior of these variables. The conditional moments are often analytical and much easier to derive. In the most general case, we can use (fully) factorized messages to represent the conditional moments by quadrature formulas. We then compute the expectation of the conditional moments via Taylor approximations when necessary. In this way, our algorithm can always conduct efficient, analytical fixed point iterations. Experiments on several popular models for which standard EP is available or unavailable demonstrate the advantages of CEP in both inference quality and computational efficiency. △ Less

Submitted 8 November, 2019; v1 submitted 27 October, 2019; originally announced October 2019.

Comments: 10 pages, 5 figures, UAI 2019

arXiv:1712.05134 [pdf, other]

Learning Compact Recurrent Neural Networks with Block-Term Tensor Decomposition

Authors: **mian Ye, Linnan Wang, Guangxi Li, Di Chen, Shandian Zhe, Xinqi Chu, Zenglin Xu

Abstract: Recurrent Neural Networks (RNNs) are powerful sequence modeling tools. However, when dealing with high dimensional inputs, the training of RNNs becomes computational expensive due to the large number of model parameters. This hinders RNNs from solving many important computer vision tasks, such as Action Recognition in Videos and Image Captioning. To overcome this problem, we propose a compact and… ▽ More Recurrent Neural Networks (RNNs) are powerful sequence modeling tools. However, when dealing with high dimensional inputs, the training of RNNs becomes computational expensive due to the large number of model parameters. This hinders RNNs from solving many important computer vision tasks, such as Action Recognition in Videos and Image Captioning. To overcome this problem, we propose a compact and flexible structure, namely Block-Term tensor decomposition, which greatly reduces the parameters of RNNs and improves their training efficiency. Compared with alternative low-rank approximations, such as tensor-train RNN (TT-RNN), our method, Block-Term RNN (BT-RNN), is not only more concise (when using the same rank), but also able to attain a better approximation to the original RNNs with much fewer parameters. On three challenging tasks, including Action Recognition in Videos, Image Captioning and Image Generation, BT-RNN outperforms TT-RNN and the standard RNN in terms of both prediction accuracy and convergence rate. Specifically, BT-LSTM utilizes 17,388 times fewer parameters than the standard LSTM to achieve an accuracy improvement over 15.6\% in the Action Recognition task on the UCF11 dataset. △ Less

Submitted 11 May, 2018; v1 submitted 14 December, 2017; originally announced December 2017.

Comments: CVPR2018

arXiv:1704.06735 [pdf, ps, other]

Asynchronous Distributed Variational Gaussian Processes for Regression

Authors: Hao Peng, Shandian Zhe, Yuan Qi

Abstract: Gaussian processes (GPs) are powerful non-parametric function estimators. However, their applications are largely limited by the expensive computational cost of the inference procedures. Existing stochastic or distributed synchronous variational inferences, although have alleviated this issue by scaling up GPs to millions of samples, are still far from satisfactory for real-world large application… ▽ More Gaussian processes (GPs) are powerful non-parametric function estimators. However, their applications are largely limited by the expensive computational cost of the inference procedures. Existing stochastic or distributed synchronous variational inferences, although have alleviated this issue by scaling up GPs to millions of samples, are still far from satisfactory for real-world large applications, where the data sizes are often orders of magnitudes larger, say, billions. To solve this problem, we propose ADVGP, the first Asynchronous Distributed Variational Gaussian Process inference for regression, on the recent large-scale machine learning platform, PARAMETERSERVER. ADVGP uses a novel, flexible variational framework based on a weight space augmentation, and implements the highly efficient, asynchronous proximal gradient optimization. While maintaining comparable or better predictive performance, ADVGP greatly improves upon the efficiency of the existing variational methods. With ADVGP, we effortlessly scale up GP regression to a real-world application with billions of samples and demonstrate an excellent, superior prediction accuracy to the popular linear models. △ Less

Submitted 12 June, 2017; v1 submitted 21 April, 2017; originally announced April 2017.

Comments: International Conference on Machine Learning 2017

arXiv:1604.07928 [pdf, ps, other]

Distributed Flexible Nonlinear Tensor Factorization

Authors: Shandian Zhe, Kai Zhang, Pengyuan Wang, Kuang-chih Lee, Zenglin Xu, Yuan Qi, Zoubin Ghahramani

Abstract: Tensor factorization is a powerful tool to analyse multi-way data. Compared with traditional multi-linear methods, nonlinear tensor factorization models are capable of capturing more complex relationships in the data. However, they are computationally expensive and may suffer severe learning bias in case of extreme data sparsity. To overcome these limitations, in this paper we propose a distribute… ▽ More Tensor factorization is a powerful tool to analyse multi-way data. Compared with traditional multi-linear methods, nonlinear tensor factorization models are capable of capturing more complex relationships in the data. However, they are computationally expensive and may suffer severe learning bias in case of extreme data sparsity. To overcome these limitations, in this paper we propose a distributed, flexible nonlinear tensor factorization model. Our model can effectively avoid the expensive computations and structural restrictions of the Kronecker-product in existing TGP formulations, allowing an arbitrary subset of tensorial entries to be selected to contribute to the training. At the same time, we derive a tractable and tight variational evidence lower bound (ELBO) that enables highly decoupled, parallel computations and high-quality inference. Based on the new bound, we develop a distributed inference algorithm in the MapReduce framework, which is key-value-free and can fully exploit the memory cache mechanism in fast MapReduce systems such as SPARK. Experimental results fully demonstrate the advantages of our method over several state-of-the-art approaches, in terms of both predictive performance and computational efficiency. Moreover, our approach shows a promising potential in the application of Click-Through-Rate (CTR) prediction for online advertising. △ Less

Submitted 21 May, 2016; v1 submitted 27 April, 2016; originally announced April 2016.

Comments: Gaussian process, tensor factorization, multidimensional arrays, large scale, spark, map-reduce

ACM Class: I.5.1; I.5.4

arXiv:1311.2663 [pdf, ps, other]

DinTucker: Scaling up Gaussian process models on multidimensional arrays with billions of elements

Authors: Shandian Zhe, Yuan Qi, Youngja Park, Ian Molloy, Suresh Chari

Abstract: Infinite Tucker Decomposition (InfTucker) and random function prior models, as nonparametric Bayesian models on infinite exchangeable arrays, are more powerful models than widely-used multilinear factorization methods including Tucker and PARAFAC decomposition, (partly) due to their capability of modeling nonlinear relationships between array elements. Despite their great predictive performance an… ▽ More Infinite Tucker Decomposition (InfTucker) and random function prior models, as nonparametric Bayesian models on infinite exchangeable arrays, are more powerful models than widely-used multilinear factorization methods including Tucker and PARAFAC decomposition, (partly) due to their capability of modeling nonlinear relationships between array elements. Despite their great predictive performance and sound theoretical foundations, they cannot handle massive data due to a prohibitively high training time. To overcome this limitation, we present Distributed Infinite Tucker (DINTUCKER), a large-scale nonlinear tensor decomposition algorithm on MAPREDUCE. While maintaining the predictive accuracy of InfTucker, it is scalable on massive data. DINTUCKER is based on a new hierarchical Bayesian model that enables local training of InfTucker on subarrays and information integration from all local training results. We use distributed stochastic gradient descent, coupled with variational inference, to train this model. We apply DINTUCKER to multidimensional arrays with billions of elements from applications in the "Read the Web" project (Carlson et al., 2010) and in information security and compare it with the state-of-the-art large-scale tensor decomposition method, GigaTensor. On both datasets, DINTUCKER achieves significantly higher prediction accuracy with less computational time. △ Less

Submitted 1 February, 2014; v1 submitted 11 November, 2013; originally announced November 2013.

arXiv:1304.7284 [pdf, other]

Supervised Heterogeneous Multiview Learning for Joint Association Study and Disease Diagnosis

Authors: Shandian Zhe, Zenglin Xu, Yuan Qi

Abstract: Given genetic variations and various phenotypical traits, such as Magnetic Resonance Imaging (MRI) features, we consider two important and related tasks in biomedical research: i)to select genetic and phenotypical markers for disease diagnosis and ii) to identify associations between genetic and phenotypical data. These two tasks are tightly coupled because underlying associations between genetic… ▽ More Given genetic variations and various phenotypical traits, such as Magnetic Resonance Imaging (MRI) features, we consider two important and related tasks in biomedical research: i)to select genetic and phenotypical markers for disease diagnosis and ii) to identify associations between genetic and phenotypical data. These two tasks are tightly coupled because underlying associations between genetic variations and phenotypical features contain the biological basis for a disease. While a variety of sparse models have been applied for disease diagnosis and canonical correlation analysis and its extensions have bee widely used in association studies (e.g., eQTL analysis), these two tasks have been treated separately. To unify these two tasks, we present a new sparse Bayesian approach for joint association study and disease diagnosis. In this approach, common latent features are extracted from different data sources based on sparse projection matrices and used to predict multiple disease severity levels based on Gaussian process ordinal regression; in return, the disease status is used to guide the discovery of relationships between the data sources. The sparse projection matrices not only reveal interactions between data sources but also select groups of biomarkers related to the disease. To learn the model from data, we develop an efficient variational expectation maximization algorithm. Simulation results demonstrate that our approach achieves higher accuracy in both predicting ordinal labels and discovering associations between data sources than alternative methods. We apply our approach to an imaging genetics dataset for the study of Alzheimer's Disease (AD). Our method identifies biologically meaningful relationships between genetic variations, MRI features, and AD status, and achieves significantly higher accuracy for predicting ordinal AD stages than the competing methods. △ Less

Submitted 16 October, 2013; v1 submitted 26 April, 2013; originally announced April 2013.

Showing 1–21 of 21 results for author: Zhe, S