Bayesian Deep ICE

Authors: Jyotishka Datta, Nicholas G. Polson

Abstract: Deep Independent Component Estimation (DICE) has many applications in modern day machine learning as a feature engineering extraction method. We provide a novel latent variable representation of independent component analysis that enables both point estimates via expectation-maximization (EM) and full posterior sampling via Markov Chain Monte Carlo (MCMC) algorithms. Our methodology also applies to flow-based methods for nonlinear feature extraction. We discuss how to implement conditional posteriors and envelope-based methods for optimization. Through this representation hierarchy, we unify a number of hitherto disjoint estimation procedures. We illustrate our methodology and algorithms on a numerical example. Finally, we conclude with directions for future research.

Submitted 24 June, 2024; originally announced June 2024.

MSC Class: 62F15; 62H25; 68T07

arXiv:2402.09583 [pdf, other]

Horseshoe Priors for Sparse Dirichlet-Multinomial Models

Authors: Yuexi Wang, Nicholas G. Polson

Abstract: Bayesian inference for Dirichlet-Multinomial (DM) models has a long and important history. The concentration parameter $α$ is pivotal in smoothing category probabilities within the multinomial distribution and is crucial for the inference afterward. Due to the lack of a tractable form of its marginal likelihood, $α$ is often chosen in an ad-hoc manner, or estimated using approximation algorithms. A constant $α$ often leads to inadequate smoothing of probabilities, particularly for sparse compositional count datasets. In this paper, we introduce a novel class of prior distributions facilitating conjugate updating of the concentration parameter, allowing for full Bayesian inference for DM models. Our methodology is based on fast residue computation and admits closed-form posterior moments in specific scenarios. Additionally, our prior provides continuous shrinkage with its heavy tail and substantial mass around zero, ensuring adaptability to the sparsity or quasi-sparsity of the data. We demonstrate the usefulness of our approach on both simulated examples and on real-world applications. Finally, we conclude with directions for future research.

Submitted 11 March, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

arXiv:2310.06251 [pdf, other]

Deep Learning: A Tutorial

Authors: Nick Polson, Vadim Sokolov

Abstract: Our goal is to provide a review of deep learning methods which provide insight into structured high-dimensional data. Rather than using shallow additive architectures common to most statistical models, deep learning uses layers of semi-affine input transformations to provide a predictive rule. Applying these layers of transformations leads to a set of attributes (or, features) to which probabilistic statistical methods can be applied. Thus, the best of both worlds can be achieved: scalable prediction rules fortified with uncertainty quantification, where sparse regularization finds the features.

Submitted 9 October, 2023; originally announced October 2023.

Comments: arXiv admin note: text overlap with arXiv:1808.08618

arXiv:2306.16096 [pdf, other]

Generative Causal Inference

Authors: Maria Nareklishvili, Nicholas Polson, Vadim Sokolov

Abstract: In this paper we propose the use of generative AI methods in econometrics. Generative methods avoid the use of densities as done by MCMC. They directly simulate large samples of observables and unobservables (parameters, latent variables) and then use a high-dimensional deep learner to inform a nonlinear transport map from data to parameter inferences. Our methods apply to a wide variety of econometrics problems, including those where the latent variables are updated in a deterministic fashion. Further, we illustrate our methodology in the field of causal inference and show how generative AI provides a generalization of propensity scores. Our approach can also handle nonlinearity and heterogeneity. Finally, we conclude with directions for future research.

Submitted 28 June, 2023; originally announced June 2023.

Comments: arXiv admin note: text overlap with arXiv:2305.14972

arXiv:2305.14972 [pdf, other]

Generative AI for Bayesian Computation

Authors: Nicholas G. Polson, Vadim Sokolov

Abstract: Bayesian Generative AI (BayesGen-AI) methods are developed and applied to Bayesian computation. BayesGen-AI reconstructs the posterior distribution by directly modeling the parameter of interest as a mapping (a.k.a. deep learner) from a large simulated dataset. This provides a generator that we can evaluate at the observed data and provide draws from the posterior distribution. This method applies to all forms of Bayesian inference including parametric models, likelihood-free models, prediction and maximum expected utility problems. Bayesian computation is then equivalent to high dimensional non-parametric regression. The main advantage of BayesGen-AI is that it is density-free and therefore provides an alternative to Markov Chain Monte Carlo. It has a number of advantages over vanilla generative adversarial networks (GAN) and approximate Bayesian computation (ABC) methods due to the fact that the generator is simpler to learn than a GAN architecture and is more flexible than kernel smoothing implicit in ABC methods. Design of the network architecture requires careful selection of features (a.k.a. dimension reduction) and nonlinear architecture for inference. As a generic architecture, we propose a deep quantile neural network and a uniform base distribution at which to evaluate the generator. To illustrate our methodology, we provide two real data examples, the first in traffic flow prediction and the second in building a surrogate for a satellite drag dataset. Finally, we conclude with directions for future research.

Submitted 24 February, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: arXiv admin note: text overlap with arXiv:2209.02163

arXiv:2305.03158 [pdf, other]

Quantile Importance Sampling

Authors: Jyotishka Datta, Nicholas G. Polson

Abstract: In Bayesian inference, the approximation of integrals of the form $ψ = \mathbb{E}_{F}{l(X)} = \int_χ l(\mathbf{x}) d F(\mathbf{x})$ is a fundamental challenge. Such integrals are crucial for evidence estimation, which is important for various purposes, including model selection and numerical analysis. The existing strategies for evidence estimation are classified into four categories: deterministic approximation, density estimation, importance sampling, and vertical representation (Llorente et al., 2020). In this paper, we show that the Riemann sum estimator due to Yakowitz (1978) can be used in the context of nested sampling (Skilling, 2006) to achieve an $O(n^{-4})$ rate of convergence, faster than the usual Ergodic Central Limit Theorem. We provide a brief overview of the literature on the Riemann sum estimators and the nested sampling algorithm and its connections to vertical likelihood Monte Carlo. We provide theoretical and numerical arguments to show how merging these two ideas may result in improved and more robust estimators for evidence estimation, especially in higher dimensional spaces. We also briefly discuss the idea of simulating the Lorenz curve that avoids the problem of intractable $Λ$ functions, essential for the vertical representation and nested sampling.

Submitted 25 May, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

MSC Class: 65C05; 62F15

arXiv:2208.09563 [pdf, other]

On the Probability of Magnus Carlsen reaching 2900

Authors: Sohan Bendre, Shiva Maharaj, Nick Polson, Vadim Sokolov

Abstract: How likely is it that Magnus Carlsen will achieve an Elo rating of $2900$? This has been a goal of Magnus and is of great current interest to the chess community. Our paper uses probabilistic methods to address this question. The probabilistic properties of Elo's rating system have long been studied, and we provide an application of such methods. By applying a Brownian motion model of Stern as a simple tool we provide answers. Our research also has fundamental bearing on the choice of the $K$-factor used in Elo's system for GrandMaster (GM) chess play. Finally, we conclude with a discussion of policy issues involved with the choice of $K$-factor.

Submitted 19 August, 2022; originally announced August 2022.

arXiv:2208.08068 [pdf, other]

Quantum Bayesian Computation

Authors: Nick Polson, Vadim Sokolov, Jianeng Xu

Abstract: Quantum Bayesian Computation (QBC) is an emerging field that levers the computational gains available from quantum computers to provide an exponential speed-up in Bayesian computation. Our paper adds to the literature in two ways. First, we show how von Neumann quantum measurement can be used to simulate machine learning algorithms such as Markov chain Monte Carlo (MCMC) and Deep Learning (DL) that are fundamental to Bayesian learning. Second, we describe data encoding methods needed to implement quantum machine learning, including the counterparts to traditional feature extraction and kernel embedding methods. Our goal then is to show how to apply quantum algorithms directly to statistical machine learning problems. On the theoretical side, we provide quantum versions of high dimensional regression, Gaussian processes (Q-GP) and stochastic gradient descent (Q-SGD). On the empirical side, we apply a Quantum FFT model to Chicago housing data. Finally, we conclude with directions for future research.

Submitted 4 March, 2023; v1 submitted 17 August, 2022; originally announced August 2022.

arXiv:2207.02612 [pdf, other]

Deep Partial Least Squares for Instrumental Variable Regression

Authors: Maria Nareklishvili, Nicholas Polson, Vadim Sokolov

Abstract: In this paper, we propose deep partial least squares for the estimation of high-dimensional nonlinear instrumental variable regression. As a precursor to a flexible deep neural network architecture, our methodology uses partial least squares for dimension reduction and feature selection from the set of instruments and covariates. A central theoretical result, due to Brillinger (2012), shows that the feature selection provided by partial least squares is consistent and the weights are estimated up to a proportionality constant. We illustrate our methodology with synthetic datasets with a sparse and correlated network structure and draw applications to the effect of childbearing on the mother's labor supply based on classic data of Angrist and Evans (1996). The results on synthetic data as well as applications show that the deep partial least squares method significantly outperforms other related methods. Finally, we conclude with directions for future research.

Submitted 2 June, 2023; v1 submitted 6 July, 2022; originally announced July 2022.

arXiv:2206.10014 [pdf, other]

Deep Partial Least Squares for Empirical Asset Pricing

Authors: Matthew F. Dixon, Nicholas G. Polson, Kemen Goicoechea

Abstract: We use deep partial least squares (DPLS) to estimate an asset pricing model for individual stock returns that exploits conditioning information in a flexible and dynamic way while attributing excess returns to a small set of statistical risk factors. The novel contribution is to resolve the non-linear factor structure, thus advancing the current paradigm of deep learning in empirical asset pricing which uses linear stochastic discount factors under an assumption of Gaussian asset returns and factors. This non-linear factor structure is extracted by using projected least squares to jointly project firm characteristics and asset returns on to a subspace of latent factors and using deep learning to learn the non-linear map from the factor loadings to the asset returns. The result of capturing this non-linear risk factor structure is to characterize anomalies in asset returns by both linear risk factor exposure and interaction effects. Thus the well known ability of deep learning to capture outliers sheds light on the role of convexity and higher order terms in the latent factor structure on the factor risk premia. On the empirical side, we implement our DPLS factor models and exhibit superior performance to LASSO and plain vanilla deep learning models. Furthermore, our network training times are significantly reduced due to the more parsimonious architecture of DPLS. Specifically, using 3290 assets in the Russell 1000 index over a period of December 1989 to January 2018, we assess our DPLS factor model and generate information ratios that are approximately 1.2x greater than deep learning. DPLS explains variation and pricing errors and identifies the most prominent latent factors and firm characteristics.

Submitted 20 June, 2022; originally announced June 2022.

arXiv:2204.14121 [pdf, other]

Inverse Probability Weighting: from Survey Sampling to Evidence Estimation

Authors: Jyotishka Datta, Nicholas Polson

Abstract: We consider the class of inverse probability weight (IPW) estimators, including the popular Horvitz-Thompson and Hajek estimators used routinely in survey sampling, causal inference and evidence estimation for Bayesian computation. We focus on the 'weak paradoxes' for these estimators due to two counterexamples by Basu [1988] and Wasserman [2004] and investigate the two natural Bayesian answers to this problem: one based on binning and smoothing, a 'Bayesian sieve', and the other based on a conjugate hierarchical model that allows borrowing information via exchangeability. We compare the mean squared errors for the two Bayesian estimators with the IPW estimators for Wasserman's example via simulation studies on a broad range of parameter configurations. We also prove posterior consistency for the Bayes estimators under the missing-completely-at-random assumption and show that it requires fewer assumptions on the inclusion probabilities. We also revisit the connection between the different problems where improved or adaptive IPW estimators will be useful, including survey sampling, evidence estimation strategies such as Conditional Monte Carlo, Riemannian sum, Trapezoidal rules and vertical likelihood, as well as average treatment effect estimation in causal inference.

Submitted 14 November, 2022; v1 submitted 29 April, 2022; originally announced April 2022.

Comments: 25 pages, 4 figures. Added another simulation study and clarified the assumptions needed for the proof of consistency

MSC Class: 62F15; 62F12; 62D05; 65C05

arXiv:2110.11561 [pdf, other]

Merging Two Cultures: Deep and Statistical Learning

Authors: Anindya Bhadra, Jyotishka Datta, Nick Polson, Vadim Sokolov, Jianeng Xu

Abstract: Merging the two cultures of deep and statistical learning provides insights into structured high-dimensional data. Traditional statistical modeling is still a dominant strategy for structured tabular data. Deep learning can be viewed through the lens of generalized linear models (GLMs) with composite link functions. Sufficient dimensionality reduction (SDR) and sparsity perform nonlinear feature engineering. We show that prediction, interpolation and uncertainty quantification can be achieved using probabilistic methods at the output layer of the model. Thus a general framework for machine learning arises that first generates nonlinear features (a.k.a. factors) via sparse regularization and stochastic gradient optimisation and second uses a stochastic output layer for predictive uncertainty. Rather than using shallow additive architectures as in many statistical models, deep learning uses layers of semi-affine input transformations to provide a predictive rule. Applying these layers of transformations leads to a set of attributes (a.k.a. features) to which predictive statistical methods can be applied. Thus we achieve the best of both worlds: scalability and fast predictive rule construction together with uncertainty quantification. Sparse regularisation with unsupervised or supervised learning finds the features. We clarify the duality between shallow and wide models such as PCA, PPR, RRR and deep but skinny architectures such as autoencoders, MLPs, CNN, and LSTM. The connection with data transformations is of practical importance for finding good network architectures. By incorporating probabilistic components at the output level we allow for predictive uncertainty. For interpolation we use deep Gaussian process and ReLU trees for classification. We provide applications to regression, classification and interpolation. Finally, we conclude with directions for future research.

Submitted 21 October, 2021; originally announced October 2021.

Comments: arXiv admin note: text overlap with arXiv:2106.14085

arXiv:2109.11602 [pdf, other]

doi 10.3390/e24040550

Chess AI: Competing Paradigms for Machine Intelligence

Authors: Shiva Maharaj, Nick Polson, Alex Turk

Abstract: Endgame studies have long served as a tool for testing human creativity and intelligence. We find that they can serve as a tool for testing machine ability as well. Two of the leading chess engines, Stockfish and Leela Chess Zero (LCZero), employ significantly different methods during play. We use Plaskett's Puzzle, a famous endgame study from the late 1970s, to compare the two engines. Our experiments show that Stockfish outperforms LCZero on the puzzle. We examine the algorithmic differences between the engines and use our observations as a basis for carefully interpreting the test results. Drawing inspiration from how humans solve chess problems, we ask whether machines can possess a form of imagination. On the theoretical side, we describe how Bellman's equation may be applied to optimize the probability of winning. To conclude, we discuss the implications of our work on artificial intelligence (AI) and artificial general intelligence (AGI), suggesting possible avenues for future research.

Submitted 23 September, 2021; originally announced September 2021.

Comments: 15 pages, 8 figures

arXiv:2106.14085 [pdf, other]

Deep Learning Partial Least Squares

Authors: Nicholas Polson, Vadim Sokolov, Jianeng Xu

Abstract: High dimensional data reduction techniques are provided by using partial least squares within deep learning. Our framework provides a nonlinear extension of PLS together with a disciplined approach to feature selection and architecture design in deep learning. This leads to a statistical interpretation of deep learning that is tailor-made for predictive problems. We can use the tools of PLS, such as the scree plot and biplot, to provide model diagnostics. Posterior predictive uncertainty is available using MCMC methods at the last layer. Thus we achieve the best of both worlds: scalability and fast predictive rule construction together with uncertainty quantification. Our key construct is to employ deep learning within PLS by predicting the output scores as a deep learner of the input scores. As with PLS, our X-scores are constructed using SVD and applied to both regression and classification problems and are fast and scalable. Following Frank and Friedman (1993), we provide a Bayesian shrinkage interpretation of our nonlinear predictor. We introduce a variety of new partial least squares models: PLS-ReLU, PLS-Autoencoder, PLS-Trees and PLS-GP. To illustrate our methodology, we use simulated examples and the analysis of preferences of orange juice and predicting wine quality as a function of input characteristics. We also illustrate Brillinger's estimation procedure to provide the feature selection and data dimension reduction. Finally, we conclude with directions for future research.

Submitted 26 June, 2021; originally announced June 2021.

arXiv:2106.01906

Bayesian Inference for Gamma Models

Authors: Jingyu He, Nicholas Polson, Jianeng Xu

Abstract: We use the theory of normal variance-mean mixtures to derive a data augmentation scheme for models that include gamma functions. Our methodology applies to many situations in statistics and machine learning, including Multinomial-Dirichlet distributions, Negative binomial regression, Poisson-Gamma hierarchical models, Extreme value models, to name but a few. All of those models include a gamma function which does not admit a natural conjugate prior distribution providing a significant challenge to inference and prediction. To provide a data augmentation strategy, we construct and develop the theory of the class of Exponential Reciprocal Gamma distributions. This allows scalable EM and MCMC algorithms to be developed. We illustrate our methodology on a number of examples, including gamma shape inference, negative binomial regression and Dirichlet allocation. Finally, we conclude with directions for future research.

Submitted 21 June, 2021; v1 submitted 3 June, 2021; originally announced June 2021.

Comments: Duplicate submission of arXiv:1905.12141 Please check arXiv:1905.12141 for future update

arXiv:1905.12141 [pdf, other]

Data Augmentation with Polya Inverse Gamma

Authors: Jingyu He, Nicholas G. Polson, Jianeng Xu

Abstract: We use the theory of normal variance-mean mixtures to derive a data augmentation scheme for models that include gamma functions. Our methodology applies to many situations in statistics and machine learning, including Multinomial-Dirichlet distributions, Negative binomial regression, Poisson-Gamma hierarchical models, Extreme value models, to name but a few. All of those models include a gamma function which does not admit a natural conjugate prior distribution providing a significant challenge to inference and prediction. To provide a data augmentation strategy, we construct and develop the theory of the class of Pólya Inverse Gamma distributions. This allows scalable EM and MCMC algorithms to be developed. We illustrate our methodology on a number of examples, including gamma shape inference, negative binomial regression and Dirichlet allocation. Finally, we conclude with directions for future research.

Submitted 1 May, 2022; v1 submitted 28 May, 2019; originally announced May 2019.

arXiv:1904.10939 [pdf, other]

Horseshoe Regularization for Machine Learning in Complex and Deep Models

Authors: Anindya Bhadra, Jyotishka Datta, Yunfan Li, Nicholas G. Polson

Abstract: Since the advent of the horseshoe priors for regularization, global-local shrinkage methods have proved to be a fertile ground for the development of Bayesian methodology in machine learning, specifically for high-dimensional regression and classification problems. They have achieved remarkable success in computation, and enjoy strong theoretical support. Most of the existing literature has focused on the linear Gaussian case; see Bhadra et al. (2019b) for a systematic survey. The purpose of the current article is to demonstrate that the horseshoe regularization is useful far more broadly, by reviewing both methodological and computational developments in complex models that are more relevant to machine learning applications. Specifically, we focus on methodological challenges in horseshoe regularization in nonlinear and non-Gaussian models; multivariate models; and deep neural networks. We also outline the recent computational developments in horseshoe shrinkage for complex models along with a list of available software implementations that allow one to venture out beyond the comfort zone of the canonical linear regression problems.

Submitted 22 November, 2019; v1 submitted 24 April, 2019; originally announced April 2019.

arXiv:1903.09668 [pdf, ps, other]

doi 10.1214/22-BA1331

Data Augmentation for Bayesian Deep Learning

Authors: Yuexi Wang, Nicholas G. Polson, Vadim O. Sokolov

Abstract: Deep Learning (DL) methods have emerged as one of the most powerful tools for functional approximation and prediction. While the representation properties of DL have been well studied, uncertainty quantification remains challenging and largely unexplored. Data augmentation techniques are a natural approach to provide uncertainty quantification and to incorporate stochastic Monte Carlo search into stochastic gradient descent (SGD) methods. The purpose of our paper is to show that training DL architectures with data augmentation leads to efficiency gains. We use the theory of scale mixtures of normals to derive data augmentation strategies for deep learning. This allows variants of the expectation-maximization and MCMC algorithms to be brought to bear on these high dimensional nonlinear deep learning models. To demonstrate our methodology, we develop data augmentation algorithms for a variety of commonly used activation functions: logit, ReLU, leaky ReLU and SVM. Our methodology is compared to traditional stochastic gradient descent with back-propagation. Our optimization procedure leads to a version of iteratively re-weighted least squares and can be implemented at scale with accelerated linear algebra methods providing substantial improvement in speed. We illustrate our methodology on a number of standard datasets. Finally, we conclude with directions for future research.

Submitted 24 October, 2022; v1 submitted 22 March, 2019; originally announced March 2019.

arXiv:1903.07677 [pdf, other]

Deep Fundamental Factor Models

Authors: Matthew F. Dixon, Nicholas G. Polson

Abstract: Deep fundamental factor models are developed to automatically capture non-linearity and interaction effects in factor modeling. Uncertainty quantification provides interpretability with interval estimation, ranking of factor importances and estimation of interaction effects. With no hidden layers we recover a linear factor model and for one or more hidden layers, uncertainty bands for the sensitivity to each input naturally arise from the network weights. Using 3290 assets in the Russell 1000 index over a period of December 1989 to January 2018, we assess a 49 factor model and generate information ratios that are approximately 1.5x greater than the OLS factor model. Furthermore, we compare our deep fundamental factor model with a quadratic LASSO model and demonstrate the superior performance and robustness to outliers. The Python source code and the data used for this study are provided.

Submitted 27 August, 2020; v1 submitted 18 March, 2019; originally announced March 2019.

Journal ref: Forthcoming in SIAM J. Financial Mathematics, 2020

arXiv:1902.06269 [pdf, other]

Bayesian Regularization: From Tikhonov to Horseshoe

Authors: Nicholas G. Polson, Vadim Sokolov

Abstract: Bayesian regularization is a central tool in modern-day statistical and machine learning methods. Many applications involve high-dimensional sparse signal recovery problems. The goal of our paper is to provide a review of the literature on penalty-based regularization approaches, from Tikhonov (Ridge, Lasso) to horseshoe regularization.

Submitted 17 February, 2019; originally announced February 2019.

arXiv:1808.08618 [pdf, other]

Deep Learning: Computational Aspects

Authors: Nicholas Polson, Vadim Sokolov

Abstract: In this article we review computational aspects of Deep Learning (DL). Deep learning uses network architectures consisting of hierarchical layers of latent variables to construct predictors for high-dimensional input-output models. Training a deep learning architecture is computationally intensive, and efficient linear algebra libraries are key for training and inference. Stochastic gradient descent (SGD) optimization and batch sampling are used to learn from massive data sets.

Submitted 28 August, 2019; v1 submitted 26 August, 2018; originally announced August 2018.

arXiv:1807.07987 [pdf, other]

Deep Learning

Authors: Nicholas G. Polson, Vadim O. Sokolov

Abstract: Deep learning (DL) is a high dimensional data reduction technique for constructing high-dimensional predictors in input-output models. DL is a form of machine learning that uses hierarchical layers of latent features. In this article, we review the state-of-the-art of deep learning from a modeling and algorithmic perspective. We provide a list of successful areas of applications in Artificial Intelligence (AI), Image Processing, Robotics and Automation. Deep learning is predictive in its nature rather than inferential and can be viewed as a black-box methodology for high-dimensional function estimation.

Submitted 3 August, 2018; v1 submitted 20 July, 2018; originally announced July 2018.

Comments: arXiv admin note: text overlap with arXiv:1602.06561

arXiv:1805.01104 [pdf, other]

Deep Learning in Characteristics-Sorted Factor Models

Authors: Guanhao Feng, Jingyu He, Nicholas G. Polson, Jianeng Xu

Abstract: This paper presents an augmented deep factor model that generates latent factors for cross-sectional asset pricing. The conventional security sorting on firm characteristics for constructing long-short factor portfolio weights is nonlinear modeling, while factors are treated as inputs in linear models. We provide a structural deep learning framework to generalize the complete mechanism for fitting cross-sectional returns by firm characteristics through generating risk factors -- hidden layers. Our model has an economic-guided objective function that minimizes aggregated realized pricing errors. Empirical results on high-dimensional characteristics demonstrate robust asset pricing performance and strong investment improvements by identifying important raw characteristic sources.

Submitted 19 July, 2023; v1 submitted 2 May, 2018; originally announced May 2018.

arXiv:1804.09314 [pdf, other]

Deep Learning for Predicting Asset Returns

Authors: Guanhao Feng, Jingyu He, Nicholas G. Polson

Abstract: Deep learning searches for nonlinear factors for predicting asset returns. Predictability is achieved via multiple layers of composite factors as opposed to additive ones. Viewed in this way, asset pricing studies can be revisited using multi-layer deep learners, such as rectified linear units (ReLU) or long-short-term-memory (LSTM) for time-series effects. State-of-the-art algorithms including stochastic gradient descent (SGD), TensorFlow and dropout design provide implementation and efficient factor exploration. To illustrate our methodology, we revisit the equity market risk premium dataset of Welch and Goyal (2008). We find the existence of nonlinear factors which explain predictability of returns, in particular at the extremes of the characteristic space. Finally, we conclude with directions for future research.

Submitted 26 April, 2018; v1 submitted 24 April, 2018; originally announced April 2018.

arXiv:1803.09138 [pdf, ps, other]

Posterior Concentration for Sparse Deep Learning

Authors: Nicholas Polson, Veronika Rockova

Abstract: Spike-and-Slab Deep Learning (SS-DL) is a fully Bayesian alternative to Dropout for improving generalizability of deep ReLU networks. This new type of regularization enables provable recovery of smooth input-output maps with unknown levels of smoothness. Indeed, we show that the posterior distribution concentrates at the near minimax rate for $α$-Hölder smooth maps, performing as well as if we knew the smoothness level $α$ ahead of time. Our result sheds light on architecture design for deep neural networks, namely the choice of depth, width and sparsity level. These network attributes typically depend on unknown smoothness in order to be optimal. We obviate this constraint with the fully Bayes construction. As an aside, we show that SS-DL does not overfit in the sense that the posterior concentrates on smaller networks with fewer (up to the optimal number of) nodes and links. Our results provide new theoretical justifications for deep ReLU networks from a Bayesian point of view.

Submitted 24 March, 2018; originally announced March 2018.

arXiv:1803.04559 [pdf, other]

doi 10.1002/cjs.11570

Weighted Bayesian Bootstrap for Scalable Bayes

Authors: Michael Newton, Nicholas G. Polson, Jianeng Xu

Abstract: We develop a weighted Bayesian Bootstrap (WBB) for machine learning and statistics. WBB provides uncertainty quantification by sampling from a high dimensional posterior distribution. WBB is computationally fast and scalable using only off-the-shelf optimization software such as TensorFlow. We provide regularity conditions which apply to a wide range of machine learning and statistical models. We illustrate our methodology in regularized regression, trend filtering and deep learning. Finally, we conclude with directions for future research.

Submitted 12 March, 2018; originally announced March 2018.

Journal ref: Canadian Journal of Statistics 2020

arXiv:1712.03889 [pdf, other]

Statistical sparsity

Authors: Peter McCullagh, Nicholas Polson

Abstract: The main contribution of this paper is a mathematical definition of statistical sparsity, which is expressed as a limiting property of a sequence of probability distributions. The limit is characterized by an exceedance measure $H$ and a rate parameter $ρ > 0$, both of which are unrelated to sample size. The definition is sufficient to encompass all sparsity models that have been suggested in the signal-detection literature. Sparsity implies that $ρ$ is small, and a sparse approximation is asymptotic in the rate parameter, typically with error $o(ρ)$ in the sparse limit $ρ\to 0$. To first order in sparsity, the sparse signal plus Gaussian noise convolution depends on the signal distribution only through its rate parameter and exceedance measure. This is one of several asymptotic approximations implied by the definition, each of which is most conveniently expressed in terms of the zeta-transformation of the exceedance measure. One implication is that two sparse families having the same exceedance measure are inferentially equivalent, and cannot be distinguished to first order. A converse implication for methodological strategy is that it may be more fruitful to focus on the exceedance measure, ignoring aspects of the signal distribution that have negligible effect on observables and on inferences. From this point of view, scale models and inverse-power measures seem particularly attractive.

Submitted 23 May, 2018; v1 submitted 11 December, 2017; originally announced December 2017.

Comments: 21 pages, 6 figures, 1 table

arXiv:1709.00379 [pdf, ps, other]

Sparse Regularization in Marketing and Economics

Authors: Guanhao Feng, Nicholas Polson, Yuexi Wang, Jianeng Xu

Abstract: Sparse alpha-norm regularization has many data-rich applications in Marketing and Economics. Alpha-norm, in contrast to lasso and ridge regularization, jumps to a sparse solution. This feature is attractive for ultra high-dimensional problems that occur in demand estimation and forecasting. The alpha-norm objective is nonconvex and requires coordinate descent and proximal operators to find the sparse solution. We study a typical marketing demand forecasting problem, grocery store sales for salty snacks, that has many dummy variables as controls. The key predictors of demand include price, equivalized volume, promotion, flavor, scent, and brand effects. By comparing with many commonly used machine learning methods, alpha-norm regularization achieves its goal of providing accurate out-of-sample estimates for the promotion lift effects. Finally, we conclude with directions for future research.

Submitted 5 February, 2018; v1 submitted 1 September, 2017; originally announced September 2017.

arXiv:1706.10179 [pdf, other]

Lasso Meets Horseshoe: A Survey

Authors: Anindya Bhadra, Jyotishka Datta, Nicholas G. Polson, Brandon T. Willard

Abstract: The goal of this paper is to contrast and survey the major advances in two of the most commonly used high-dimensional techniques, namely, the Lasso and horseshoe regularization. Lasso is a gold standard for predictor selection while horseshoe is a state-of-the-art Bayesian estimator for sparse signals. Lasso is fast and scalable and uses convex optimization whilst the horseshoe is non-convex. Our novel perspective focuses on three aspects: (i) theoretical optimality in high dimensional inference for the Gaussian sparse model and beyond, (ii) efficiency and scalability of computation and (iii) methodological development and performance.

Submitted 3 March, 2019; v1 submitted 30 June, 2017; originally announced June 2017.

Comments: 32 pages, 4 figures

MSC Class: Primary 62J07; 62J05; Secondary 62H15; 62F03

arXiv:1706.00473 [pdf, other]

doi 10.1214/17-BA1082

Deep Learning: A Bayesian Perspective

Authors: Nicholas Polson, Vadim Sokolov

Abstract: Deep learning is a form of machine learning for nonlinear high dimensional pattern matching and prediction. By taking a Bayesian probabilistic perspective, we provide a number of insights into more efficient algorithms for optimisation and hyper-parameter tuning. Traditional high-dimensional data reduction techniques, such as principal component analysis (PCA), partial least squares (PLS), reduced rank regression (RRR), projection pursuit regression (PPR) are all shown to be shallow learners. Their deep learning counterparts exploit multiple deep layers of data reduction which provide predictive performance gains. Stochastic gradient descent (SGD) training optimisation and Dropout (DO) regularization provide estimation and variable selection. Bayesian regularization is central to finding weights and connections in networks to optimize the predictive bias-variance trade-off. To illustrate our methodology, we provide an analysis of international bookings on Airbnb. Finally, we conclude with directions for future research.

Submitted 13 November, 2017; v1 submitted 1 June, 2017; originally announced June 2017.

arXiv:1706.00098 [pdf, ps, other]

doi 10.1002/asmb.2381

Bayesian $l_0$-regularized Least Squares

Authors: Nicholas G. Polson, Lei Sun

Abstract: Bayesian $l_0$-regularized least squares is a variable selection technique for high dimensional predictors. The challenge is optimizing a non-convex objective function via search over model space consisting of all possible predictor combinations. Spike-and-slab (a.k.a. Bernoulli-Gaussian) priors are the gold standard for Bayesian variable selection, with a caveat of computational speed and scalability. Single Best Replacement (SBR) provides a fast scalable alternative. We provide a link between Bayesian regularization and proximal updating, which provides an equivalence between finding a posterior mode and a posterior mean with a different regularization prior. This allows us to use SBR to find the spike-and-slab estimator. To illustrate our methodology, we provide simulation evidence and a real data example on the statistical properties and computational efficiency of SBR versus direct posterior sampling using spike-and-slab priors. Finally, we conclude with directions for future research.

Submitted 18 December, 2018; v1 submitted 31 May, 2017; originally announced June 2017.

Comments: 22 pages, 6 figures, 1 table

MSC Class: 62-04

arXiv:1705.09851 [pdf, other]

Deep Learning for Spatio-Temporal Modeling: Dynamic Traffic Flows and High Frequency Trading

Authors: Matthew F. Dixon, Nicholas G. Polson, Vadim O. Sokolov

Abstract: Deep learning applies hierarchical layers of hidden variables to construct nonlinear high dimensional predictors. Our goal is to develop and train deep learning architectures for spatio-temporal modeling. Training a deep architecture is achieved by stochastic gradient descent (SGD) and drop-out (DO) for parameter regularization with a goal of minimizing out-of-sample predictive mean squared error. To illustrate our methodology, we predict the sharp discontinuities in traffic flow data, and secondly, we develop a classification rule to predict short-term futures market prices as a function of the order book depth. Finally, we conclude with directions for future research.

Submitted 7 May, 2018; v1 submitted 27 May, 2017; originally announced May 2017.

arXiv:1705.04141 [pdf, ps, other]

From Least Squares to Signal Processing and Particle Filtering

Authors: Nozer D. Singpurwalla, Nicholas G. Polson, Refik Soyer

Abstract: De facto, signal processing is the interpolation and extrapolation of a sequence of observations viewed as a realization of a stochastic process. Its role in applied statistics ranges from scenarios in forecasting and time series analysis, to image reconstruction, machine learning, and the degradation modeling for reliability assessment. A general solution to the problem of filtering and prediction entails some formidable mathematics. Efforts to circumvent the mathematics have resulted in the need for introducing more explicit descriptions of the underlying process. One such example, and a noteworthy one, is the Kalman Filter Model, which is a special case of state space models or what statisticians refer to as Dynamic Linear Models. Implementing the Kalman Filter Model in the era of "big and high velocity non-Gaussian data" can pose computational challenges with respect to efficiency and timeliness. Particle filtering is a way to ease such computational burdens. The purpose of this paper is to trace the historical evolution of this development from its inception to its current state, with an expository focus on two versions of the particle filter, namely, the propagate first-update next and the update first-propagate next version. By way of going beyond a pure review, this paper also makes transparent the importance and the role of a less recognized principle, namely the principle of conditionalization, in filtering and prediction based on Bayesian methods. Furthermore, the paper also articulates the philosophical underpinnings of the filtering and prediction set-up, a matter that needs to be made explicit, and Yule's decomposition of a random variable in terms of a sequence of innovations.

Submitted 11 May, 2017; originally announced May 2017.

arXiv:1702.07400 [pdf, other]

Horseshoe Regularization for Feature Subset Selection

Authors: Anindya Bhadra, Jyotishka Datta, Nicholas G. Polson, Brandon Willard

Abstract: Feature subset selection arises in many high-dimensional applications of statistics, such as compressed sensing and genomics. The $\ell_0$ penalty is ideal for this task, the caveat being it requires the NP-hard combinatorial evaluation of all models. A recent area of considerable interest is to develop efficient algorithms to fit models with a non-convex $\ell_γ$ penalty for $γ\in (0,1)$, which results in sparser models than the convex $\ell_1$ or lasso penalty, but is harder to fit. We propose an alternative, termed the horseshoe regularization penalty for feature subset selection, and demonstrate its theoretical and computational advantages. The distinguishing feature from existing non-convex optimization approaches is a full probabilistic representation of the penalty as the negative of the logarithm of a suitable prior, which in turn enables efficient expectation-maximization and local linear approximation algorithms for optimization and MCMC for uncertainty quantification. In synthetic and real data, the resulting algorithms provide better statistical performance, and the computation requires a fraction of time of state-of-the-art non-convex solvers.

Submitted 22 June, 2017; v1 submitted 23 February, 2017; originally announced February 2017.

arXiv:1610.09750 [pdf, other]

Sequential Bayesian Learning for Merton's Jump Model with Stochastic Volatility

Authors: Eric Jacquier, Nicholas Polson, Vadim Sokolov

Abstract: Jump stochastic volatility models are central to financial econometrics for volatility forecasting, portfolio risk management, and derivatives pricing. Markov Chain Monte Carlo (MCMC) algorithms are computationally infeasible for the sequential learning of volatility state variables and parameters, whereby the investor must update all posterior and predictive densities as new information arrives. We develop a particle filtering and learning algorithm to sample the posterior distribution in Merton's jump stochastic volatility model. This allows us to filter spot volatilities and jump times, together with sequentially updating (learning) of jump and volatility parameters. We illustrate our methodology on Google's stock return. We conclude with directions for future research.

Submitted 30 October, 2016; originally announced October 2016.

arXiv:1606.01701 [pdf, ps, other]

Regularizing Bayesian Predictive Regressions

Authors: Guanhao Feng, Nicholas G. Polson

Abstract: We show that regularizing Bayesian predictive regressions provides a framework for prior sensitivity analysis. We develop a procedure that jointly regularizes expectations and variance-covariance matrices using a pair of shrinkage priors. Our methodology applies directly to vector autoregressions (VAR) and seemingly unrelated regressions (SUR). The regularization path provides a prior sensitivity diagnostic. By exploiting a duality between regularization penalties and predictive prior distributions, we reinterpret two classic Bayesian analyses of macro-finance studies: equity premium predictability and forecasting macroeconomic growth rates. We find there exist plausible prior specifications for predictability in excess S&P 500 index returns using book-to-market ratios, CAY (consumption, wealth, income ratio), and T-bill rates. We evaluate the forecasts using a market-timing strategy, and we show the optimally regularized solution outperforms a buy-and-hold approach. A second empirical application involves forecasting industrial production, inflation, and consumption growth rates, and demonstrates the feasibility of our approach.

Submitted 13 September, 2017; v1 submitted 6 June, 2016; originally announced June 2016.

arXiv:1604.04527 [pdf, other]

doi 10.1016/j.trc.2017.02.024

Deep Learning for Short-Term Traffic Flow Prediction

Authors: Nicholas Polson, Vadim Sokolov

Abstract: We develop a deep learning model to predict traffic flows. The main contribution is the development of an architecture that combines a linear model that is fitted using $\ell_1$ regularization and a sequence of $\tanh$ layers. The challenge of predicting traffic flows is the sharp nonlinearities due to transitions between free flow, breakdown, recovery and congestion. We show that deep learning architectures can capture these nonlinear spatio-temporal effects. The first layer identifies spatio-temporal relations among predictors and other layers model nonlinear relations. We illustrate our methodology on road sensor data from Interstate I-55 and predict traffic flows during two special events: a Chicago Bears football game and an extreme snowstorm event. Both cases have sharp traffic flow regime changes, occurring very suddenly, and we show how deep learning provides precise short term traffic flow predictions.

Submitted 27 February, 2017; v1 submitted 15 April, 2016; originally announced April 2016.

arXiv:1604.03614 [pdf, other]

doi 10.1515/jqas-2016-0039

The Market for English Premier League (EPL) Odds

Authors: Guanhao Feng, Nicholas G. Polson, Jianeng Xu

Abstract: This paper employs a Skellam process to represent real-time betting odds for English Premier League (EPL) soccer games. Given a matrix of market odds on all possible score outcomes, we estimate the expected scoring rates for each team. The expected scoring rates then define the implied volatility of an EPL game. As events in the game evolve, we re-estimate the expected scoring rates and our implied volatility measure to provide a dynamic representation of the market's expectation of the game outcome. Using a dataset of 1520 EPL games from 2012-2016, we show how our model calibrates well to the game outcome. We illustrate our methodology on real-time market odds data for a game between Everton and West Ham in the 2015-2016 season. We show how the implied volatility for the outcome evolves as goals, red cards, and corner kicks occur. Finally, we conclude with directions for future research.

Submitted 5 January, 2017; v1 submitted 12 April, 2016; originally announced April 2016.

Journal ref: Journal of Quantitative Analysis in Sports, 12.4 (2017): 167-178

arXiv:1602.01445 [pdf, ps, other]

Sequential Bayesian Analysis of Multivariate Count Data

Authors: Tevfik Aktekin, Nicholas G. Polson, Refik Soyer

Abstract: We develop a new class of dynamic multivariate Poisson count models that allow for fast online updating and we refer to these models as multivariate Poisson-scaled beta (MPSB). The MPSB model allows for serial dependence in the counts as well as dependence across multiple series with a random common environment. Other notable features include analytic forms for state propagation and predictive likelihood densities. Sequential updating occurs through the updating of the sufficient statistics for static model parameters, leading to a fully adapted particle learning algorithm and a new class of predictive likelihoods and marginal distributions which we refer to as the (dynamic) multivariate confluent hyper-geometric negative binomial distribution (MCHG-NB) and the dynamic multivariate negative binomial (DMNB) distribution. To illustrate our methodology, we use various simulation studies and count data on weekly non-durable goods consumer demand.

Submitted 15 September, 2016; v1 submitted 3 February, 2016; originally announced February 2016.

Comments: 31 pages, 9 figures

arXiv:1511.06750 [pdf, other]

A deconvolution path for mixtures

Authors: Oscar Hernan Madrid Padilla, Nicholas G. Polson, James G. Scott

Abstract: We propose a class of estimators for deconvolution in mixture models based on a simple two-step "bin-and-smooth" procedure applied to histogram counts. The method is both statistically and computationally efficient: by exploiting recent advances in convex optimization, we are able to provide a full deconvolution path that shows the estimate for the mixing distribution across a range of plausible degrees of smoothness, at far less cost than a full Bayesian analysis. This enables practitioners to conduct a sensitivity analysis with minimal effort. This is especially important for applied data analysis, given the ill-posed nature of the deconvolution problem. Our results establish the favorable theoretical properties of our estimator and show that it offers state-of-the-art performance when compared to benchmark methods across a range of scenarios.

Submitted 25 May, 2017; v1 submitted 20 November, 2015; originally announced November 2015.

Journal ref: Electronic Journal of Statistics Volume 12, Number 1 (2018), 1717-1751

arXiv:1510.03516 [pdf, ps, other]

Default Bayesian analysis with global-local shrinkage priors

Authors: Anindya Bhadra, Jyotishka Datta, Nicholas G. Polson, Brandon T. Willard

Abstract: We provide a framework for assessing the default nature of a prior distribution using the property of regular variation, which we study for global-local shrinkage priors. In particular, we demonstrate the horseshoe priors, originally designed to handle sparsity, also possess regular variation and thus are appropriate for default Bayesian analysis. To illustrate our methodology, we solve a problem of non-informative priors due to Efron (1973), who showed standard flat non-informative priors in high-dimensional normal means model can be highly informative for nonlinear parameters of interest. We consider four such problems and show global-local shrinkage priors such as the horseshoe and horseshoe+ perform as Efron (1973) requires in each case. We find the reason for this lies in the ability of the global-local shrinkage priors to separate a low-dimensional signal embedded in high-dimensional noise, even for nonlinear functions.

Submitted 14 May, 2016; v1 submitted 12 October, 2015; originally announced October 2015.

Comments: 28 pages, 7 figures, 6 tables

MSC Class: 62C10; 62F15

arXiv:1509.06061 [pdf, other]

A Statistical Theory of Deep Learning via Proximal Splitting

Authors: Nicholas G. Polson, Brandon T. Willard, Massoud Heidari

Abstract: In this paper we develop a statistical theory and an implementation of deep learning models. We show that an elegant variable splitting scheme for the alternating direction method of multipliers optimises a deep learning objective. We allow for non-smooth non-convex regularisation penalties to induce sparsity in parameter weights. We provide a link between traditional shallow layer statistical models such as principal component and sliced inverse regression and deep layer models. We also define the degrees of freedom of a deep learning predictor and a predictive MSE criterion to perform model selection for comparing architecture designs. We focus on deep multiclass logistic learning although our methods apply more generally. Our results suggest an interesting and previously under-exploited relationship between deep learning and proximal splitting techniques. To illustrate our methodology, we provide a multi-class logit classification analysis of Fisher's Iris data where we illustrate the convergence of our algorithm. Finally, we conclude with directions for future research.

Submitted 20 September, 2015; originally announced September 2015.

arXiv:1502.03175 [pdf, other]

Proximal Algorithms in Statistics and Machine Learning

Authors: Nicholas G. Polson, James G. Scott, Brandon T. Willard

Abstract: In this paper we develop proximal methods for statistical learning. Proximal point algorithms are useful in statistics and machine learning for obtaining optimization solutions for composite functions. Our approach exploits closed-form solutions of proximal operators and envelope representations based on the Moreau, Forward-Backward, Douglas-Rachford and Half-Quadratic envelopes. Envelope representations lead to novel proximal algorithms for statistical optimisation of composite objective functions which include both non-smooth and non-convex objectives. We illustrate our methodology with regularized Logistic and Poisson regression and non-convex bridge penalties with a fused lasso norm. We provide a discussion of convergence of non-descent algorithms with acceleration and for non-convex functions. Finally, we provide directions for future research.

Submitted 30 May, 2015; v1 submitted 10 February, 2015; originally announced February 2015.

arXiv:1411.5076 [pdf, other]

doi 10.1109/TITS.2017.2650947

Bayesian Particle Tracking of Traffic Flows

Authors: Nicholas Polson, Vadim Sokolov

Abstract: We develop a Bayesian particle filter for tracking traffic flows that is capable of capturing non-linearities and discontinuities present in flow dynamics. Our model includes a hidden state variable that captures sudden regime shifts between traffic free flow, breakdown and recovery. We develop an efficient particle learning algorithm for real-time on-line inference of states and parameters. This requires a two-step approach: first, resampling the current particles with a mixture predictive distribution, and second, propagating states using the conditional posterior distribution. Particle learning of parameters follows from updating recursions for conditional sufficient statistics. To illustrate our methodology, we analyze measurements of daily traffic flow from the Illinois interstate I-55 highway system. We demonstrate how our filter can be used to infer the change of traffic flow regime on a highway road segment based on a measurement from freeway single-loop detectors. Finally, we conclude with directions for future research.

Submitted 15 November, 2015; v1 submitted 18 November, 2014; originally announced November 2014.

MSC Class: 60K35

arXiv:1409.6034 [pdf, ps, other]

doi 10.1214/15-AOAS853

Bayesian analysis of traffic flow on interstate I-55: The LWR model

Authors: Nicholas Polson, Vadim Sokolov

Abstract: Transportation departments take actions to manage traffic flow and reduce travel times based on estimated current and projected traffic conditions. Travel time estimates and forecasts require information on traffic density, which is combined with a model to project traffic flow such as the Lighthill-Whitham-Richards (LWR) model. We develop a particle filtering and learning algorithm to estimate the current traffic density state and the LWR parameters. These inputs are related to the so-called fundamental diagram, which describes the relationship between traffic flow and density. We build on existing methodology by allowing real-time updating of the posterior uncertainty for the critical density and capacity parameters. Our methodology is applied to traffic flow data from interstate highway I-55 in Chicago. We provide a real-time data analysis of how to learn the drop in capacity as a result of a major traffic accident. Our algorithm allows us to accurately assess the uncertainty of the current traffic state at shock waves, where the uncertainty is a mixture distribution. We show that Bayesian learning can correct the estimation bias that is present in the model with fixed parameters.

Submitted 29 January, 2016; v1 submitted 21 September, 2014; originally announced September 2014.

Comments: Published at http://dx.doi.org/10.1214/15-AOAS853 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS853

Journal ref: Annals of Applied Statistics 2015, Vol. 9, No. 4, 1864-1888

arXiv:1409.3601 [pdf, other]

Vertical-likelihood Monte Carlo

Authors: Nicholas G. Polson, James G. Scott

Abstract: In this review, we address the use of Monte Carlo methods for approximating definite integrals of the form $Z = \int L(x) d P(x)$, where $L$ is a target function (often a likelihood) and $P$ a finite measure. We present vertical-likelihood Monte Carlo, which is an approach for designing the importance function $g(x)$ used in importance sampling. Our approach exploits a duality between two random variables: the random draw $X \sim g$, and the corresponding random likelihood ordinate $Y\equiv L(X)$ of the draw. It is natural to specify $g(x)$ and ask: what is the implied distribution of $Y$? In this paper, we take up the opposite question: what should the distribution of $Y$ be so that the implied importance function $g(x)$ is good for approximating $Z$? Our answer turns out to unite seven seemingly disparate classes of algorithms under the vertical-likelihood perspective: importance sampling, slice sampling, simulated annealing/tempering, the harmonic-mean estimator, the vertical-density sampler, nested sampling, and energy-level sampling (a suite of related methods from statistical physics). In particular, we give an alternate presentation of nested sampling, paying special attention to the connection between this method and the vertical-likelihood perspective articulated here. As an alternative to nested sampling, we describe an MCMC method based on re-weighted slice sampling. This method's convergence properties are studied, and two examples demonstrate the promise of the overall approach.

Submitted 23 June, 2015; v1 submitted 11 September, 2014; originally announced September 2014.

arXiv:1406.0177 [pdf, other]

Mixtures, envelopes, and hierarchical duality

Authors: Nicholas G. Polson, James G. Scott

Abstract: We develop a connection between mixture and envelope representations of objective functions that arise frequently in statistics. We refer to this connection using the term "hierarchical duality." Our results suggest an interesting and previously under-exploited relationship between marginalization and profiling, or equivalently between the Fenchel--Moreau theorem for convex functions and the Bernstein--Widder theorem for Laplace transforms. We give several different sets of conditions under which such a duality result obtains. We then extend existing work on envelope representations in several ways, including novel generalizations to variance-mean models and to multivariate Gaussian location models. This turns out to provide an elegant missing-data interpretation of the proximal gradient method, a widely used algorithm in machine learning. We show several statistical applications in which the proposed framework leads to easily implemented algorithms, including a robust version of the fused lasso, nonlinear quantile regression via trend filtering, and the binomial fused double Pareto model. Code for the examples is available on GitHub at https://github.com/jgscott/hierduals.

Submitted 22 February, 2015; v1 submitted 1 June, 2014; originally announced June 2014.

arXiv:1405.0506 [pdf, other]

Sampling Polya-Gamma random variates: alternate and approximate techniques

Authors: Jesse Windle, Nicholas G. Polson, James G. Scott

Abstract: Efficiently sampling from the Pólya-Gamma distribution, ${PG}(b,z)$, is an essential element of Pólya-Gamma data augmentation. Polson et al. (2013) show how to efficiently sample from the ${PG}(1,z)$ distribution. We build two new samplers that offer improved performance when sampling from the ${PG}(b,z)$ distribution and $b$ is not unity.

Submitted 2 May, 2014; originally announced May 2014.

arXiv:1212.2135 [pdf, other]

Optimisation via Slice Sampling

Authors: John R. Birge, Nicholas G. Polson

Abstract: In this paper, we develop a simulation-based approach to optimisation with multi-modal functions using slice sampling. Our method specifies the objective function as an energy potential in a Boltzmann distribution and then we use auxiliary exponential slice variables to provide samples for a variety of energy levels. Our slice sampler draws uniformly over the augmented slice region. We identify the global modes by projecting the path of the chain back to the underlying space. Four standard test functions are used to illustrate the methodology: Rosenbrock, Himmelblau, Rastrigin, and Shubert. These functions demonstrate the flexibility of our approach as they include functions with long ridges (Rosenbrock), multi-modality (Himmelblau, Shubert) and many local modes dominated by one global (Rastrigin). The methods described here are implemented in the R package McmcOpt.

Submitted 10 December, 2012; originally announced December 2012.

Comments: 22 pages, 6 figures

MSC Class: 46N10

arXiv:1212.0534 [pdf, other]

Split Sampling: Expectations, Normalisation and Rare Events

Authors: John R. Birge, Changgee Chang, Nicholas G. Polson

Abstract: In this paper we develop a methodology that we call split sampling methods to estimate high dimensional expectations and rare event probabilities. Split sampling uses an auxiliary variable MCMC simulation and expresses the expectation of interest as an integrated set of rare event probabilities. We derive our estimator from a Rao-Blackwellised estimate of a marginal auxiliary variable distribution. We illustrate our method with two applications. First, we compute a shortest network path rare event probability and compare our method of estimation to a cross-entropy approach. Then, we compute a normalisation constant of a high dimensional mixture of Gaussians and compare our estimate to one based on nested sampling. We discuss the relationship between our method and other alternatives such as the product of conditional probability estimator and importance sampling. The methods developed here are available in the R package: SplitSampling.

Submitted 31 October, 2013; v1 submitted 3 December, 2012; originally announced December 2012.

MSC Class: 65C05; 65C40; 65C60

Showing 1–50 of 63 results for author: Polson, N