Sequential Monte Carlo for Cut-Bayesian Posterior Computation

Authors: Joseph Mathews, Giri Gopalan, James Gattiker, Sean Smith, Devin Francom

Abstract: We propose a sequential Monte Carlo (SMC) method to efficiently and accurately compute cut-Bayesian posterior quantities of interest, variations of standard Bayesian approaches constructed primarily to account for model misspecification. We prove finite sample concentration bounds for estimators derived from the proposed method along with a linear tempering extension and apply these results to a realistic setting where a computer model is misspecified. We then illustrate the SMC method for inference in a modular chemical reactor example that includes submodels for reaction kinetics, turbulence, mass transfer, and diffusion. The samples obtained are commensurate with a direct-sampling approach that consists of running multiple Markov chains, with computational efficiency gains using the SMC method. Overall, the SMC method presented yields a novel, rigorous approach to computing with cut-Bayesian posterior distributions.

Submitted 8 March, 2024; originally announced June 2024.

Report number: LA-UR-23-31546

arXiv:2405.15358 [pdf, ps, other]

Coordinated Multi-Neighborhood Learning on a Directed Acyclic Graph

Authors: Stephen Smith, Qing Zhou

Abstract: Learning the structure of causal directed acyclic graphs (DAGs) is useful in many areas of machine learning and artificial intelligence, with wide applications. However, in the high-dimensional setting, it is challenging to obtain good empirical and theoretical results without strong and often restrictive assumptions. Additionally, it is questionable whether all of the variables purported to be included in the network are observable. It is of interest then to restrict consideration to a subset of the variables for relevant and reliable inferences. In fact, researchers in various disciplines can usually select a set of target nodes in the network for causal discovery. This paper develops a new constraint-based method for estimating the local structure around multiple user-specified target nodes, enabling coordination in structure learning between neighborhoods. Our method facilitates causal discovery without learning the entire DAG structure. We establish consistency results for our algorithm with respect to the local neighborhood structure of the target nodes in the true graph. Experimental results on synthetic and real-world data show that our algorithm is more accurate in learning the neighborhood structures with much less computational cost than standard methods that estimate the entire DAG. An R package implementing our methods may be accessed at https://github.com/stephenvsmith/CML.

Submitted 24 May, 2024; originally announced May 2024.

Comments: 13 pages, 6 figures

arXiv:2404.01100 [pdf, other]

Finite Sample Frequency Domain Identification

Authors: Anastasios Tsiamis, Mohamed Abdalmoaty, Roy S. Smith, John Lygeros

Abstract: We study non-parametric frequency-domain system identification from a finite-sample perspective. We assume an open loop scenario where the excitation input is periodic and consider the Empirical Transfer Function Estimate (ETFE), where the goal is to estimate the frequency response at certain desired (evenly-spaced) frequencies, given input-output samples. We show that under sub-Gaussian colored noise (in time-domain) and stability assumptions, the ETFE estimates are concentrated around the true values. The error rate is of the order of $\mathcal{O}((d_{\mathrm{u}}+\sqrt{d_{\mathrm{u}}d_{\mathrm{y}}})\sqrt{M/N_{\mathrm{tot}}})$, where $N_{\mathrm{tot}}$ is the total number of samples, $M$ is the number of desired frequencies, and $d_{\mathrm{u}},\,d_{\mathrm{y}}$ are the dimensions of the input and output signals respectively. This rate remains valid for general irrational transfer functions and does not require a finite order state-space representation. By tuning $M$, we obtain a $N_{\mathrm{tot}}^{-1/3}$ finite-sample rate for learning the frequency response over all frequencies in the $\mathcal{H}_{\infty}$ norm. Our result draws upon an extension of the Hanson-Wright inequality to semi-infinite matrices. We study the finite-sample behavior of ETFE in simulations.

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2403.05899 [pdf, other]

Online Identification of Stochastic Continuous-Time Wiener Models Using Sampled Data

Authors: Mohamed Abdalmoaty, Efe C. Balta, John Lygeros, Roy S. Smith

Abstract: It is well known that ignoring the presence of stochastic disturbances in the identification of stochastic Wiener models leads to asymptotically biased estimators. On the other hand, optimal statistical identification, via likelihood-based methods, is sensitive to the assumptions on the data distribution and is usually based on relatively complex sequential Monte Carlo algorithms. We develop a simple recursive online estimation algorithm based on an output-error predictor, for the identification of continuous-time stochastic parametric Wiener models through stochastic approximation. The method is applicable to generic model parameterizations and, as demonstrated in the numerical simulation examples, it is robust with respect to the assumptions on the spectrum of the disturbance process.

Submitted 9 March, 2024; originally announced March 2024.

arXiv:2401.11804 [pdf, other]

Regression Copulas for Multivariate Responses

Authors: Nadja Klein, Michael Stanley Smith, David Nott, Ryan Chisholm

Abstract: We propose a novel distributional regression model for a multivariate response vector based on a copula process over the covariate space. It uses the implicit copula of a Gaussian multivariate regression, which we call a "regression copula". To allow for large covariate vectors their coefficients are regularized using a novel multivariate extension of the horseshoe prior. Bayesian inference and distributional predictions are evaluated using efficient variational inference methods, allowing application to large datasets. An advantage of the approach is that the marginal distributions of the response vector can be estimated separately and accurately, resulting in predictive distributions that are marginally-calibrated. Two substantive applications of the methodology highlight its efficacy in multivariate modeling. The first is the econometric modeling and prediction of half-hourly regional Australian electricity prices. Here, our approach produces more accurate distributional forecasts than leading benchmark methods. The second is the evaluation of multivariate posteriors in likelihood-free inference (LFI) of a model for tree species abundance data, extending a previous univariate regression copula LFI method. In both applications, we demonstrate that our new approach exhibits a desirable marginal calibration property.

Submitted 5 March, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

arXiv:2310.03521 [pdf, other]

Cutting Feedback in Misspecified Copula Models

Authors: Michael Stanley Smith, Weichang Yu, David J. Nott, David Frazier

Abstract: In copula models the marginal distributions and copula function are specified separately. We treat these as two modules in a modular Bayesian inference framework, and propose conducting modified Bayesian inference by "cutting feedback". Cutting feedback limits the influence of potentially misspecified modules in posterior inference. We consider two types of cuts. The first limits the influence of a misspecified copula on inference for the marginals, which is a Bayesian analogue of the popular Inference for Margins (IFM) estimator. The second limits the influence of misspecified marginals on inference for the copula parameters by using a pseudo likelihood of the ranks to define the cut model. We establish that if only one of the modules is misspecified, then the appropriate cut posterior gives accurate uncertainty quantification asymptotically for the parameters in the other module. Computation of the cut posteriors is difficult, and new variational inference methods to do so are proposed. The efficacy of the new methodology is demonstrated using both simulated data and a substantive multivariate time series application from macroeconomic forecasting. In the latter, cutting feedback from misspecified marginals to a 1096 dimension copula improves posterior inference and predictive accuracy greatly, compared to conventional Bayesian inference.

Submitted 27 June, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

arXiv:2308.05564 [pdf, other]

Large Skew-t Copula Models and Asymmetric Dependence in Intraday Equity Returns

Authors: Lin Deng, Michael Stanley Smith, Worapree Maneesoonthorn

Abstract: Skew-t copula models are attractive for the modeling of financial data because they allow for asymmetric and extreme tail dependence. We show that the copula implicit in the skew-t distribution of Azzalini and Capitanio (2003) allows for a higher level of pairwise asymmetric dependence than two popular alternative skew-t copulas. Estimation of this copula in high dimensions is challenging, and we propose a fast and accurate Bayesian variational inference (VI) approach to do so. The method uses a generative representation of the skew-t distribution to define an augmented posterior that can be approximated accurately. A stochastic gradient ascent algorithm is used to solve the variational optimization. The methodology is used to estimate skew-t factor copula models with up to 15 factors for intraday returns from 2017 to 2021 on 93 U.S. equities. The copula captures substantial heterogeneity in asymmetric dependence over equity pairs, in addition to the variability in pairwise correlations. In a moving window study we show that the asymmetric dependencies also vary over time, and that intraday predictive densities from the skew-t copula are more accurate than those from benchmark copula models. Portfolio selection strategies based on the estimated pairwise asymmetric dependencies improve performance relative to the index.

Submitted 2 July, 2024; v1 submitted 10 August, 2023; originally announced August 2023.

arXiv:2303.09842 [pdf, ps, other]

Error Bounds for Kernel-Based Linear System Identification with Unknown Hyperparameters

Authors: Mingzhou Yin, Roy S. Smith

Abstract: The kernel-based method has been successfully applied in linear system identification using stable kernel designs. From a Gaussian process perspective, it automatically provides probabilistic error bounds for the identified models from the posterior covariance, which are useful in robust and stochastic control. However, the error bounds require knowledge of the true hyperparameters in the kernel design and are demonstrated to be inaccurate with estimated hyperparameters for lightly damped systems or in the presence of high noise. In this work, we provide reliable quantification of the estimation error when the hyperparameters are unknown. The bounds are obtained by first constructing a high-probability set for the true hyperparameters from the marginal likelihood function and then finding the worst-case posterior covariance within the set. The proposed bound is proven to contain the true model with a high probability and its validity is verified in numerical simulation.

Submitted 17 March, 2023; originally announced March 2023.

arXiv:2302.13861 [pdf, other]

Differentially Private Diffusion Models Generate Useful Synthetic Images

Authors: Sahra Ghalebikesabi, Leonard Berrada, Sven Gowal, Ira Ktena, Robert Stanforth, Jamie Hayes, Soham De, Samuel L. Smith, Olivia Wiles, Borja Balle

Abstract: The ability to generate privacy-preserving synthetic versions of sensitive image datasets could unlock numerous ML applications currently constrained by data availability. Due to their astonishing image generation quality, diffusion models are a prime candidate for generating high-quality synthetic data. However, recent studies have found that, by default, the outputs of some diffusion models do not preserve training data privacy. By privately fine-tuning ImageNet pre-trained diffusion models with more than 80M parameters, we obtain SOTA results on CIFAR-10 and Camelyon17 in terms of both FID and the accuracy of downstream classifiers trained on synthetic data. We decrease the SOTA FID on CIFAR-10 from 26.2 to 9.8, and increase the accuracy from 51.0% to 88.0%. On synthetic data from Camelyon17, we achieve a downstream accuracy of 91.1% which is close to the SOTA of 96.5% when training on the real data. We leverage the ability of generative models to create infinite amounts of data to maximise the downstream prediction performance, and further show how to use synthetic data for hyperparameter tuning. Our results demonstrate that diffusion models fine-tuned with differential privacy can produce useful and provably private synthetic data, even in applications with significant distribution shift between the pre-training and fine-tuning distributions.

Submitted 27 February, 2023; originally announced February 2023.

arXiv:2302.13536 [pdf, other]

Natural Gradient Hybrid Variational Inference with Application to Deep Mixed Models

Authors: Weiben Zhang, Michael Stanley Smith, Worapree Maneesoonthorn, Ruben Loaiza-Maya

Abstract: Stochastic models with global parameters $\bm{\theta}$ and latent variables $\bm{z}$ are common, and variational inference (VI) is popular for their estimation. This paper uses a variational approximation (VA) that comprises a Gaussian with factor covariance matrix for the marginal of $\bm{\theta}$, and the exact conditional posterior of $\bm{z}|\bm{\theta}$. Stochastic optimization for learning the VA only requires generation of $\bm{z}$ from its conditional posterior, while $\bm{\theta}$ is updated using the natural gradient, producing a hybrid VI method. We show that this is a well-defined natural gradient optimization algorithm for the joint posterior of $(\bm{z},\bm{\theta})$. Fast to compute expressions for the Tikhonov damped Fisher information matrix required to compute a stable natural gradient update are derived. We use the approach to estimate probabilistic Bayesian neural networks with random output layer coefficients to allow for heterogeneity. Simulations show that using the natural gradient is more efficient than using the ordinary gradient, and that the approach is faster and more accurate than two leading benchmark natural gradient VI methods. In a financial application we show that accounting for industry level heterogeneity using the deep model improves the accuracy of probabilistic prediction of asset pricing models.

Submitted 27 February, 2023; originally announced February 2023.

arXiv:2302.10322 [pdf, other]

Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation

Authors: Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andrew Brock, Samuel L Smith, Yee Whye Teh

Abstract: Skip connections and normalisation layers form two standard architectural components that are ubiquitous for the training of Deep Neural Networks (DNNs), but whose precise roles are poorly understood. Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them, using insights from wide NN kernel theory to improve signal propagation in vanilla DNNs (which we define as networks without skips or normalisation). However, these approaches are incompatible with the self-attention layers present in transformers, whose kernels are intrinsically more complicated to analyse and control. And so the question remains: is it possible to train deep vanilla transformers? We answer this question in the affirmative by designing several approaches that use combinations of parameter initialisations, bias matrices and location-dependent rescaling to achieve faithful signal propagation in vanilla transformers. Our methods address various intricacies specific to signal propagation in transformers, including the interaction with positional encoding and causal masking. In experiments on WikiText-103 and C4, our approaches enable deep transformers without normalisation to train at speeds matching their standard counterparts, and deep vanilla transformers to reach the same performance as standard ones after about 5 times more iterations.

Submitted 20 February, 2023; originally announced February 2023.

Comments: ICLR 2023

arXiv:2204.13650 [pdf, other]

Unlocking High-Accuracy Differentially Private Image Classification through Scale

Authors: Soham De, Leonard Berrada, Jamie Hayes, Samuel L. Smith, Borja Balle

Abstract: Differential Privacy (DP) provides a formal privacy guarantee preventing adversaries with access to a machine learning model from extracting information about individual training points. Differentially Private Stochastic Gradient Descent (DP-SGD), the most popular DP training method for deep learning, realizes this protection by injecting noise during training. However previous works have found that DP-SGD often leads to a significant degradation in performance on standard image classification benchmarks. Furthermore, some authors have postulated that DP-SGD inherently performs poorly on large models, since the norm of the noise required to preserve privacy is proportional to the model dimension. In contrast, we demonstrate that DP-SGD on over-parameterized models can perform significantly better than previously thought. Combining careful hyper-parameter tuning with simple techniques to ensure signal propagation and improve the convergence rate, we obtain a new SOTA without extra data on CIFAR-10 of 81.4% under $(8, 10^{-5})$-DP using a 40-layer Wide-ResNet, improving over the previous SOTA of 71.7%. When fine-tuning a pre-trained NFNet-F3, we achieve a remarkable 83.8% top-1 accuracy on ImageNet under $(0.5, 8 \cdot 10^{-7})$-DP. Additionally, we also achieve 86.7% top-1 accuracy under $(8, 8 \cdot 10^{-7})$-DP, which is just 4.3% below the current non-private SOTA for this task. We believe our results are a significant step towards closing the accuracy gap between private and non-private image classification.

Submitted 16 June, 2022; v1 submitted 28 April, 2022; originally announced April 2022.

arXiv:2203.14731 [pdf, ps, other]

doi 10.1109/CDC51059.2022.9992728

Infinite-Dimensional Sparse Learning in Linear System Identification

Authors: Mingzhou Yin, Mehmet Tolga Akan, Andrea Iannelli, Roy S. Smith

Abstract: Regularized methods have been widely applied to system identification problems without known model structures. This paper proposes an infinite-dimensional sparse learning algorithm based on atomic norm regularization. Atomic norm regularization decomposes the transfer function into first-order atomic models and solves a group lasso problem that selects a sparse set of poles and identifies the corresponding coefficients. The difficulty in solving the problem lies in the fact that there are an infinite number of possible atomic models. This work proposes a greedy algorithm that generates new candidate atomic models maximizing the violation of the optimality condition of the existing problem. This algorithm is able to solve the infinite-dimensional group lasso problem with high precision. The algorithm is further extended to reduce the bias and reject false positives in pole location estimation by iteratively reweighted adaptive group lasso and complementary pairs stability selection respectively. Numerical results demonstrate that the proposed algorithm performs better than benchmark parameterized and regularized methods in terms of both impulse response fitting and pole location estimation.

Submitted 31 August, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

Comments: Accepted for presentation at IEEE Conference on Decision and Control 2022

Journal ref: 2022 IEEE 61st Conference on Decision and Control (CDC), Cancun, Mexico, 2022, pp. 850-855

arXiv:2201.05985 [pdf, other]

Exposing the Obscured Influence of State-Controlled Media: A Causal Estimation of Influence Between Media Outlets Via Quotation Propagation

Authors: Joseph Schlessinger, Richard Bennet, Jacob Coakwell, Steven T. Smith, Edward K. Kao

Abstract: This study quantifies influence between media outlets by applying a novel methodology that uses causal effect estimation on networks and transformer language models. We demonstrate the obscured influence of state-controlled outlets over other outlets, regardless of orientation, by analyzing a large dataset of quotations from over 100 thousand articles published by the most prominent European and Russian traditional media outlets, appearing between May 2018 and October 2019. The analysis maps out the network structure of influence with news wire services serving as prominent bridges that connect outlets in different geo-political spheres. Overall, this approach demonstrates capabilities to identify and quantify the channels of influence in intermedia agenda setting over specific topics.

Submitted 16 January, 2022; originally announced January 2022.

arXiv:2111.09511 [pdf, ps, other]

Implicit copula variational inference

Authors: Michael Stanley Smith, Rubén Loaiza-Maya

Abstract: Key to effective generic, or "black-box", variational inference is the selection of an approximation to the target density that balances accuracy and speed. Copula models are promising options, but calibration of the approximation can be slow for some choices. Smith et al. (2020) suggest using tractable and scalable "implicit copula" models that are formed by element-wise transformation of the target parameters. We propose an adjustment to these transformations that make the approximation invariant to the scale and location of the target density. We also show how a sub-class of elliptical copulas have a generative representation that allows easy application of the re-parameterization trick and efficient first order optimization. We demonstrate the estimation methodology using two statistical models as examples. The first is a mixed effects logistic regression, and the second is a regularized correlation matrix. For the latter, standard Markov chain Monte Carlo estimation methods can be slow or difficult to implement, yet our proposed variational approach provides an effective and scalable estimator. We illustrate by estimating a regularized Gaussian copula model for income inequality in U.S. states between 1917 and 2018. An Online Appendix and MATLAB code to implement the method are available as Supplementary Materials.

Submitted 29 June, 2022; v1 submitted 17 November, 2021; originally announced November 2021.

Comments: Abstract has been updated. The abstract of v2 is not up-to-date

arXiv:2111.00782 [pdf]

Unpacking uncertainty in the modelling process for energy policy making

Authors: Samuele Lo Piano, Máté János Lőrincz, Arnald Puy, Steve Pye, Andrea Saltelli, Stefán Thor Smith, Jeroen P. van der Sluijs

Abstract: This paper explores how the modelling of energy systems may lead to undue closure of alternatives by generating an excess of certainty around some of the possible policy options. We exemplify the problem with two cases: first, the International Institute for Applied Systems Analysis (IIASA) global modelling in the 1980s; and second, the modelling activity undertaken in support of the construction of a radioactive waste repository at Yucca Mountain (Nevada, USA). We discuss different methodologies for quality assessment that may help remedy this issue, which include NUSAP (Numeral Unit Spread Assessment Pedigree), diagnostic diagrams, and sensitivity auditing. We demonstrate the potential of these reflexive modelling practices in energy policy making with four additional cases: (i) stakeholders evaluation of the assessment of the external costs of a potential large-scale nuclear accident in Belgium in the context of the ExternE (External Costs of Energy) project; (ii) the case of the ESME (Energy System Modelling Environment) for the creation of UK energy policy; (iii) the NETs (Negative Emission Technologies) uptake in Integrated Assessment Models (IAMs); and (iv) the Ecological Footprint (EF) indicator. We encourage modellers to widely adopt these approaches to achieve more robust and inclusive modelling activities in the field of energy modelling.

Submitted 1 November, 2021; originally announced November 2021.

Comments: 39 pages, 2 tables, 3 figures

arXiv:2109.04718 [pdf, ps, other]

Implicit Copulas: An Overview

Authors: Michael Stanley Smith

Abstract: Implicit copulas are the most common copula choice for modeling dependence in high dimensions. This broad class of copulas is introduced and surveyed, including elliptical copulas, skew $t$ copulas, factor copulas, time series copulas and regression copulas. The common auxiliary representation of implicit copulas is outlined, and how this makes them both scalable and tractable for statistical modeling. Issues such as parameter identification, extended likelihoods for discrete or mixed data, parsimony in high dimensions, and simulation from the copula model are considered. Bayesian approaches to estimate the copula parameters, and predict from an implicit copula model, are outlined. Particular attention is given to implicit copula processes constructed from time series and regression models, which is at the forefront of current research. Two econometric applications -- one from macroeconomic time series and the other from financial asset pricing -- illustrate the advantages of implicit copula models.

Submitted 10 September, 2021; originally announced September 2021.

arXiv:2108.11066 [pdf, other]

Variational inference for cutting feedback in misspecified models

Authors: Xuejun Yu, David J. Nott, Michael Stanley Smith

Abstract: Bayesian analyses combine information represented by different terms in a joint Bayesian model. When one or more of the terms is misspecified, it can be helpful to restrict the use of information from suspect model components to modify posterior inference. This is called "cutting feedback", and both the specification and computation of the posterior for such "cut models" is challenging. In this paper, we define cut posterior distributions as solutions to constrained optimization problems, and propose optimization-based variational methods for their computation. These methods are faster than existing Markov chain Monte Carlo (MCMC) approaches for computing cut posterior distributions by an order of magnitude. It is also shown that variational methods allow for the evaluation of computationally intensive conflict checks that can be used to decide whether or not feedback should be cut. Our methods are illustrated in a number of simulated and real examples, including an application where recent methodological advances that combine variational inference and MCMC within the variational optimization are used.

Submitted 24 June, 2022; v1 submitted 25 August, 2021; originally announced August 2021.

arXiv:2102.06171 [pdf, other]

High-Performance Large-Scale Image Recognition Without Normalization

Authors: Andrew Brock, Soham De, Samuel L. Smith, Karen Simonyan

Abstract: Batch normalization is a key component of most image classification models, but it has many undesirable properties stemming from its dependence on the batch size and interactions between examples. Although recent work has succeeded in training deep ResNets without normalization layers, these models do not match the test accuracies of the best batch-normalized networks, and are often unstable for large learning rates or strong data augmentations. In this work, we develop an adaptive gradient clipping technique which overcomes these instabilities, and design a significantly improved class of Normalizer-Free ResNets. Our smaller models match the test accuracy of an EfficientNet-B7 on ImageNet while being up to 8.7x faster to train, and our largest models attain a new state-of-the-art top-1 accuracy of 86.5%. In addition, Normalizer-Free models attain significantly better performance than their batch-normalized counterparts when finetuning on ImageNet after large-scale pre-training on a dataset of 300 million labeled images, with our best models obtaining an accuracy of 89.2%. Our code is available at https://github.com/deepmind/deepmind-research/tree/master/nfnets

Submitted 11 February, 2021; originally announced February 2021.

arXiv:2101.12176 [pdf, other]

On the Origin of Implicit Regularization in Stochastic Gradient Descent

Authors: Samuel L. Smith, Benoit Dherin, David G. T. Barrett, Soham De

Abstract: For infinitesimal learning rates, stochastic gradient descent (SGD) follows the path of gradient flow on the full batch loss function. However moderately large learning rates can achieve higher test accuracies, and this generalization benefit is not explained by convergence bounds, since the learning rate which maximizes test accuracy is often larger than the learning rate which minimizes training loss. To interpret this phenomenon we prove that for SGD with random shuffling, the mean SGD iterate also stays close to the path of gradient flow if the learning rate is small and finite, but on a modified loss. This modified loss is composed of the original loss function and an implicit regularizer, which penalizes the norms of the minibatch gradients. Under mild assumptions, when the batch size is small the scale of the implicit regularization term is proportional to the ratio of the learning rate to the batch size. We verify empirically that explicitly including the implicit regularizer in the loss can enhance the test accuracy when the learning rate is small.

Submitted 28 January, 2021; originally announced January 2021.

Comments: Accepted as a conference paper at ICLR 2021

arXiv:2101.08692 [pdf, other]

Characterizing signal propagation to close the performance gap in unnormalized ResNets

Authors: Andrew Brock, Soham De, Samuel L. Smith

Abstract: Batch Normalization is a key component in almost all state-of-the-art image classifiers, but it also introduces practical challenges: it breaks the independence between training examples within a batch, can incur compute and memory overhead, and often results in unexpected bugs. Building on recent theoretical analyses of deep ResNets at initialization, we propose a simple set of analysis tools to characterize signal propagation on the forward pass, and leverage these tools to design highly performant ResNets without activation normalization layers. Crucial to our success is an adapted version of the recently proposed Weight Standardization. Our analysis tools show how this technique preserves the signal in networks with ReLU or Swish activation functions by ensuring that the per-channel activation means do not grow with depth. Across a range of FLOP budgets, our networks attain performance competitive with the state-of-the-art EfficientNets on ImageNet.

Submitted 27 January, 2021; v1 submitted 21 January, 2021; originally announced January 2021.

Comments: Published as a conference paper at ICLR 2021

arXiv:2010.10241 [pdf, ps, other]

BYOL works even without batch statistics

Authors: Pierre H. Richemond, Jean-Bastien Grill, Florent Altché, Corentin Tallec, Florian Strub, Andrew Brock, Samuel Smith, Soham De, Razvan Pascanu, Bilal Piot, Michal Valko

Abstract: Bootstrap Your Own Latent (BYOL) is a self-supervised learning approach for image representation. From an augmented view of an image, BYOL trains an online network to predict a target network representation of a different augmented view of the same image. Unlike contrastive methods, BYOL does not explicitly use a repulsion term built from negative pairs in its training objective. Yet, it avoids collapse to a trivial, constant representation. Thus, it has recently been hypothesized that batch normalization (BN) is critical to prevent collapse in BYOL. Indeed, BN flows gradients across batch elements, and could leak information about negative views in the batch, which could act as an implicit negative (contrastive) term. However, we experimentally show that replacing BN with a batch-independent normalization scheme (namely, a combination of group normalization and weight standardization) achieves performance comparable to vanilla BYOL ($73.9\%$ vs. $74.3\%$ top-1 accuracy under the linear evaluation protocol on ImageNet with ResNet-$50$). Our finding disproves the hypothesis that the use of batch statistics is a crucial ingredient for BYOL to learn useful representations.

Submitted 20 October, 2020; originally announced October 2020.

arXiv:2010.01844 [pdf, ps, other]

doi 10.1002/jae.2959

Deep Distributional Time Series Models and the Probabilistic Forecasting of Intraday Electricity Prices

Authors: Nadja Klein, Michael Stanley Smith, David J. Nott

Abstract: Recurrent neural networks (RNNs) with rich feature vectors of past values can provide accurate point forecasts for series that exhibit complex serial dependence. We propose two approaches to constructing deep time series probabilistic models based on a variant of RNN called an echo state network (ESN). The first is where the output layer of the ESN has stochastic disturbances and a shrinkage prior for additional regularization. The second approach employs the implicit copula of an ESN with Gaussian disturbances, which is a deep copula process on the feature space. Combining this copula with a non-parametrically estimated marginal distribution produces a deep distributional time series model. The resulting probabilistic forecasts are deep functions of the feature vector and also marginally calibrated. In both approaches, Bayesian Markov chain Monte Carlo methods are used to estimate the models and compute forecasts. The proposed models are suitable for the complex task of forecasting intraday electricity prices. Using data from the Australian National Electricity Market, we show that our deep time series models provide accurate short term probabilistic price forecasts, with the copula model dominating. Moreover, the models provide a flexible framework for incorporating probabilistic forecasts of electricity demand as additional features, which increases upper tail forecast accuracy from the copula model significantly.

Submitted 27 May, 2021; v1 submitted 5 October, 2020; originally announced October 2020.

Journal ref: Journal of Applied Econometrics (2023), 38(4), 493-511

arXiv:2008.00029 [pdf, other]

Cold Posteriors and Aleatoric Uncertainty

Authors: Ben Adlam, Jasper Snoek, Samuel L. Smith

Abstract: Recent work has observed that one can outperform exact inference in Bayesian neural networks by tuning the "temperature" of the posterior on a validation set (the "cold posterior" effect). To help interpret this phenomenon, we argue that commonly used priors in Bayesian neural networks can significantly overestimate the aleatoric uncertainty in the labels on many classification datasets. This problem is particularly pronounced in academic benchmarks like MNIST or CIFAR, for which the quality of the labels is high. For the special case of Gaussian process regression, any positive temperature corresponds to a valid posterior under a modified prior, and tuning this temperature is directly analogous to empirical Bayes. On classification tasks, there is no direct equivalence between modifying the prior and tuning the temperature, however reducing the temperature can lead to models which better reflect our belief that one gains little information by relabeling existing examples in the training set. Therefore although cold posteriors do not always correspond to an exact inference procedure, we believe they may often better reflect our true prior beliefs.

Submitted 31 July, 2020; originally announced August 2020.

Comments: 5 pages, 3 figures

Journal ref: ICML workshop on Uncertainty and Robustness in Deep Learning (2020)

arXiv:2006.15081 [pdf, other]

On the Generalization Benefit of Noise in Stochastic Gradient Descent

Authors: Samuel L. Smith, Erich Elsen, Soham De

Abstract: It has long been argued that minibatch stochastic gradient descent can generalize better than large batch gradient descent in deep neural networks. However recent papers have questioned this claim, arguing that this effect is simply a consequence of suboptimal hyperparameter tuning or insufficient compute budgets when the batch size is large. In this paper, we perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set. This occurs even when both models are trained for the same number of iterations and large batches achieve smaller training losses. Our results confirm that the noise in stochastic gradients can enhance generalization. We study how the optimal learning rate schedule changes as the epoch budget grows, and we provide a theoretical account of our observations based on the stochastic differential equation perspective of SGD dynamics.

Submitted 26 June, 2020; originally announced June 2020.

Comments: Camera-ready version of ICML 2020

arXiv:2006.08287 [pdf, other]

ICAM: Interpretable Classification via Disentangled Representations and Feature Attribution Mapping

Authors: Cher Bass, Mariana da Silva, Carole Sudre, Petru-Daniel Tudosiu, Stephen M. Smith, Emma C. Robinson

Abstract: Feature attribution (FA), or the assignment of class-relevance to different locations in an image, is important for many classification problems but is particularly crucial within the neuroscience domain, where accurate mechanistic models of behaviours, or disease, require knowledge of all features discriminative of a trait. At the same time, predicting class relevance from brain images is challenging as phenotypes are typically heterogeneous, and changes occur against a background of significant natural variation. Here, we present a novel framework for creating class specific FA maps through image-to-image translation. We propose the use of a VAE-GAN to explicitly disentangle class relevance from background features for improved interpretability properties, which results in meaningful FA maps. We validate our method on 2D and 3D brain image datasets of dementia (ADNI dataset), ageing (UK Biobank), and (simulated) lesion detection. We show that FA maps generated by our method outperform baseline FA methods when validated against ground truth. More significantly, our approach is the first to use latent space sampling to support exploration of phenotype variation. Our code will be available online at https://github.com/CherBass/ICAM.

Submitted 16 June, 2020; v1 submitted 15 June, 2020; originally announced June 2020.

Comments: Submitted to NeurIPS 2020: Neural Information Processing Systems. Keywords: interpretable, classification, feature attribution, domain translation, variational autoencoder, generative adversarial network, neuroimaging

arXiv:2006.05475 [pdf, other]

Simple and efficient algorithms for training machine learning potentials to force data

Authors: Justin S. Smith, Nicholas Lubbers, Aidan P. Thompson, Kipton Barros

Abstract: Machine learning models, trained on data from ab initio quantum simulations, are yielding molecular dynamics potentials with unprecedented accuracy. One limiting factor is the quantity of available training data, which can be expensive to obtain. A quantum simulation often provides all atomic forces, in addition to the total energy of the system. These forces provide much more information than the energy alone. It may appear that training a model to this large quantity of force data would introduce significant computational costs. Actually, training to all available force data should only be a few times more expensive than training to energies alone. Here, we present a new algorithm for efficient force training, and benchmark its accuracy by training to forces from real-world datasets for organic chemistry and bulk aluminum.

Submitted 9 June, 2020; originally announced June 2020.

arXiv:2005.10879 [pdf, other]

doi 10.1073/pnas.2011216118

Automatic Detection of Influential Actors in Disinformation Networks

Authors: Steven T. Smith, Edward K. Kao, Erika D. Mackin, Danelle C. Shah, Olga Simek, Donald B. Rubin

Abstract: The weaponization of digital communications and social media to conduct disinformation campaigns at immense scale, speed, and reach presents new challenges to identify and counter hostile influence operations (IOs). This paper presents an end-to-end framework to automate detection of disinformation narratives, networks, and influential actors. The framework integrates natural language processing, machine learning, graph analytics, and a novel network causal inference approach to quantify the impact of individual actors in spreading IO narratives. We demonstrate its capability on real-world hostile IO campaigns with Twitter datasets collected during the 2017 French presidential elections, and known IO accounts disclosed by Twitter over a broad range of IO campaigns (May 2007 to February 2020), over 50,000 accounts, 17 countries, and different account types including both trolls and bots. Our system detects IO accounts with 96% precision, 79% recall, and 96% area-under-the-PR-curve, maps out salient network communities, and discovers high-impact accounts that escape the lens of traditional impact statistics based on activity counts and network centrality. Results are corroborated with independent sources of known IO accounts from U.S. Congressional reports, investigative journalism, and IO datasets provided by Twitter.

Submitted 7 January, 2021; v1 submitted 21 May, 2020; originally announced May 2020.

Comments: Proc. Natl. Acad. Sciences U.S.A. Vol. 118, No. 4, e2011216118

arXiv:2005.07430 [pdf, ps, other]

Fast and Accurate Variational Inference for Models with Many Latent Variables

Authors: Rubén Loaiza-Maya, Michael Stanley Smith, David J. Nott, Peter J. Danaher

Abstract: Models with a large number of latent variables are often used to fully utilize the information in big or complex data. However, they can be difficult to estimate using standard approaches, and variational inference methods are a popular alternative. Key to the success of these is the selection of an approximation to the target density that is accurate, tractable and fast to calibrate using optimization methods. Most existing choices can be inaccurate or slow to calibrate when there are many latent variables. Here, we propose a family of tractable variational approximations that are more accurate and faster to calibrate for this case. It combines a parsimonious parametric approximation for the parameter posterior, with the exact conditional posterior of the latent variables. We derive a simplified expression for the re-parameterization gradient of the variational lower bound, which is the main ingredient of efficient optimization algorithms used to implement variational estimation. To do so only requires the ability to generate exactly or approximately from the conditional posterior of the latent variables, rather than to compute its density. We illustrate using two complex contemporary econometric examples. The first is a nonlinear multivariate state space model for U.S. macroeconomic variables. The second is a random coefficients tobit model applied to two million sales by 20,000 individuals in a large consumer panel from a marketing study. In both cases, we show that our approximating family is considerably more accurate than mean field or structured Gaussian approximations, and faster than Markov chain Monte Carlo. Last, we show how to implement data sub-sampling in variational inference for our approximation, which can lead to a further reduction in computation time. MATLAB code implementing the method for our examples is included in supplementary material.

Submitted 18 April, 2021; v1 submitted 15 May, 2020; originally announced May 2020.

Comments: Macroeconomic example was replaced by the bigger and more challenging time varying parameter vector autoregression model with stochastic volatility. Microeconomic example was extended to 20,000 individuals and variational subsampling is also implemented for this example. Small microeconomics example now uses 1000 individuals

MSC Class: 62P20 ACM Class: G.3

arXiv:2002.10444 [pdf, other]

Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks

Authors: Soham De, Samuel L. Smith

Abstract: Batch normalization dramatically increases the largest trainable depth of residual networks, and this benefit has been crucial to the empirical success of deep residual networks on a wide range of benchmarks. We show that this key benefit arises because, at initialization, batch normalization downscales the residual branch relative to the skip connection, by a normalizing factor on the order of the square root of the network depth. This ensures that, early in training, the function computed by normalized residual blocks in deep networks is close to the identity function (on average). We use this insight to develop a simple initialization scheme that can train deep residual networks without normalization. We also provide a detailed empirical study of residual networks, which clarifies that, although batch normalized networks can be trained with larger learning rates, this effect is only beneficial in specific compute regimes, and has minimal benefits when the batch size is small.

Submitted 9 December, 2020; v1 submitted 24 February, 2020; originally announced February 2020.

Comments: Camera-ready version of NeurIPS 2020

arXiv:2002.10046 [pdf, other]

doi 10.1016/j.neuroimage.2020.117065

Permutation Inference for Canonical Correlation Analysis

Authors: Anderson M. Winkler, Olivier Renaud, Stephen M. Smith, Thomas E. Nichols

Abstract: Canonical correlation analysis (CCA) has become a key tool for population neuroimaging, allowing investigation of associations between many imaging and non-imaging measurements. As other variables are often a source of variability not of direct interest, previous work has used CCA on residuals from a model that removes these effects, then proceeded directly to permutation inference. We show that such a simple permutation test leads to inflated error rates. The reason is that residualisation introduces dependencies among the observations that violate the exchangeability assumption. Even in the absence of nuisance variables, however, a simple permutation test for CCA also leads to excess error rates for all canonical correlations other than the first. The reason is that a simple permutation scheme does not ignore the variability already explained by previous canonical variables. Here we propose solutions for both problems: in the case of nuisance variables, we show that transforming the residuals to a lower dimensional basis where exchangeability holds results in a valid permutation test; for more general cases, with or without nuisance variables, we propose estimating the canonical correlations in a stepwise manner, removing at each iteration the variance already explained, while dealing with different number of variables in both sides. We also discuss how to address the multiplicity of tests, proposing an admissible test that is not conservative, and provide a complete algorithm for permutation inference for CCA.

Submitted 17 June, 2020; v1 submitted 23 February, 2020; originally announced February 2020.

Comments: 49 pages, 2 figures, 10 tables, 3 algorithms, 119 references

arXiv:2001.01805 [pdf, other]

Geodesically parameterized covariance estimation

Authors: Antoni Musolas, Steven T. Smith, Youssef Marzouk

Abstract: Statistical modeling of spatiotemporal phenomena often requires selecting a covariance matrix from a covariance class. Yet standard parametric covariance families can be insufficiently flexible for practical applications, while non-parametric approaches may not easily allow certain kinds of prior knowledge to be incorporated. We propose instead to build covariance families out of geodesic curves. These covariances offer more flexibility for problem-specific tailoring than classical parametric families, and are preferable to simple convex combinations. Once the covariance family has been chosen, one typically needs to select a representative member by solving an optimization problem, e.g., by maximizing the likelihood of a data set. We consider instead a differential geometric interpretation of this problem: minimizing the geodesic distance to a sample covariance matrix ("natural projection"). Our approach is consistent with the notion of distance employed to build the covariance family and does not require assuming a particular probability distribution for the data. We show that natural projection and maximum likelihood estimation within the covariance family are locally equivalent up to second order. We also demonstrate that natural projection may yield more accurate estimates with noise-corrupted data.

Submitted 23 December, 2020; v1 submitted 6 January, 2020; originally announced January 2020.

arXiv:1909.06134 [pdf, other]

Deep Adversarial Belief Networks

Authors: Yuming Huang, Ashkan Panahi, Hamid Krim, Yiyi Yu, Spencer L. Smith

Abstract: We present a novel adversarial framework for training deep belief networks (DBNs), which includes replacing the generator network in the methodology of generative adversarial networks (GANs) with a DBN and developing a highly parallelizable numerical algorithm for training the resulting architecture in a stochastic manner. Unlike the existing techniques, this framework can be applied to the most general form of DBNs with no requirement for back propagation. As such, it lays a new foundation for developing DBNs on a par with GANs with various regularization units, such as pooling and normalization. Foregoing back-propagation, our framework also exhibits superior scalability as compared to other DBN and GAN learning techniques. We present a number of numerical experiments in computer vision as well as neurosciences to illustrate the main advantages of our approach.

Submitted 25 September, 2019; v1 submitted 13 September, 2019; originally announced September 2019.

arXiv:1908.09482 [pdf, ps, other]

doi 10.1080/10618600.2020.1807996

Marginally-calibrated deep distributional regression

Authors: Nadja Klein, David J. Nott, Michael Stanley Smith

Abstract: Deep neural network (DNN) regression models are widely used in applications requiring state-of-the-art predictive accuracy. However, until recently there has been little work on accurate uncertainty quantification for predictions from such models. We add to this literature by outlining an approach to constructing predictive distributions that are `marginally calibrated'. This is where the long run average of the predictive distributions of the response variable matches the observed empirical margin. Our approach considers a DNN regression with a conditionally Gaussian prior for the final layer weights, from which an implicit copula process on the feature space is extracted. This copula process is combined with a non-parametrically estimated marginal distribution for the response. The end result is a scalable distributional DNN regression method with marginally calibrated predictions, and our work complements existing methods for probability calibration. The approach is first illustrated using two applications of dense layer feed-forward neural networks. However, our main motivating applications are in likelihood-free inference, where distributional deep regression is used to estimate marginal posterior distributions. In two complex ecological time series examples we employ the implicit copulas of convolutional networks, and show that marginal calibration results in improved uncertainty quantification. Our approach also avoids the need for manual specification of summary statistics, a requirement that is burdensome for users and typical of competing likelihood-free inference methods.

Submitted 3 September, 2020; v1 submitted 26 August, 2019; originally announced August 2019.

Journal ref: Journal of Computational and Graphical Statistics (2020)

arXiv:1907.04530 [pdf, ps, other]

doi 10.1111/biom.13355

Bayesian Variable Selection for Non-Gaussian Responses: A Marginally Calibrated Copula Approach

Authors: Nadja Klein, Michael Stanley Smith

Abstract: We propose a new highly flexible and tractable Bayesian approach to undertake variable selection in non-Gaussian regression models. It uses a copula decomposition for the joint distribution of observations on the dependent variable. This allows the marginal distribution of the dependent variable to be calibrated accurately using a nonparametric or other estimator. The family of copulas employed are `implicit copulas' that are constructed from existing hierarchical Bayesian models widely used for variable selection, and we establish some of their properties. Even though the copulas are high-dimensional, they can be estimated efficiently and quickly using Markov chain Monte Carlo (MCMC). A simulation study shows that when the responses are non-Gaussian the approach selects variables more accurately than contemporary benchmarks. A real data example in the Web Appendix illustrates that accounting for even mild deviations from normality can lead to a substantial increase in accuracy. To illustrate the full potential of our approach we extend it to spatial variable selection for fMRI. Using real data, we show our method allows for voxel-specific marginal calibration of the magnetic resonance signal at over 6,000 voxels, leading to an increase in the quality of the activation maps.

Submitted 3 September, 2020; v1 submitted 10 July, 2019; originally announced July 2019.

Journal ref: Biometrics (2020)

arXiv:1907.04529 [pdf, ps, other]

doi 10.1080/07350015.2020.1721295

Bayesian Inference for Regression Copulas

Authors: Michael Stanley Smith, Nadja Klein

Abstract: We propose a new semi-parametric distributional regression smoother that is based on a copula decomposition of the joint distribution of the vector of response values. The copula is high-dimensional and constructed by inversion of a pseudo regression, where the conditional mean and variance are semi-parametric functions of covariates modeled using regularized basis functions. By integrating out the basis coefficients, an implicit copula process on the covariate space is obtained, which we call a `regression copula'. We combine this with a non-parametric margin to define a copula model, where the entire distribution - including the mean and variance - of the response is a smooth semi-parametric function of the covariates. The copula is estimated using both Hamiltonian Monte Carlo and variational Bayes; the latter of which is scalable to high dimensions. Using real data examples and a simulation study we illustrate the efficacy of these estimators and the copula model. In a substantive example, we estimate the distribution of half-hourly electricity spot prices as a function of demand and two time covariates using radial bases and horseshoe regularization. The copula model produces distributional estimates that are locally adaptive with respect to the covariates, and predictions that are more accurate than those from benchmark models.

Submitted 24 January, 2020; v1 submitted 10 July, 2019; originally announced July 2019.

Comments: Journal of Business & Economic Statistics (2020)

arXiv:1906.03318 [pdf, other]

Efficient non-conjugate Gaussian process factor models for spike count data using polynomial approximations

Authors: Stephen L. Keeley, David M. Zoltowski, Yiyi Yu, Jacob L. Yates, Spencer L. Smith, Jonathan W. Pillow

Abstract: Gaussian Process Factor Analysis (GPFA) has been broadly applied to the problem of identifying smooth, low-dimensional temporal structure underlying large-scale neural recordings. However, spike trains are non-Gaussian, which motivates combining GPFA with discrete observation models for binned spike count data. The drawback to this approach is that GPFA priors are not conjugate to count model likelihoods, which makes inference challenging. Here we address this obstacle by introducing a fast, approximate inference method for non-conjugate GPFA models. Our approach uses orthogonal second-order polynomials to approximate the nonlinear terms in the non-conjugate log-likelihood, resulting in a method we refer to as \textit{polynomial approximate log-likelihood} (PAL) estimators. This approximation allows for accurate closed-form evaluation of marginal likelihoods and fast numerical optimization for parameters and hyperparameters. We derive PAL estimators for GPFA models with binomial, Poisson, and negative binomial observations and find the PAL estimation is highly accurate, and achieves faster convergence times compared to existing state-of-the-art inference methods. We also find that PAL hyperparameters can provide sensible initialization for black box variational inference (BBVI), which improves BBVI accuracy. We demonstrate that PAL estimators achieve fast and accurate extraction of latent structure from multi-neuron spike train data.

Submitted 5 October, 2020; v1 submitted 7 June, 2019; originally announced June 2019.

arXiv:1905.03776 [pdf, other]

The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study

Authors: Daniel S. Park, Jascha Sohl-Dickstein, Quoc V. Le, Samuel L. Smith

Abstract: We investigate how the final parameters found by stochastic gradient descent are influenced by over-parameterization. We generate families of models by increasing the number of channels in a base network, and then perform a large hyper-parameter search to study how the test error depends on learning rate, batch size, and network width. We find that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions. In the absence of batch normalization, the optimal normalized noise scale is directly proportional to width. Wider networks, with their higher optimal noise scale, also achieve higher test accuracy. These observations hold for MLPs, ConvNets, and ResNets, and for two different parameterization schemes ("Standard" and "NTK"). We observe a similar trend with batch normalization for ResNets. Surprisingly, since the largest stable learning rate is bounded, the largest batch size consistent with the optimal normalized noise scale decreases as the width increases.

Submitted 9 May, 2019; originally announced May 2019.

Comments: 17 pages, 3 tables, 17 figures; accepted to ICML 2019

arXiv:1904.07495 [pdf, ps, other]

High-dimensional copula variational approximation through transformation

Authors: Michael Stanley Smith, Ruben Loaiza-Maya, David J. Nott

Abstract: Variational methods are attractive for computing Bayesian inference for highly parametrized models and large datasets where exact inference is impractical. They approximate a target distribution - either the posterior or an augmented posterior - using a simpler distribution that is selected to balance accuracy with computational feasibility. Here we approximate an element-wise parametric transformation of the target distribution as multivariate Gaussian or skew-normal. Approximations of this kind are implicit copula models for the original parameters, with a Gaussian or skew-normal copula function and flexible parametric margins. A key observation is that their adoption can improve the accuracy of variational inference in high dimensions at limited or no additional computational cost. We consider the Yeo-Johnson and G&H transformations, along with sparse factor structures for the scale matrix of the Gaussian or skew-normal. We also show how to implement efficient reparametrization gradient methods for these copula-based approximations. The efficacy of the approach is illustrated by computing posterior inference for three different models using six real datasets. In each case, we show that our proposed copula model distributions are more accurate variational approximations than Gaussian or skew-normal distributions, but at only a minor or no increase in computational cost.

Submitted 20 November, 2019; v1 submitted 16 April, 2019; originally announced April 2019.

arXiv:1812.10800 [pdf, other]

doi 10.1007/s12561-018-09228-w

Practical Considerations for Data Collection and Management in Mobile Health Micro-randomized Trials

Authors: Nicholas J. Seewald, Shawna N. Smith, Andy Jinseok Lee, Predrag Klasnja, Susan A. Murphy

Abstract: There is a growing interest in leveraging the prevalence of mobile technology to improve health by delivering momentary, contextualized interventions to individuals' smartphones. A just-in-time adaptive intervention (JITAI) adjusts to an individual's changing state and/or context to provide the right treatment, at the right time, in the right place. Micro-randomized trials (MRTs) allow for the collection of data which aid in the construction of an optimized JITAI by sequentially randomizing participants to different treatment options at each of many decision points throughout the study. Often, this data is collected passively using a mobile phone. To assess the causal effect of treatment on a near-term outcome, care must be taken when designing the data collection system to ensure it is of appropriately high quality. Here, we make several recommendations for collecting and managing data from an MRT. We provide advice on selecting which features to collect and when, choosing between "agents" to implement randomization, identifying sources of missing data, and overcoming other novel challenges. The recommendations are informed by our experience with HeartSteps, an MRT designed to test the effects of an intervention aimed at increasing physical activity in sedentary adults. We also provide a checklist which can be used in designing a data collection system so that scientists can focus more on their questions of interest, and less on cleaning data.

Submitted 27 December, 2018; originally announced December 2018.

Comments: Author accepted manuscript

arXiv:1811.03578 [pdf, other]

The ASCCR Frame for Learning Essential Collaboration Skills

Authors: Eric A. Vance, Heather S. Smith

Abstract: Statistics and data science are especially collaborative disciplines that typically require practitioners to interact with many different people or groups. Consequently, interdisciplinary collaboration skills are part of the personal and professional skills essential for success as an applied statistician or data scientist. These skills are learnable and teachable, and learning and improving collaboration skills provides a way to enhance one's practice of statistics and data science. To help individuals learn these skills and organizations to teach them, we have developed a framework covering five essential components of statistical collaboration: Attitude, Structure, Content, Communication, and Relationship. We call this the ASCCR Frame. This framework can be incorporated into formal training programs in the classroom or on the job and can also be used by individuals through self-study. We show how this framework can be applied specifically to statisticians and data scientists to improve their collaboration skills and their interdisciplinary impact. We believe that the ASCCR Frame can help organize and stimulate research and teaching in interdisciplinary collaboration and call on individuals and organizations to begin generating evidence regarding its effectiveness.

Submitted 30 August, 2019; v1 submitted 8 November, 2018; originally announced November 2018.

Comments: 12 pages, 1 figure. Updated to this Version 5 by adding a few more references, discussing how to teach ASCCR in the classroom, calling on others to add to research supporting the use of the ASCCR Frame, and adding discussion of ethics and reproducible research

arXiv:1806.09597 [pdf, other]

Stochastic natural gradient descent draws posterior samples in function space

Authors: Samuel L. Smith, Daniel Duckworth, Semon Rezchikov, Quoc V. Le, Jascha Sohl-Dickstein

Abstract: Recent work has argued that stochastic gradient descent can approximate the Bayesian uncertainty in model parameters near local minima. In this work we develop a similar correspondence for minibatch natural gradient descent (NGD). We prove that for sufficiently small learning rates, if the model predictions on the training set approach the true conditional distribution of labels given inputs, the stationary distribution of minibatch NGD approaches a Bayesian posterior near local minima. The temperature $T = εN / (2B)$ is controlled by the learning rate $ε$, training set size $N$ and batch size $B$. However minibatch NGD is not parameterisation invariant and it does not sample a valid posterior away from local minima. We therefore propose a novel optimiser, "stochastic NGD", which introduces the additional correction terms required to preserve both properties.

Submitted 28 November, 2018; v1 submitted 25 June, 2018; originally announced June 2018.

Comments: Workshop on Bayesian Deep Learning (NeurIPS 2018)

arXiv:1804.10397 [pdf, other]

doi 10.1214/18-BA1138

Implicit Copulas from Bayesian Regularized Regression Smoothers

Authors: Nadja Klein, Michael Stanley Smith

Abstract: We show how to extract the implicit copula of a response vector from a Bayesian regularized regression smoother with Gaussian disturbances. The copula can be used to compare smoothers that employ different shrinkage priors and function bases. We illustrate with three popular choices of shrinkage priors --- a pairwise prior, the horseshoe prior and a g prior augmented with a point mass as employed for Bayesian variable selection --- and both univariate and multivariate function bases. The implicit copulas are high-dimensional, have flexible dependence structures that are far from that of a Gaussian copula, and are unavailable in closed form. However, we show how they can be evaluated by first constructing a Gaussian copula conditional on the regularization parameters, and then integrating over these. Combined with non-parametric margins the regularized smoothers can be used to model the distribution of non-Gaussian univariate responses conditional on the covariates. Efficient Markov chain Monte Carlo schemes for evaluating the copula are given for this case. Using both simulated and real data, we show how such copula smoothing models can improve the quality of resulting function estimates and predictive distributions.

Submitted 14 May, 2018; v1 submitted 27 April, 2018; originally announced April 2018.

Journal ref: Bayesian Anal. 14 (2019), no. 4, 1143--1171

arXiv:1804.08218 [pdf, ps, other]

Econometric Modeling of Regional Electricity Spot Prices in the Australian Market

Authors: Michael Stanley Smith, Thomas S. Shively

Abstract: Wholesale electricity markets are increasingly integrated via high voltage interconnectors, and inter-regional trade in electricity is growing. To model this, we consider a spatial equilibrium model of price formation, where constraints on inter-regional flows result in three distinct equilibria in prices. We use this to motivate an econometric model for the distribution of observed electricity spot prices that captures many of their unique empirical characteristics. The econometric model features supply and inter-regional trade cost functions, which are estimated using Bayesian monotonic regression smoothing methodology. A copula multivariate time series model is employed to capture additional dependence -- both cross-sectional and serial -- in regional prices. The marginal distributions are nonparametric, with means given by the regression means. The model has the advantage of preserving the heavy right-hand tail in the predictive densities of price. We fit the model to half-hourly spot price data in the five interconnected regions of the Australian national electricity market. The fitted model is then used to measure how both supply and price shocks in one region are transmitted to the distribution of prices in all regions in subsequent periods. Finally, to validate our econometric model, we show that prices forecast using the proposed model compare favorably with those from some benchmark alternatives.

Submitted 22 April, 2018; originally announced April 2018.

Comments: Key Words: Bayesian Monotonic Function Estimation, Intraday Electricity Prices, Copula Time Series Model. JEL: C11, C14, C32, C53

arXiv:1801.09319 [pdf]

doi 10.1063/1.5023802

Less is more: sampling chemical space with active learning

Authors: Justin S. Smith, Ben Nebgen, Nicholas Lubbers, Olexandr Isayev, Adrian E. Roitberg

Abstract: The development of accurate and transferable machine learning (ML) potentials for predicting molecular energetics is a challenging task. The process of data generation to train such ML potentials is a task neither well understood nor researched in detail. In this work, we present a fully automated approach for the generation of datasets with the intent of training universal ML potentials. It is based on the concept of active learning (AL) via Query by Committee (QBC), which uses the disagreement between an ensemble of ML potentials to infer the reliability of the ensemble's prediction. QBC allows the presented AL algorithm to automatically sample regions of chemical space where the ML potential fails to accurately predict the potential energy. AL improves the overall fitness of ANAKIN-ME (ANI) deep learning potentials in rigorous test cases by mitigating human biases in deciding what new training data to use. AL also reduces the training set size to a fraction of the data required when using naive random sampling techniques. To provide validation of our AL approach we develop the COMP6 benchmark (publicly available on GitHub), which contains a diverse set of organic molecules. Through the AL process, it is shown that the AL-based potentials perform as well as the ANI-1 potential on COMP6 with only 10% of the data, and vastly outperforms ANI-1 with 25% the amount of data. Finally, we show that our proposed AL technique develops a universal ANI potential (ANI-1x) that provides accurate energy and force predictions on the entire COMP6 benchmark. This universal ML potential achieves a level of accuracy on par with the best ML potentials for single molecule or materials, while remaining applicable to the general class of organic molecules comprised of the elements CHNO.

Submitted 9 April, 2018; v1 submitted 28 January, 2018; originally announced January 2018.

Comments: Accepted at J. Chem. Phys

Journal ref: J. Chem. Phys. 148, 241733 (2018)

arXiv:1712.09150 [pdf, ps, other]

Variational Bayes Estimation of Discrete-Margined Copula Models with Application to Time Series

Authors: Ruben Loaiza-Maya, Michael Stanley Smith

Abstract: We propose a new variational Bayes estimator for high-dimensional copulas with discrete, or a combination of discrete and continuous, margins. The method is based on a variational approximation to a tractable augmented posterior, and is faster than previous likelihood-based approaches. We use it to estimate drawable vine copulas for univariate and multivariate Markov ordinal and mixed time series. These have dimension $rT$, where $T$ is the number of observations and $r$ is the number of series, and are difficult to estimate using previous methods. The vine pair-copulas are carefully selected to allow for heteroskedasticity, which is a feature of most ordinal time series data. When combined with flexible margins, the resulting time series models also allow for other common features of ordinal data, such as zero inflation, multiple modes and under- or over-dispersion. Using six example series, we illustrate both the flexibility of the time series copula models, and the efficacy of the variational Bayes estimator for copulas of up to 792 dimensions and 60 parameters. This far exceeds the size and complexity of copula models for discrete data that can be estimated using previous methods.

Submitted 20 July, 2018; v1 submitted 25 December, 2017; originally announced December 2017.

arXiv:1711.00489 [pdf, other]

Don't Decay the Learning Rate, Increase the Batch Size

Authors: Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le

Abstract: It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times. We can further reduce the number of parameter updates by increasing the learning rate $ε$ and scaling the batch size $B \propto ε$. Finally, one can increase the momentum coefficient $m$ and scale $B \propto 1/(1-m)$, although this tends to slightly reduce the test accuracy. Crucially, our techniques allow us to repurpose existing training schedules for large batch training with no hyper-parameter tuning. We train ResNet-50 on ImageNet to $76.1\%$ validation accuracy in under 30 minutes.

Submitted 23 February, 2018; v1 submitted 1 November, 2017; originally announced November 2017.

Comments: 11 pages, 8 figures. Published as a conference paper at ICLR 2018

arXiv:1710.06451 [pdf, other]

A Bayesian Perspective on Generalization and Stochastic Gradient Descent

Authors: Samuel L. Smith, Quoc V. Le

Abstract: We consider two questions at the heart of machine learning; how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? Our work responds to Zhang et al. (2016), who showed deep neural networks can easily memorize randomly labeled training data, despite generalizing well on real labels of the same inputs. We show that the same phenomenon occurs in small linear models. These observations are explained by the Bayesian evidence, which penalizes sharp minima but is invariant to model parameterization. We also demonstrate that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy. We propose that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, we identify the "noise scale" $g = ε(\frac{N}{B} - 1) \approx εN/B$, where $ε$ is the learning rate, $N$ the training set size and $B$ the batch size. Consequently the optimum batch size is proportional to both the learning rate and the size of the training set, $B_{opt} \propto εN$. We verify these predictions empirically.

Submitted 14 February, 2018; v1 submitted 17 October, 2017; originally announced October 2017.

Comments: 13 pages, 9 figures. Published as a conference paper at ICLR 2018

arXiv:1710.00017 [pdf, other]

doi 10.1063/1.5011181

Hierarchical modeling of molecular energies using a deep neural network

Authors: Nicholas Lubbers, Justin S. Smith, Kipton Barros

Abstract: We introduce the Hierarchically Interacting Particle Neural Network (HIP-NN) to model molecular properties from datasets of quantum calculations. Inspired by a many-body expansion, HIP-NN decomposes properties, such as energy, as a sum over hierarchical terms. These terms are generated from a neural network--a composition of many nonlinear transformations--acting on a representation of the molecule. HIP-NN achieves state-of-the-art performance on a dataset of 131k ground state organic molecules, and predicts energies with 0.26 kcal/mol mean absolute error. With minimal tuning, our model is also competitive on a dataset of molecular dynamics trajectories. In addition to enabling accurate energy predictions, the hierarchical structure of HIP-NN helps to identify regions of model uncertainty.

Submitted 29 September, 2017; originally announced October 2017.

arXiv:1701.07152 [pdf, other]

Time Series Copulas for Heteroskedastic Data

Authors: Rubén Loaiza-Maya, Michael S. Smith, Worapree Maneesoonthorn

Abstract: We propose parametric copulas that capture serial dependence in stationary heteroskedastic time series. We develop our copula for first order Markov series, and extend it to higher orders and multivariate series. We derive the copula of a volatility proxy, based on which we propose new measures of volatility dependence, including co-movement and spillover in multivariate series. In general, these depend upon the marginal distributions of the series. Using exchange rate returns, we show that the resulting copula models can capture their marginal distributions more accurately than univariate and multivariate GARCH models, and produce more accurate value at risk forecasts.

Submitted 24 January, 2017; originally announced January 2017.

Showing 1–50 of 60 results for author: Smith, S