-
Towards a turnkey approach to unbiased Monte Carlo estimation of smooth functions of expectations
Authors:
Nicolas Chopin,
Francesca R. Crucinio,
Sumeetpal S. Singh
Abstract:
Given a smooth function $f$, we develop a general approach to turn Monte Carlo samples with expectation $m$ into an unbiased estimate of $f(m)$. Specifically, we develop estimators that are based on randomly truncating the Taylor series expansion of $f$ and estimating the coefficients of the truncated series. We derive their properties and propose a strategy to set their tuning parameters -- which depend on $m$ -- automatically, with a view to make the whole approach simple to use. We develop our methods for the specific functions $f(x)=\log x$ and $f(x)=1/x$, as they arise in several statistical applications such as maximum likelihood estimation of latent variable models and Bayesian inference for un-normalised models. Detailed numerical studies are performed for a range of applications to determine how competitive and reliable the proposed approach is.
Submitted 12 April, 2024; v1 submitted 29 March, 2024;
originally announced March 2024.
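As a rough illustration of the random-truncation idea in the abstract above, the sketch below applies it to $f(x)=1/x$ through the geometric series $1/m = c^{-1}\sum_{k\ge 0}(1-m/c)^k$, which is valid when $|1-m/c|<1$. It is a generic randomly truncated (Russian-roulette-style) estimator, not the authors' estimator or their tuning rule; the name `unbiased_inverse`, the constant `c`, and the continuation probability `r` are assumptions made for this example.

```python
import random


def unbiased_inverse(draw, c, r, rng):
    """Return one unbiased estimate of 1/m, where m = E[draw(rng)].

    Assumes |1 - X/c| is small enough for the series to converge and
    for the variance to be finite (an assumption of this sketch).
    """
    total = 0.0
    weight = 1.0  # running value of prod_i (1 - X_i / c) / r**k
    while True:
        total += weight
        # Continue to the next term with probability r, so P(K >= k) = r**k.
        if rng.random() >= r:
            break
        # A fresh, independent sample estimates one more factor (1 - m/c);
        # dividing by r corrects for the random truncation.
        weight *= (1.0 - draw(rng) / c) / r
    return total / c


if __name__ == "__main__":
    rng = random.Random(0)
    draw = lambda g: g.uniform(1.0, 3.0)  # toy samples with mean m = 2
    n = 100_000
    mean = sum(unbiased_inverse(draw, 3.0, 0.8, rng) for _ in range(n)) / n
    print(mean)  # should be close to 1/m = 0.5
```

Each series term uses its own independent sample, so the expectation of the $k$-th product is $(1-m/c)^k$; reweighting by $1/r^k$ offsets the probability of reaching that term. Here `c` and `r` are fixed by hand, whereas the abstract describes setting such tuning parameters automatically.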
-
Hallmarks of Optimization Trajectories in Neural Networks: Directional Exploration and Redundancy
Authors:
Sidak Pal Singh,
Bobby He,
Thomas Hofmann,
Bernhard Schölkopf
Abstract:
We propose a fresh take on understanding the mechanisms of neural networks by analyzing the rich directional structure of optimization trajectories, represented by their pointwise parameters. Towards this end, we introduce some natural notions of the complexity of optimization trajectories, both qualitative and quantitative, which hallmark the directional nature of optimization in neural networks: when is there redundancy, and when exploration. We use them to reveal the inherent nuance and interplay involved between various optimization choices, such as momentum and weight decay. Further, the trajectory perspective helps us see the effect of scale on regularizing the directional nature of trajectories, and as a by-product, we also observe an intriguing heterogeneity of Q,K,V dynamics in the middle attention layers in LLMs and which is homogenized by scale. Importantly, we put the significant directional redundancy observed to the test by demonstrating that training only scalar batchnorm parameters some while into training matches the performance of training the entire network, which thus exhibits the potential of hybrid optimization schemes that are geared towards efficiency.
Submitted 24 June, 2024; v1 submitted 12 March, 2024;
originally announced March 2024.
-
Fourier Basis Density Model
Authors:
Alfredo De la Fuente,
Saurabh Singh,
Johannes Ballé
Abstract:
We introduce a lightweight, flexible and end-to-end trainable probability density model parameterized by a constrained Fourier basis. We assess its performance at approximating a range of multi-modal 1D densities, which are generally difficult to fit. In comparison to the deep factorized model introduced in [1], our model achieves a lower cross entropy at a similar computational budget. In addition, we also evaluate our method on a toy compression task, demonstrating its utility in learned compression.
Submitted 23 February, 2024;
originally announced February 2024.
-
Mixing time of the conditional backward sampling particle filter
Authors:
Joona Karjalainen,
Anthony Lee,
Sumeetpal S. Singh,
Matti Vihola
Abstract:
The conditional backward sampling particle filter (CBPF) is a powerful Markov chain Monte Carlo sampler for general state space hidden Markov model smoothing. It was proposed as an improvement over the conditional particle filter, which is known to have an $O(T^2)$ computational time complexity under a general `strong' mixing assumption, where $T$ is the time horizon. We provide the first proof that the CBPF admits an $O(T \log T)$ time complexity under strong mixing, complementing strong empirical evidence of the superiority of the CBPF in practice. In particular, the CBPF's mixing time is upper bounded by $O(\log T)$, for any sufficiently large number of particles $N$ that depends only on the mixing assumptions and not $T$. We show that an $O(\log T)$ mixing time is optimal. The proof involves the analysis of a novel coupling of two CBPFs, which involves a maximal coupling of two particle systems at each time instant. The coupling is implementable, and thus can also be used to construct unbiased, finite variance, estimates of functionals which have arbitrary dependence on the latent state's path, with a total expected cost of $O(T \log T)$. We also investigate other couplings, and we show some of these alternatives have improved empirical behaviour.
Submitted 22 February, 2024; v1 submitted 29 December, 2023;
originally announced December 2023.
-
Transformer Fusion with Optimal Transport
Authors:
Moritz Imfeld,
Jacopo Graldi,
Marco Giordano,
Thomas Hofmann,
Sotiris Anagnostidis,
Sidak Pal Singh
Abstract:
Fusion is a technique for merging multiple independently-trained neural networks in order to combine their capabilities. Past attempts have been restricted to the case of fully-connected, convolutional, and residual networks. This paper presents a systematic approach for fusing two or more transformer-based networks exploiting Optimal Transport to (soft-)align the various architectural components. We flesh out an abstraction for layer alignment, that can generalize to arbitrary architectures - in principle - and we apply this to the key ingredients of Transformers such as multi-head self-attention, layer-normalization, and residual connections, and we discuss how to handle them via various ablation studies. Furthermore, our method allows the fusion of models of different sizes (heterogeneous fusion), providing a new and efficient way to compress Transformers. The proposed approach is evaluated on both image classification tasks via Vision Transformer and natural language modeling tasks using BERT. Our approach consistently outperforms vanilla fusion, and, after a surprisingly short finetuning, also outperforms the individual converged parent models. In our analysis, we uncover intriguing insights about the significant role of soft alignment in the case of Transformers. Our results showcase the potential of fusing multiple Transformers, thus compounding their expertise, in the budding paradigm of model fusion and recombination. Code is available at https://github.com/graldij/transformer-fusion.
Submitted 22 April, 2024; v1 submitted 9 October, 2023;
originally announced October 2023.
-
On the Forgetting of Particle Filters
Authors:
Joona Karjalainen,
Anthony Lee,
Sumeetpal S. Singh,
Matti Vihola
Abstract:
We study the forgetting properties of the particle filter when its state - the collection of particles - is regarded as a Markov chain. Under a strong mixing assumption on the particle filter's underlying Feynman-Kac model, we find that the particle filter is exponentially mixing, and forgets its initial state in $O(\log N )$ `time', where $N$ is the number of particles and time refers to the number of particle filter algorithm steps, each comprising a selection (or resampling) and mutation (or prediction) operation. We present an example which suggests that this rate is optimal. In contrast to our result, available results to-date are extremely conservative, suggesting $O(α^N)$ time steps are needed, for some $α>1$, for the particle filter to forget its initialisation. We also study the conditional particle filter (CPF) and extend our forgetting result to this context. We establish a similar conclusion, namely, CPF is exponentially mixing and forgets its initial state in $O(\log N )$ time. To support this analysis, we establish new time-uniform $L^p$ error estimates for CPF, which can be of independent interest.
Submitted 15 September, 2023;
originally announced September 2023.
-
Spuriosity Didn't Kill the Classifier: Using Invariant Predictions to Harness Spurious Features
Authors:
Cian Eastwood,
Shashank Singh,
Andrei Liviu Nicolicioiu,
Marin Vlastelica,
Julius von Kügelgen,
Bernhard Schölkopf
Abstract:
To avoid failures on out-of-distribution data, recent works have sought to extract features that have an invariant or stable relationship with the label across domains, discarding "spurious" or unstable features whose relationship with the label changes across domains. However, unstable features often carry complementary information that could boost performance if used correctly in the test domain. In this work, we show how this can be done without test-domain labels. In particular, we prove that pseudo-labels based on stable features provide sufficient guidance for doing so, provided that stable and unstable features are conditionally independent given the label. Based on this theoretical insight, we propose Stable Feature Boosting (SFB), an algorithm for: (i) learning a predictor that separates stable and conditionally-independent unstable features; and (ii) using the stable-feature predictions to adapt the unstable-feature predictions in the test domain. Theoretically, we prove that SFB can learn an asymptotically-optimal predictor without test-domain labels. Empirically, we demonstrate the effectiveness of SFB on real and synthetic data.
Submitted 8 November, 2023; v1 submitted 19 July, 2023;
originally announced July 2023.
-
Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners
Authors:
Allen Z. Ren,
Anushri Dixit,
Alexandra Bodrova,
Sumeet Singh,
Stephen Tu,
Noah Brown,
Peng Xu,
Leila Takayama,
Fei Xia,
Jake Varley,
Zhenjia Xu,
Dorsa Sadigh,
Andy Zeng,
Anirudha Majumdar
Abstract:
Large language models (LLMs) exhibit a wide range of promising capabilities -- from step-by-step planning to commonsense reasoning -- that may provide utility for robots, but remain prone to confidently hallucinated predictions. In this work, we present KnowNo, which is a framework for measuring and aligning the uncertainty of LLM-based planners such that they know when they don't know and ask for help when needed. KnowNo builds on the theory of conformal prediction to provide statistical guarantees on task completion while minimizing human help in complex multi-step planning settings. Experiments across a variety of simulated and real robot setups that involve tasks with different modes of ambiguity (e.g., from spatial to numeric uncertainties, from human preferences to Winograd schemas) show that KnowNo performs favorably over modern baselines (which may involve ensembles or extensive prompt tuning) in terms of improving efficiency and autonomy, while providing formal assurances. KnowNo can be used with LLMs out of the box without model-finetuning, and suggests a promising lightweight approach to modeling uncertainty that can complement and scale with the growing capabilities of foundation models. Website: https://robot-help.github.io
Submitted 4 September, 2023; v1 submitted 4 July, 2023;
originally announced July 2023.
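The conformal-prediction step mentioned in the abstract above can be sketched with standard split conformal prediction. This is not the KnowNo implementation: the function names, the nonconformity score `1 - p`, and the rule "ask for help unless exactly one option remains" are assumptions made for illustration.

```python
import math


def conformal_threshold(cal_scores, eps):
    """Split-conformal quantile of calibration nonconformity scores.

    Each calibration score is assumed to be 1 - p(true option).
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1.0 - eps))  # 1-based rank of the quantile
    if rank > n:
        return float("inf")  # too few calibration points: keep every option
    return sorted(cal_scores)[rank - 1]


def prediction_set(option_probs, qhat):
    """Keep every option whose score 1 - p is within the threshold."""
    return [opt for opt, p in option_probs.items() if 1.0 - p <= qhat]


def plan_or_ask(option_probs, qhat):
    """Act if exactly one option survives, otherwise ask for help."""
    options = prediction_set(option_probs, qhat)
    return options[0] if len(options) == 1 else "ask for help"


if __name__ == "__main__":
    qhat = conformal_threshold([0.2, 0.6, 0.7], eps=0.5)
    print(plan_or_ask({"A": 0.8, "B": 0.1, "C": 0.1}, qhat))    # one clear option
    print(plan_or_ask({"A": 0.45, "B": 0.45, "C": 0.1}, qhat))  # ambiguous case
```

Under the usual exchangeability assumption, a set built this way contains the correct option with probability at least `1 - eps`, which is the kind of statistical guarantee the abstract refers to.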
-
PyBADS: Fast and robust black-box optimization in Python
Authors:
Gurjeet Sangra Singh,
Luigi Acerbi
Abstract:
PyBADS is a Python implementation of the Bayesian Adaptive Direct Search (BADS) algorithm for fast and robust black-box optimization (Acerbi and Ma 2017). BADS is an optimization algorithm designed to efficiently solve difficult optimization problems where the objective function is rough (non-convex, non-smooth), mildly expensive (e.g., the function evaluation requires more than 0.1 seconds), possibly noisy, and gradient information is unavailable. With BADS, these issues are well addressed, making it an excellent choice for fitting computational models using methods such as maximum-likelihood estimation. The algorithm scales efficiently to black-box functions with up to $D \approx 20$ continuous input parameters and supports bounds or no constraints. PyBADS comes along with an easy-to-use Pythonic interface for running the algorithm and inspecting its results. PyBADS only requires the user to provide a Python function for evaluating the target function, and optionally other constraints.
Extensive benchmarks on both artificial test problems and large real model-fitting problems drawn from cognitive, behavioral and computational neuroscience show that BADS performs on par with or better than many other common and state-of-the-art optimizers (Acerbi and Ma 2017), making it a general model-fitting tool which provides fast and robust solutions.
Submitted 27 June, 2023;
originally announced June 2023.
-
Optimization for truss design using Bayesian optimization
Authors:
Bhawani Sandeep,
Surjeet Singh,
Sumit Kumar
Abstract:
In this work, we present geometry optimization of a mechanical truss using computer-aided finite element analysis. The shape of the truss is a dominant factor in determining the load it can bear. Given a parameter space, our goal is to find the parameters of a hull that maximize the load-bearing capacity without yielding to the induced stress. We rely on finite element analysis, which is a computationally costly tool for design evaluation. For such expensive-to-evaluate functions, we chose Bayesian optimization as our optimization framework, which has empirically proven more sample-efficient than other simulation-based optimization methods.
With Bayesian optimization, truss design involves iteratively evaluating a set of candidate truss designs and updating a probabilistic model of the design space based on the results. The model is used to predict the performance of each candidate design, and the next candidate design is selected based on the prediction and an acquisition function that balances exploration and exploitation of the design space. Our result can be used as a baseline for future study on AI-based optimization in expensive engineering domains, especially in finite element analysis.
Submitted 1 July, 2023; v1 submitted 27 May, 2023;
originally announced June 2023.
-
The Hessian perspective into the Nature of Convolutional Neural Networks
Authors:
Sidak Pal Singh,
Thomas Hofmann,
Bernhard Schölkopf
Abstract:
While Convolutional Neural Networks (CNNs) have long been investigated and applied, as well as theorized, we aim to provide a slightly different perspective into their nature -- through the perspective of their Hessian maps. The reason is that the loss Hessian captures the pairwise interaction of parameters and therefore forms a natural ground to probe how the architectural aspects of CNN get manifested in its structure and properties. We develop a framework relying on Toeplitz representation of CNNs, and then utilize it to reveal the Hessian structure and, in particular, its rank. We prove tight upper bounds (with linear activations), which closely follow the empirical trend of the Hessian rank and hold in practice in more general settings. Overall, our work generalizes and establishes the key insight that, even in CNNs, the Hessian rank grows as the square root of the number of parameters.
Submitted 15 May, 2023;
originally announced May 2023.
-
Some Fundamental Aspects about Lipschitz Continuity of Neural Networks
Authors:
Grigory Khromov,
Sidak Pal Singh
Abstract:
Lipschitz continuity is a crucial functional property of any predictive model, that naturally governs its robustness, generalisation, as well as adversarial vulnerability. Contrary to other works that focus on obtaining tighter bounds and developing different practical strategies to enforce certain Lipschitz properties, we aim to thoroughly examine and characterise the Lipschitz behaviour of Neural Networks. Thus, we carry out an empirical investigation in a range of different settings (namely, architectures, datasets, label noise, and more) by exhausting the limits of the simplest and the most general lower and upper bounds. As a highlight of this investigation, we showcase a remarkable fidelity of the lower Lipschitz bound, identify a striking Double Descent trend in both upper and lower bounds to the Lipschitz and explain the intriguing effects of label noise on function smoothness and generalisation.
Submitted 14 May, 2024; v1 submitted 21 February, 2023;
originally announced February 2023.
-
Quasi-Newton Sequential Monte Carlo
Authors:
Samuel Duffield,
Sumeetpal S. Singh
Abstract:
Sequential Monte Carlo samplers represent a compelling approach to posterior inference in Bayesian models, due to being parallelisable and providing an unbiased estimate of the posterior normalising constant. In this work, we significantly accelerate sequential Monte Carlo samplers by adopting the L-BFGS Hessian approximation which represents the state-of-the-art in full-batch optimisation techniques. The L-BFGS Hessian approximation has only linear complexity in the parameter dimension and requires no additional posterior or gradient evaluations. The resulting sequential Monte Carlo algorithm is adaptive, parallelisable and well-suited to high-dimensional and multi-modal settings, which we demonstrate in numerical experiments on challenging posterior distributions.
Submitted 22 November, 2022;
originally announced November 2022.
-
Weighted Ensemble Self-Supervised Learning
Authors:
Yangjun Ruan,
Saurabh Singh,
Warren Morningstar,
Alexander A. Alemi,
Sergey Ioffe,
Ian Fischer,
Joshua V. Dillon
Abstract:
Ensembling has proven to be a powerful technique for boosting model performance, uncertainty estimation, and robustness in supervised learning. Advances in self-supervised learning (SSL) enable leveraging large unlabeled corpora for state-of-the-art few-shot and supervised learning performance. In this paper, we explore how ensemble methods can improve recent SSL techniques by developing a framework that permits data-dependent weighted cross-entropy losses. We refrain from ensembling the representation backbone; this choice yields an efficient ensemble method that incurs a small training cost and requires no architectural changes or computational overhead to downstream evaluation. The effectiveness of our method is demonstrated with two state-of-the-art SSL methods, DINO (Caron et al., 2021) and MSN (Assran et al., 2022). Our method outperforms both in multiple evaluation metrics on ImageNet-1K, particularly in the few-shot setting. We explore several weighting schemes and find that those which increase the diversity of ensemble heads lead to better downstream evaluation results. Thorough experiments yield improved prior art baselines which our method still surpasses; e.g., our overall improvement with MSN ViT-B/16 is 3.9 p.p. for 1-shot learning.
Submitted 9 April, 2023; v1 submitted 17 November, 2022;
originally announced November 2022.
-
Planning to the Information Horizon of BAMDPs via Epistemic State Abstraction
Authors:
Dilip Arumugam,
Satinder Singh
Abstract:
The Bayes-Adaptive Markov Decision Process (BAMDP) formalism pursues the Bayes-optimal solution to the exploration-exploitation trade-off in reinforcement learning. As the computation of exact solutions to Bayesian reinforcement-learning problems is intractable, much of the literature has focused on developing suitable approximation algorithms. In this work, before diving into algorithm design, we first define, under mild structural assumptions, a complexity measure for BAMDP planning. As efficient exploration in BAMDPs hinges upon the judicious acquisition of information, our complexity measure highlights the worst-case difficulty of gathering information and exhausting epistemic uncertainty. To illustrate its significance, we establish a computationally-intractable, exact planning algorithm that takes advantage of this measure to show more efficient planning. We then conclude by introducing a specific form of state abstraction that has the potential to reduce BAMDP complexity and gives rise to a computationally-tractable, approximate planning algorithm.
Submitted 30 October, 2022;
originally announced October 2022.
-
Shape-based Evaluation of Epidemic Forecasts
Authors:
Ajitesh Srivastava,
Satwant Singh,
Fiona Lee
Abstract:
Infectious disease forecasting for ongoing epidemics has been traditionally performed, communicated, and evaluated as numerical targets - 1, 2, 3, and 4 week ahead cases, deaths, and hospitalizations. While there is great value in predicting these numerical targets to assess the burden of the disease, we argue that there is also value in communicating the future trend (description of the shape) of the epidemic -- for instance, if the cases will remain flat or a surge is expected. To ensure what is being communicated is useful we need to be able to evaluate how well the predicted shape matches with the ground truth shape. Instead of treating this as a classification problem (one out of $n$ shapes), we define a transformation of the numerical forecasts into a ``shapelet''-space representation. In this representation, each dimension corresponds to the similarity of the shape with one of the shapes of interest (a shapelet). We prove that this representation satisfies the property that two shapes that one would consider similar are mapped close to each other, and vice versa. We demonstrate that our representation is able to reasonably capture the trends in COVID-19 cases and deaths time-series. With this representation, we define an evaluation measure and a measure of agreement among multiple models. We also define the shapelet-space ensemble of multiple models as the mean of their shapelet-space representations. We show that this ensemble is able to accurately predict the shape of the future trend for COVID-19 cases and trends. We also show that the agreement between models can provide a good indicator of the reliability of the forecast.
Submitted 11 November, 2022; v1 submitted 8 September, 2022;
originally announced September 2022.
-
Probable Domain Generalization via Quantile Risk Minimization
Authors:
Cian Eastwood,
Alexander Robey,
Shashank Singh,
Julius von Kügelgen,
Hamed Hassani,
George J. Pappas,
Bernhard Schölkopf
Abstract:
Domain generalization (DG) seeks predictors which perform well on unseen test distributions by leveraging data drawn from multiple related training distributions or domains. To achieve this, DG is commonly formulated as an average- or worst-case problem over the set of possible domains. However, predictors that perform well on average lack robustness while predictors that perform well in the worst case tend to be overly-conservative. To address this, we propose a new probabilistic framework for DG where the goal is to learn predictors that perform well with high probability. Our key idea is that distribution shifts seen during training should inform us of probable shifts at test time, which we realize by explicitly relating training and test domains as draws from the same underlying meta-distribution. To achieve probable DG, we propose a new optimization problem called Quantile Risk Minimization (QRM). By minimizing the $α$-quantile of predictor's risk distribution over domains, QRM seeks predictors that perform well with probability $α$. To solve QRM in practice, we propose the Empirical QRM (EQRM) algorithm and provide: (i) a generalization bound for EQRM; and (ii) the conditions under which EQRM recovers the causal predictor as $α\to 1$. In our experiments, we introduce a more holistic quantile-focused evaluation protocol for DG and demonstrate that EQRM outperforms state-of-the-art baselines on datasets from WILDS and DomainBed.
Submitted 22 August, 2023; v1 submitted 20 July, 2022;
originally announced July 2022.
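To make the quantile objective in the abstract above concrete, the sketch below scores candidate predictors by the empirical $α$-quantile of their per-domain risks. It is not the EQRM algorithm: it only compares fixed candidates using a plain order-statistic quantile, and the function names and the toy risk values are assumptions made for this example.

```python
import math


def empirical_quantile(values, alpha):
    """Smallest value v such that at least a fraction alpha of values are <= v."""
    ordered = sorted(values)
    rank = min(len(ordered), max(1, math.ceil(alpha * len(ordered))))  # 1-based
    return ordered[rank - 1]


def select_by_quantile_risk(domain_risks, alpha):
    """Pick the predictor whose alpha-quantile of per-domain risk is lowest."""
    return min(
        domain_risks,
        key=lambda name: empirical_quantile(domain_risks[name], alpha),
    )


if __name__ == "__main__":
    # Hypothetical per-domain risks for two candidate predictors.
    risks = {
        "good_on_most_domains": [0.1, 0.1, 0.1, 0.9],
        "uniformly_mediocre": [0.3, 0.3, 0.3, 0.3],
    }
    print(select_by_quantile_risk(risks, alpha=0.5))
    print(select_by_quantile_risk(risks, alpha=1.0))
```

Both toy candidates have the same average risk, so the choice depends only on `alpha`: as `alpha` approaches 1 the criterion moves towards the worst-case objective, while smaller values tolerate failures on a fraction of domains.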
-
De-biasing particle filtering for a continuous time hidden Markov model with a Cox process observation model
Authors:
Ruiyang Jin,
Sumeetpal S. Singh,
Nicolas Chopin
Abstract:
We develop a (nearly) unbiased particle filtering algorithm for a specific class of continuous-time state-space models, such that (a) the latent process $X_t$ is a linear Gaussian diffusion; and (b) the observations arise from a Poisson process with intensity $λ(X_t)$. The likelihood of the posterior probability density function of the latent process includes an intractable path integral. Our algorithm relies on Poisson estimates which approximate unbiasedly this integral. We show how we can tune these Poisson estimates to ensure that, with large probability, all but a few of the estimates generated by the algorithm are positive. Then replacing the negative estimates by zero leads to a much smaller bias than what would obtain through discretisation. We quantify the probability of negative estimates for certain special cases and show that our particle filter is effectively unbiased. We apply our method to a challenging 3D single molecule tracking example with a Born and Wolf observation model.
Submitted 30 June, 2022; v1 submitted 21 June, 2022;
originally announced June 2022.
-
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Authors:
Aarohi Srivastava,
Abhinav Rastogi,
Abhishek Rao,
Abu Awal Md Shoeb,
Abubakar Abid,
Adam Fisch,
Adam R. Brown,
Adam Santoro,
Aditya Gupta,
Adrià Garriga-Alonso,
Agnieszka Kluska,
Aitor Lewkowycz,
Akshat Agarwal,
Alethea Power,
Alex Ray,
Alex Warstadt,
Alexander W. Kocurek,
Ali Safaya,
Ali Tazarv,
Alice Xiang,
Alicia Parrish,
Allen Nie,
Aman Hussain,
Amanda Askell,
Amanda Dsouza
, et al. (426 additional authors not shown)
Abstract:
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Submitted 12 June, 2023; v1 submitted 9 June, 2022;
originally announced June 2022.
-
Indirect Active Learning
Authors:
Shashank Singh
Abstract:
Traditional models of active learning assume a learner can directly manipulate or query a covariate $X$ in order to study its relationship with a response $Y$. However, if $X$ is a feature of a complex system, it may be possible only to indirectly influence $X$ by manipulating a control variable $Z$, a scenario we refer to as Indirect Active Learning. Under a nonparametric model of Indirect Active Learning with a fixed budget, we study minimax convergence rates for estimating the relationship between $X$ and $Y$ locally at a point, obtaining different rates depending on the complexities and noise levels of the relationships between $Z$ and $X$ and between $X$ and $Y$. We also identify minimax rates for passive learning under comparable assumptions. In many cases, our results show that, while there is an asymptotic benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence.
Submitted 21 January, 2023; v1 submitted 3 June, 2022;
originally announced June 2022.
-
Conditional particle filters with bridge backward sampling
Authors:
Santeri Karppinen,
Sumeetpal S. Singh,
Matti Vihola
Abstract:
Conditional particle filters (CPFs) with backward/ancestor sampling are powerful methods for sampling from the posterior distribution of the latent states of a dynamic model such as a hidden Markov model. However, the performance of these methods deteriorates with models involving weakly informative observations and/or slowly mixing dynamics. Both of these complications arise when sampling finely time-discretised continuous-time path integral models, but can occur with hidden Markov models too. Multinomial resampling, which is commonly employed with CPFs, resamples excessively for weakly informative observations and thereby introduces extra variance. Furthermore, slowly mixing dynamics render the backward/ancestor sampling steps ineffective, leading to degeneracy issues. We detail two conditional resampling strategies suitable for the weakly informative regime: the so-called `killing' resampling and the systematic resampling with mean partial order. To avoid the degeneracy issues, we introduce a generalisation of the CPF with backward sampling that involves auxiliary `bridging' CPF steps that are parameterised by a blocking sequence. We present practical tuning strategies for choosing an appropriate blocking. Our experiments demonstrate that the CPF with a suitable resampling and the developed `bridge backward sampling' can lead to substantial efficiency gains in the weakly informative and slow mixing regime.
Submitted 19 June, 2023; v1 submitted 27 May, 2022;
originally announced May 2022.
-
Multilevel Bayesian Deep Neural Networks
Authors:
Neil K. Chada,
Ajay Jasra,
Kody J. H. Law,
Sumeetpal S. Singh
Abstract:
In this article we consider Bayesian inference associated to deep neural networks (DNNs) and, in particular, trace-class neural network (TNN) priors which were proposed by Sell et al. [39]. Such priors were developed as more robust alternatives to classical architectures in the context of inference problems. For this work we develop multilevel Monte Carlo (MLMC) methods for such models. MLMC is a popular variance reduction technique, with particular applications in Bayesian statistics and uncertainty quantification. We show how a particular advanced MLMC method that was introduced in [4] can be applied to Bayesian inference from DNNs and establish mathematically that the computational cost to achieve a particular mean square error, associated to posterior expectation computation, can be reduced by several orders versus more conventional techniques. To verify such results we provide numerous numerical experiments on model problems arising in machine learning. These include Bayesian regression, as well as Bayesian classification and reinforcement learning.
Submitted 20 July, 2022; v1 submitted 24 March, 2022;
originally announced March 2022.
-
On resampling schemes for particle filters with weakly informative observations
Authors:
Nicolas Chopin,
Sumeetpal S. Singh,
Tomás Soto,
Matti Vihola
Abstract:
We consider particle filters with weakly informative observations (or `potentials') relative to the latent state dynamics. The particular focus of this work is on particle filters to approximate time-discretisations of continuous-time Feynman--Kac path integral models -- a scenario that naturally arises when addressing filtering and smoothing problems in continuous time -- but our findings are indicative about weakly informative settings beyond this context too. We study the performance of different resampling schemes, such as systematic resampling, SSP (Srinivasan sampling process) and stratified resampling, as the time-discretisation becomes finer and also identify their continuous-time limit, which is expressed as a suitably defined `infinitesimal generator.' By contrasting these generators, we find that (certain modifications of) systematic and SSP resampling `dominate' stratified and independent `killing' resampling in terms of their limiting overall resampling rate. The reduced intensity of resampling manifests itself in lower variance in our numerical experiment. This efficiency result, through an ordering of the resampling rate, is new to the literature. The second major contribution of this work concerns the analysis of the limiting behaviour of the entire population of particles of the particle filter as the time discretisation becomes finer. We provide the first proof, under general conditions, that the particle approximation of the discretised continuous-time Feynman--Kac path integral models converges to a (uniformly weighted) continuous-time particle system.
Submitted 9 July, 2022; v1 submitted 18 March, 2022;
originally announced March 2022.
-
Phenomenology of Double Descent in Finite-Width Neural Networks
Authors:
Sidak Pal Singh,
Aurelien Lucchi,
Thomas Hofmann,
Bernhard Schölkopf
Abstract:
`Double descent' delineates the generalization behaviour of models depending on the regime they belong to: under- or over-parameterized. The current theoretical understanding behind the occurrence of this phenomenon is primarily based on linear and kernel regression models -- with informal parallels to neural networks via the Neural Tangent Kernel. Therefore such analyses do not adequately capture the mechanisms behind double descent in finite-width neural networks, and they also disregard crucial components -- such as the choice of the loss function. We address these shortcomings by leveraging influence functions in order to derive suitable expressions of the population loss and its lower bound, while imposing minimal assumptions on the form of the parametric model. Our derived bounds bear an intimate connection with the spectrum of the Hessian at the optimum, and importantly, exhibit a double descent behaviour at the interpolation threshold. Building on our analysis, we further investigate how the loss function affects double descent -- and thus uncover interesting properties of neural networks and their Hessian spectra near the interpolation threshold.
Submitted 14 March, 2022;
originally announced March 2022.
-
Interpretable Personalized Experimentation
Authors:
Han Wu,
Sarah Tan,
Weiwei Li,
Mia Garrard,
Adam Obeng,
Drew Dimmery,
Shaun Singh,
Hanson Wang,
Daniel Jiang,
Eytan Bakshy
Abstract:
Black-box heterogeneous treatment effect (HTE) models are increasingly being used to create personalized policies that assign individuals to their optimal treatments. However, they are difficult to understand, and can be burdensome to maintain in a production environment. In this paper, we present a scalable, interpretable personalized experimentation system, implemented and deployed in production at Meta. The system works in a multiple treatment, multiple outcome setting typical at Meta to: (1) learn explanations for black-box HTE models; (2) generate interpretable personalized policies. We evaluate the methods used in the system on publicly available data and Meta use cases, and discuss lessons learnt during the development of the system.
Submitted 5 August, 2022; v1 submitted 5 November, 2021;
originally announced November 2021.
-
Ensemble Kalman Inversion for General Likelihoods
Authors:
Samuel Duffield,
Sumeetpal S. Singh
Abstract:
In this letter we generalise Ensemble Kalman inversion techniques to general Bayesian models where previously they were restricted to additive Gaussian likelihoods - all in the difficult setting where the likelihood can be sampled from, but its density not necessarily evaluated.
Submitted 7 June, 2022; v1 submitted 6 October, 2021;
originally announced October 2021.
-
Bootstrapped Meta-Learning
Authors:
Sebastian Flennerhag,
Yannick Schroecker,
Tom Zahavy,
Hado van Hasselt,
David Silver,
Satinder Singh
Abstract:
Meta-learning empowers artificial intelligence to increase its efficiency by learning how to learn. Unlocking this potential involves overcoming a challenging meta-optimisation problem. We propose an algorithm that tackles this problem by letting the meta-learner teach itself. The algorithm first bootstraps a target from the meta-learner, then optimises the meta-learner by minimising the distance to that target under a chosen (pseudo-)metric. Focusing on meta-learning with gradients, we establish conditions that guarantee performance improvements and show that the metric can control meta-optimisation. Meanwhile, the bootstrapping mechanism can extend the effective meta-learning horizon without requiring backpropagation through all updates. We achieve a new state of the art for model-free agents on the Atari ALE benchmark and demonstrate that it yields both performance and efficiency gains in multi-task meta-learning. Finally, we explore how bootstrapping opens up new possibilities and find that it can meta-learn efficient exploration in an epsilon-greedy Q-learning agent, without backpropagating through the update rule.
Submitted 16 March, 2022; v1 submitted 9 September, 2021;
originally announced September 2021.
-
Optimal Binary Classification Beyond Accuracy
Authors:
Shashank Singh,
Justin Khim
Abstract:
The vast majority of statistical theory on binary classification characterizes performance in terms of accuracy. However, accuracy is known in many cases to poorly reflect the practical consequences of classification error, most famously in imbalanced binary classification, where data are dominated by samples from one of two classes. The first part of this paper derives a novel generalization of the Bayes-optimal classifier from accuracy to any performance metric computed from the confusion matrix. Specifically, this result (a) demonstrates that stochastic classifiers sometimes outperform the best possible deterministic classifier and (b) removes an empirically unverifiable absolute continuity assumption that is poorly understood but pervades existing results. We then demonstrate how to use this generalized Bayes classifier to obtain regret bounds in terms of the error of estimating regression functions under uniform loss. Finally, we use these results to develop some of the first finite-sample statistical guarantees specific to imbalanced binary classification. Specifically, we demonstrate that optimal classification performance depends on properties of class imbalance, such as a novel notion called Uniform Class Imbalance, that have not previously been formalized. We further illustrate these contributions numerically in the case of $k$-nearest neighbor classification.
Submitted 26 September, 2022; v1 submitted 4 July, 2021;
originally announced July 2021.
-
Analytic Insights into Structure and Rank of Neural Network Hessian Maps
Authors:
Sidak Pal Singh,
Gregor Bachmann,
Thomas Hofmann
Abstract:
The Hessian of a neural network captures parameter interactions through second-order derivatives of the loss. It is a fundamental object of study, closely tied to various problems in deep learning, including model design, optimization, and generalization. Most prior work has been empirical, typically focusing on low-rank approximations and heuristics that are blind to the network structure. In contrast, we develop theoretical tools to analyze the range of the Hessian map, providing us with a precise understanding of its rank deficiency as well as the structural reasons behind it. This yields exact formulas and tight upper bounds for the Hessian rank of deep linear networks, allowing for an elegant interpretation in terms of rank deficiency. Moreover, we demonstrate that our bounds remain faithful as an estimate of the numerical Hessian rank, for a larger class of models such as rectified and hyperbolic tangent networks. Further, we also investigate the implications of model architecture (e.g. width, depth, bias) on the rank deficiency. Overall, our work provides novel insights into the source and extent of redundancy in overparameterized networks.
Submitted 1 July, 2021; v1 submitted 30 June, 2021;
originally announced June 2021.
-
Limits of accuracy for parameter estimation and localisation in Single-Molecule Microscopy via sequential Monte Carlo methods
Authors:
A. Marie d'Avigneau,
S. S. Singh,
R. J. Ober
Abstract:
Assessing the quality of parameter estimates for models describing the motion of single molecules in cellular environments is an important problem in fluorescence microscopy. We consider the fundamental data model, where molecules emit photons at random times and the photons arrive at random locations on the detector according to complex point spread functions (PSFs). The random, non-Gaussian PSF of the detection process and random trajectory of the molecule make inference challenging. Moreover, the presence of other nearby molecules causes further uncertainty in the origin of the measurements, which impacts the statistical precision of estimates. We quantify the limits of accuracy of model parameter estimates and separation distance between closely spaced molecules (known as the resolution problem) by computing the Cramér-Rao lower bound (CRLB), or equivalently the inverse of the Fisher information matrix (FIM), for the variance of estimates. This fundamental CRLB is crucial, as it provides a lower bound for more practical scenarios. While analytic expressions for the FIM can be derived for static molecules, the analytical tools to evaluate it for molecules whose trajectories follow SDEs are still mostly missing. We address this by presenting a general SMC-based methodology for both parameter inference and computing the desired accuracy limits for non-static molecules and a non-Gaussian fundamental detection model. For the first time, we are able to estimate the FIM for stochastically moving molecules observed through the Airy and Born & Wolf PSF. This is achieved by estimating the score and observed information matrix via SMC. We summarise the qualitative behaviours of the accuracy limits as functions of, e.g., collected photon count and molecule diffusion. We also verify that we can recover known results from the static molecule case.
Submitted 14 September, 2021; v1 submitted 3 June, 2021;
originally announced June 2021.
-
Discovering Diverse Nearly Optimal Policies with Successor Features
Authors:
Tom Zahavy,
Brendan O'Donoghue,
Andre Barreto,
Volodymyr Mnih,
Sebastian Flennerhag,
Satinder Singh
Abstract:
Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose Diverse Successive Policies, a method for discovering policies that are diverse in the space of Successor Features, while assuring that they are near optimal. We formalize the problem as a Constrained Markov Decision Process (CMDP) where the goal is to find policies that maximize diversity, characterized by an intrinsic diversity reward, while remaining near-optimal with respect to the extrinsic reward of the MDP. We also analyze how recently proposed robustness and discrimination rewards perform and find that they are sensitive to the initialization of the procedure and may converge to sub-optimal solutions. To alleviate this, we propose new explicit diversity rewards that aim to minimize the correlation between the Successor Features of the policies in the set. We compare the different diversity mechanisms in the DeepMind Control Suite and find that the type of explicit diversity we are proposing is important for discovering distinct behavior, such as different locomotion patterns.
Submitted 4 January, 2022; v1 submitted 1 June, 2021;
originally announced June 2021.
-
Reward is enough for convex MDPs
Authors:
Tom Zahavy,
Brendan O'Donoghue,
Guillaume Desjardins,
Satinder Singh
Abstract:
Maximising a cumulative reward function that is Markov and stationary, i.e., defined over state-action pairs and independent of time, is sufficient to capture many kinds of goals in a Markov decision process (MDP). However, not all goals can be captured in this manner. In this paper we study convex MDPs in which goals are expressed as convex functions of the stationary distribution and show that they cannot be formulated using stationary reward functions. Convex MDPs generalize the standard reinforcement learning (RL) problem formulation to a larger framework that includes many supervised and unsupervised RL problems, such as apprenticeship learning, constrained MDPs, and so-called `pure exploration'. Our approach is to reformulate the convex MDP problem as a min-max game involving policy and cost (negative reward) `players', using Fenchel duality. We propose a meta-algorithm for solving this problem and show that it unifies many existing algorithms in the literature.
Submitted 2 June, 2023; v1 submitted 1 June, 2021;
originally announced June 2021.
-
Gradient-Based Markov Chain Monte Carlo for Bayesian Inference With Non-Differentiable Priors
Authors:
Jacob Vorstrup Goldman,
Torben Sell,
Sumeetpal Sidhu Singh
Abstract:
The use of non-differentiable priors in Bayesian statistics has become increasingly popular, in particular in Bayesian imaging analysis. Current state-of-the-art methods are approximate in the sense that they replace the posterior with a smooth approximation via Moreau-Yosida envelopes, and apply gradient-based discretized diffusions to sample from the resulting distribution. We characterize the error of the Moreau-Yosida approximation and propose a novel implementation using underdamped Langevin dynamics. In mission-critical cases, however, replacing the posterior with an approximation may not be a viable option. Instead, we show that Piecewise-Deterministic Markov Processes (PDMP) can be utilized for exact posterior inference from distributions satisfying almost everywhere differentiability. Furthermore, in contrast with diffusion-based methods, the suggested PDMP-based samplers place no assumptions on the prior shape, nor require access to a computationally cheap proximal operator, and consequently have a much broader scope of application. Through detailed numerical examples, including a non-differentiable circular distribution and a non-convex genomics model, we elucidate the relative strengths of these sampling methods on problems of moderate to high dimensions, underlining the benefits of PDMP-based methods when accurate sampling is decisive.
Submitted 16 March, 2021;
originally announced March 2021.
-
Spatiotemporal blocking of the bouncy particle sampler for efficient inference in state space models
Authors:
Jacob Vorstrup Goldman,
Sumeetpal Sidhu Singh
Abstract:
We propose a novel blocked version of the continuous-time bouncy particle sampler of [Bouchard-Côté et al., 2018] which is applicable to any differentiable probability density. This alternative implementation is motivated by blocked Gibbs sampling for state space models [Singh et al., 2017] and leads to significant improvement in terms of effective sample size per second, and furthermore, allows for significant parallelization of the resulting algorithm. The new algorithms are particularly efficient for latent state inference in high-dimensional state space models, where blocking in both space and time is necessary to avoid degeneracy of MCMC. The efficiency of our blocked bouncy particle sampler, in comparison with both the standard implementation of the bouncy particle sampler and the particle Gibbs algorithm of Andrieu et al. [2010], is illustrated numerically for both simulated data and a challenging real-world financial dataset.
Submitted 9 July, 2021; v1 submitted 8 January, 2021;
originally announced January 2021.
-
Trace-class Gaussian priors for Bayesian learning of neural networks with MCMC
Authors:
Torben Sell,
Sumeetpal S. Singh
Abstract:
This paper introduces a new neural network based prior for real valued functions on $\mathbb R^d$ which, by construction, is more easily and cheaply scaled up in the domain dimension $d$ compared to the usual Karhunen-Loève function space prior. The new prior is a Gaussian neural network prior, where each weight and bias has an independent Gaussian prior, but with the key difference that the variances decrease in the width of the network in such a way that the resulting function is \emph{almost surely} well defined in the limit of an infinite width network. We show that in a Bayesian treatment of inferring unknown functions, the induced posterior over functions is amenable to Monte Carlo sampling using Hilbert space Markov chain Monte Carlo (MCMC) methods. This type of MCMC is popular, e.g. in the Bayesian Inverse Problems literature, because it is stable under \emph{mesh refinement}, i.e. the acceptance probability does not shrink to $0$ as more parameters of the function's prior are introduced, even \emph{ad infinitum}. In numerical examples we demonstrate these stated competitive advantages over other function space priors. We also implement examples in Bayesian Reinforcement Learning to automate tasks from data and demonstrate, for the first time, stability of MCMC to mesh refinement for these types of problems.
Submitted 8 September, 2022; v1 submitted 20 December, 2020;
originally announced December 2020.
-
Online Particle Smoothing with Application to Map-matching
Authors:
Samuel Duffield,
Sumeetpal S. Singh
Abstract:
We introduce a novel method for online smoothing in state-space models that utilises a fixed-lag approximation to overcome the well known issue of path degeneracy. Unlike classical fixed-lag techniques that only approximate certain marginals, we introduce an online resampling algorithm, called particle stitching, that converts these marginal samples into a full posterior approximation. We demonstrate the utility of our method in the context of map-matching, the task of inferring a vehicle's trajectory given a road network and noisy GPS observations. We develop a new state-space model for the difficult task of map-matching on dense, urban road networks.
Submitted 2 August, 2021; v1 submitted 8 December, 2020;
originally announced December 2020.
-
A General Class of New Continuous Mixture Distribution and Application
Authors:
Brijesh P. Singh,
Sandeep Singh,
Utpal Dhar Das
Abstract:
Generalizing a distribution increases its flexibility, particularly when studying a phenomenon and its properties. Many generalizations of continuous univariate distributions are available in the literature. In this study, an investigation is conducted on a distribution and its generalization. Several available generalizations of the distribution are reviewed, and recent trends in the construction of generalized classes with a generalized mixing parameter are discussed. To check suitability and comparability, real data sets have been used.
Submitted 11 November, 2020;
originally announced November 2020.
-
Continuum-Armed Bandits: A Function Space Perspective
Authors:
Shashank Singh
Abstract:
Continuum-armed bandits (a.k.a., black-box or $0^{th}$-order optimization) involves optimizing an unknown objective function given an oracle that evaluates the function at a query point, with the goal of using as few query points as possible. In the most well-studied case, the objective function is assumed to be Lipschitz continuous and minimax rates of simple and cumulative regrets are known in both noiseless and noisy settings. This paper studies continuum-armed bandits under more general smoothness conditions, namely Besov smoothness conditions, on the objective function. In both noiseless and noisy conditions, we derive minimax rates under simple and cumulative regrets. Our results show that minimax rates over objective functions in a Besov space are identical to minimax rates over objective functions in the smallest Hölder space into which the Besov space embeds.
Submitted 21 March, 2021; v1 submitted 15 October, 2020;
originally announced October 2020.
-
Neither Private Nor Fair: Impact of Data Imbalance on Utility and Fairness in Differential Privacy
Authors:
Tom Farrand,
Fatemehsadat Mireshghallah,
Sahib Singh,
Andrew Trask
Abstract:
Deployment of deep learning in different fields and industries is growing day by day due to its performance, which relies on the availability of data and compute. Data is often crowd-sourced and contains sensitive information about its contributors, which leaks into models that are trained on it. To achieve rigorous privacy guarantees, differentially private training mechanisms are used. However, it has recently been shown that differential privacy can exacerbate existing biases in the data and have disparate impacts on the accuracy of different subgroups of data. In this paper, we aim to study these effects within differentially private deep learning. Specifically, we aim to study how different levels of imbalance in the data affect the accuracy and the fairness of the decisions made by the model, given different levels of privacy. We demonstrate that even small imbalances and loose privacy guarantees can cause disparate impacts.
Submitted 3 October, 2020; v1 submitted 10 September, 2020;
originally announced September 2020.
-
Reliable Post hoc Explanations: Modeling Uncertainty in Explainability
Authors:
Dylan Slack,
Sophie Hilgard,
Sameer Singh,
Himabindu Lakkaraju
Abstract:
As black box explanations are increasingly being employed to establish model credibility in high-stakes settings, it is important to ensure that these explanations are accurate and reliable. However, prior work demonstrates that explanations generated by state-of-the-art techniques are inconsistent, unstable, and provide very little insight into their correctness and reliability. In addition, these methods are also computationally inefficient, and require significant hyper-parameter tuning. In this paper, we address the aforementioned challenges by developing a novel Bayesian framework for generating local explanations along with their associated uncertainty. We instantiate this framework to obtain Bayesian versions of LIME and KernelSHAP which output credible intervals for the feature importances, capturing the associated uncertainty. The resulting explanations not only enable us to make concrete inferences about their quality (e.g., there is a 95% chance that the feature importance lies within the given range), but are also highly consistent and stable. We carry out a detailed theoretical analysis that leverages the aforementioned uncertainty to estimate how many perturbations to sample, and how to sample for faster convergence. This work makes the first attempt at addressing several critical issues with popular explanation methods in one shot, thereby generating consistent, stable, and reliable explanations with guarantees in a computationally efficient manner. Experimental evaluation with multiple real world datasets and user studies demonstrate the efficacy of the proposed framework.
Submitted 6 November, 2021; v1 submitted 11 August, 2020;
originally announced August 2020.
-
Zero-Shot Heterogeneous Transfer Learning from Recommender Systems to Cold-Start Search Retrieval
Authors:
Tao Wu,
Ellie Ka-In Chio,
Heng-Tze Cheng,
Yu Du,
Steffen Rendle,
Dima Kuzmin,
Ritesh Agarwal,
Li Zhang,
John Anderson,
Sarvjeet Singh,
Tushar Chandra,
Ed H. Chi,
Wen Li,
Ankit Kumar,
Xiang Ma,
Alex Soares,
Nitin Jindal,
Pei Cao
Abstract:
Many recent advances in neural information retrieval models, which predict top-K items given a query, learn directly from a large training set of (query, item) pairs. However, they are often insufficient when there are many previously unseen (query, item) combinations, often referred to as the cold start problem. Furthermore, the search system can be biased towards items that are frequently shown to a query previously, also known as the 'rich get richer' (a.k.a. feedback loop) problem. In light of these problems, we observed that most online content platforms have both a search and a recommender system that, while having heterogeneous input spaces, can be connected through their common output item space and a shared semantic representation. In this paper, we propose a new Zero-Shot Heterogeneous Transfer Learning framework that transfers learned knowledge from the recommender system component to improve the search component of a content platform. First, it learns representations of items and their natural-language features by predicting (item, item) correlation graphs derived from the recommender system as an auxiliary task. Then, the learned representations are transferred to solve the target search retrieval task, performing query-to-item prediction without having seen any (query, item) pairs in training. We conduct online and offline experiments on one of the world's largest search and recommender systems from Google, and present the results and lessons learned. We demonstrate that the proposed approach can achieve high performance on offline search retrieval tasks, and more importantly, achieved significant improvements on relevance and user interactions over the highly-optimized production system in online experiments.
Submitted 18 August, 2020; v1 submitted 6 August, 2020;
originally announced August 2020.
-
Interpretable Sequence Learning for COVID-19 Forecasting
Authors:
Sercan O. Arik,
Chun-Liang Li,
Jinsung Yoon,
Rajarishi Sinha,
Arkady Epshteyn,
Long T. Le,
Vikas Menon,
Shashank Singh,
Leyou Zhang,
Nate Yoder,
Martin Nikoltchev,
Yash Sonthalia,
Hootan Nakhost,
Elli Kanal,
Tomas Pfister
Abstract:
We propose a novel approach that integrates machine learning into compartmental disease modeling to predict the progression of COVID-19. Our model is explainable by design as it explicitly shows how different compartments evolve and it uses interpretable encoders to incorporate covariates and improve performance. Explainability is valuable to ensure that the model's forecasts are credible to epidemiologists and to instill confidence in end-users such as policy makers and healthcare institutions. Our model can be applied at different geographic resolutions, and here we demonstrate it for states and counties in the United States. We show that our model provides more accurate forecasts, in metrics averaged across the entire US, than state-of-the-art alternatives, and that it provides qualitatively meaningful explanatory insights. Lastly, we analyze the performance of our model for different subgroups based on the subgroup distributions within the counties.
Submitted 13 January, 2021; v1 submitted 3 August, 2020;
originally announced August 2020.
-
Meta-Gradient Reinforcement Learning with an Objective Discovered Online
Authors:
Zhongwen Xu,
Hado van Hasselt,
Matteo Hessel,
Junhyuk Oh,
Satinder Singh,
David Silver
Abstract:
Deep reinforcement learning includes a broad family of algorithms that parameterise an internal representation, such as a value function or policy, by a deep neural network. Each algorithm optimises its parameters with respect to an objective, such as Q-learning or policy gradient, that defines its semantics. In this work, we propose an algorithm based on meta-gradient descent that discovers its own objective, flexibly parameterised by a deep neural network, solely from interactive experience with its environment. Over time, this allows the agent to learn how to learn increasingly effectively. Furthermore, because the objective is discovered online, it can adapt to changes over time. We demonstrate that the algorithm discovers how to address several important issues in RL, such as bootstrapping, non-stationarity, and off-policy learning. On the Atari Learning Environment, the meta-gradient algorithm adapts over time to learn with greater efficiency, eventually outperforming the median score of a strong actor-critic baseline.
Submitted 16 July, 2020;
originally announced July 2020.
-
Anytime Parallel Tempering
Authors:
A. Marie d'Avigneau,
S. S. Singh,
L. M. Murray
Abstract:
Developing efficient MCMC algorithms is indispensable in Bayesian inference. In parallel tempering, multiple interacting MCMC chains run to more efficiently explore the state space and improve performance. The multiple chains advance independently through local moves, and the performance enhancement steps are exchange moves, where the chains pause to exchange their current sample amongst each other. To accelerate the independent local moves, they may be performed simultaneously on multiple processors. Another problem is then encountered: depending on the MCMC implementation and inference problem, local moves can take a varying and random amount of time to complete. There may also be infrastructure-induced variations, such as competing jobs on the same processors, which arise in cloud computing. Before exchanges can occur, all chains must complete the local moves they are engaged in to avoid introducing a potentially substantial bias (Proposition 2.1). To solve this issue of randomly varying local move completion times in multi-processor parallel tempering, we adopt the Anytime Monte Carlo framework of Murray et al. (2016): we impose real-time deadlines on the parallel local moves and perform exchanges at these deadlines without any processor idling. We show our methodology for exchanges at real-time deadlines does not introduce a bias and leads to significant performance enhancements over the naïve approach of idling until every processor's local moves complete. The methodology is then applied in an ABC setting, where an Anytime ABC parallel tempering algorithm is derived for the difficult task of estimating the parameters of a Lotka-Volterra predator-prey model, and similar efficiency enhancements are observed.
Submitted 14 September, 2021; v1 submitted 26 June, 2020;
originally announced June 2020.
-
Learning to Play No-Press Diplomacy with Best Response Policy Iteration
Authors:
Thomas Anthony,
Tom Eccles,
Andrea Tacchetti,
János Kramár,
Ian Gemp,
Thomas C. Hudson,
Nicolas Porcel,
Marc Lanctot,
Julien Pérolat,
Richard Everett,
Roman Werpachowski,
Satinder Singh,
Thore Graepel,
Yoram Bachrach
Abstract:
Recent advances in deep reinforcement learning (RL) have led to considerable progress in many 2-player zero-sum games, such as Go, Poker and Starcraft. The purely adversarial nature of such games allows for conceptually simple and principled application of RL methods. However real-world settings are many-agent, and agent interactions are complex mixtures of common-interest and competitive aspects. We consider Diplomacy, a 7-player board game designed to accentuate dilemmas resulting from many-agent interactions. It also features a large combinatorial action space and simultaneous moves, which are challenging for RL algorithms. We propose a simple yet effective approximate best response operator, designed to handle large combinatorial action spaces and simultaneous moves. We also introduce a family of policy iteration methods that approximate fictitious play. With these methods, we successfully apply RL to Diplomacy: we show that our agents convincingly outperform the previous state-of-the-art, and game theoretic equilibrium analysis shows that the new process yields consistent improvements.
Submitted 4 January, 2022; v1 submitted 8 June, 2020;
originally announced June 2020.
-
Image Augmentations for GAN Training
Authors:
Zhengli Zhao,
Zizhao Zhang,
Ting Chen,
Sameer Singh,
Han Zhang
Abstract:
Data augmentations have been widely studied to improve the accuracy and robustness of classifiers. However, the potential of image augmentation in improving GAN models for image synthesis has not been thoroughly investigated in previous studies. In this work, we systematically study the effectiveness of various existing augmentation techniques for GAN training in a variety of settings. We provide insights and guidelines on how to augment images for both vanilla GANs and GANs with regularizations, improving the fidelity of the generated images substantially. Surprisingly, we find that vanilla GANs attain generation quality on par with recent state-of-the-art results if we use augmentations on both real and generated images. When this GAN training is combined with other augmentation-based regularization techniques, such as contrastive loss and consistency regularization, the augmentations further improve the quality of generated images. We provide new state-of-the-art results for conditional generation on CIFAR-10 with both consistency loss and contrastive loss as additional regularizations.
Submitted 3 June, 2020;
originally announced June 2020.
-
Benchmarking Differentially Private Residual Networks for Medical Imagery
Authors:
Sahib Singh,
Harshvardhan Sikka,
Sasikanth Kotti,
Andrew Trask
Abstract:
In this paper we measure the effectiveness of $ε$-Differential Privacy (DP) when applied to medical imaging. We compare two robust differential privacy mechanisms: Local-DP and DP-SGD and benchmark their performance when analyzing medical imagery records. We analyze the trade-off between the model's accuracy and the level of privacy it guarantees, and also take a closer look to evaluate how useful these theoretical privacy guarantees actually prove to be in the real world medical setting.
Submitted 4 September, 2020; v1 submitted 26 May, 2020;
originally announced May 2020.
-
WoodFisher: Efficient Second-Order Approximation for Neural Network Compression
Authors:
Sidak Pal Singh,
Dan Alistarh
Abstract:
Second-order information, in the form of Hessian- or Inverse-Hessian-vector products, is a fundamental tool for solving optimization problems. Recently, there has been significant interest in utilizing this information in the context of deep neural networks; however, relatively little is known about the quality of existing approximations in this context. Our work examines this question, identifies issues with existing approaches, and proposes a method called WoodFisher to compute a faithful and efficient estimate of the inverse Hessian.
Our main application is to neural network compression, where we build on the classic Optimal Brain Damage/Surgeon framework. We demonstrate that WoodFisher significantly outperforms popular state-of-the-art methods for one-shot pruning. Further, even when iterative, gradual pruning is considered, our method results in a gain in test accuracy over the state-of-the-art approaches, for pruning popular neural networks (like ResNet-50, MobileNetV1) trained on standard image classification datasets such as ImageNet ILSVRC. We examine how our method can be extended to take into account first-order information, as well as illustrate its ability to automatically set layer-wise pruning thresholds and perform compression in the limited-data regime. The code is available at the following link, https://github.com/IST-DASLab/WoodFisher.
Submitted 25 November, 2020; v1 submitted 29 April, 2020;
originally announced April 2020.
-
Robust Density Estimation under Besov IPM Losses
Authors:
Ananya Uppal,
Shashank Singh,
Barnabas Poczos
Abstract:
We study minimax convergence rates of nonparametric density estimation in the Huber contamination model, in which a proportion of the data comes from an unknown outlier distribution. We provide the first results for this problem under a large family of losses, called Besov integral probability metrics (IPMs), that includes $\mathcal{L}^p$, Wasserstein, Kolmogorov-Smirnov, and other common distances between probability distributions. Specifically, under a range of smoothness assumptions on the population and outlier distributions, we show that a re-scaled thresholding wavelet series estimator achieves minimax optimal convergence rates under a wide variety of losses. Finally, based on connections that have recently been shown between nonparametric density estimation under IPM losses and generative adversarial networks (GANs), we show that certain GAN architectures also achieve these minimax rates.
Submitted 6 September, 2021; v1 submitted 18 April, 2020;
originally announced April 2020.
-
Multiclass Classification via Class-Weighted Nearest Neighbors
Authors:
Justin Khim,
Ziyu Xu,
Shashank Singh
Abstract:
We study statistical properties of the k-nearest neighbors algorithm for multiclass classification, with a focus on settings where the number of classes may be large and/or classes may be highly imbalanced. In particular, we consider a variant of the k-nearest neighbor classifier with non-uniform class-weightings, for which we derive upper and minimax lower bounds on accuracy, class-weighted risk, and uniform error. Additionally, we show that uniform error bounds lead to bounds on the difference between empirical confusion matrix quantities and their population counterparts across a set of weights. As a result, we may adjust the class weights to optimize classification metrics such as F1 score or Matthew's Correlation Coefficient that are commonly used in practice, particularly in settings with imbalanced classes. We additionally provide a simple example to instantiate our bounds and numerical experiments.
Submitted 3 May, 2020; v1 submitted 9 April, 2020;
originally announced April 2020.