-
Nyström Kernel Stein Discrepancy
Authors:
Florian Kalinke,
Zoltan Szabo,
Bharath K. Sriperumbudur
Abstract:
Kernel methods underpin many of the most successful approaches in data science and statistics, and they allow representing probability measures as elements of a reproducing kernel Hilbert space without loss of information. Recently, the kernel Stein discrepancy (KSD), which combines Stein's method with kernel techniques, gained considerable attention. Through the Stein operator, KSD allows the con…
▽ More
Kernel methods underpin many of the most successful approaches in data science and statistics, and they allow representing probability measures as elements of a reproducing kernel Hilbert space without loss of information. Recently, the kernel Stein discrepancy (KSD), which combines Stein's method with kernel techniques, gained considerable attention. Through the Stein operator, KSD allows the construction of powerful goodness-of-fit tests where it is sufficient to know the target distribution up to a multiplicative constant. However, the typical U- and V-statistic-based KSD estimators suffer from a quadratic runtime complexity, which hinders their application in large-scale settings. In this work, we propose a Nyström-based KSD acceleration -- with runtime $\mathcal O\!\left(mn+m^3\right)$ for $n$ samples and $m\ll n$ Nyström points -- , show its $\sqrt{n}$-consistency under the null with a classical sub-Gaussian assumption, and demonstrate its applicability for goodness-of-fit testing on a suite of benchmarks.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Minimax Optimal Goodness-of-Fit Testing with Kernel Stein Discrepancy
Authors:
Omar Hagrass,
Bharath Sriperumbudur,
Krishnakumar Balasubramanian
Abstract:
We explore the minimax optimality of goodness-of-fit tests on general domains using the kernelized Stein discrepancy (KSD). The KSD framework offers a flexible approach for goodness-of-fit testing, avoiding strong distributional assumptions, accommodating diverse data structures beyond Euclidean spaces, and relying only on partial knowledge of the reference distribution, while maintaining computat…
▽ More
We explore the minimax optimality of goodness-of-fit tests on general domains using the kernelized Stein discrepancy (KSD). The KSD framework offers a flexible approach for goodness-of-fit testing, avoiding strong distributional assumptions, accommodating diverse data structures beyond Euclidean spaces, and relying only on partial knowledge of the reference distribution, while maintaining computational efficiency. We establish a general framework and an operator-theoretic representation of the KSD, encompassing many existing KSD tests in the literature, which vary depending on the domain. We reveal the characteristics and limitations of KSD and demonstrate its non-optimality under a certain alternative space, defined over general domains when considering $χ^2$-divergence as the separation metric. To address this issue of non-optimality, we propose a modified, minimax optimal test by incorporating a spectral regularizer, thereby overcoming the shortcomings of standard KSD tests. Our results are established under a weak moment condition on the Stein kernel, which relaxes the bounded kernel assumption required by prior work in the analysis of kernel-based hypothesis testing. Additionally, we introduce an adaptive test capable of achieving minimax optimality up to a logarithmic factor by adapting to unknown parameters. Through numerical experiments, we illustrate the superior performance of our proposed tests across various domains compared to their unregularized counterparts.
△ Less
Submitted 20 May, 2024; v1 submitted 12 April, 2024;
originally announced April 2024.
-
Spectral Regularized Kernel Goodness-of-Fit Tests
Authors:
Omar Hagrass,
Bharath K. Sriperumbudur,
Bing Li
Abstract:
Maximum mean discrepancy (MMD) has enjoyed a lot of success in many machine learning and statistical applications, including non-parametric hypothesis testing, because of its ability to handle non-Euclidean data. Recently, it has been demonstrated in Balasubramanian et al.(2021) that the goodness-of-fit test based on MMD is not minimax optimal while a Tikhonov regularized version of it is, for an…
▽ More
Maximum mean discrepancy (MMD) has enjoyed a lot of success in many machine learning and statistical applications, including non-parametric hypothesis testing, because of its ability to handle non-Euclidean data. Recently, it has been demonstrated in Balasubramanian et al.(2021) that the goodness-of-fit test based on MMD is not minimax optimal while a Tikhonov regularized version of it is, for an appropriate choice of the regularization parameter. However, the results in Balasubramanian et al. (2021) are obtained under the restrictive assumptions of the mean element being zero, and the uniform boundedness condition on the eigenfunctions of the integral operator. Moreover, the test proposed in Balasubramanian et al. (2021) is not practical as it is not computable for many kernels. In this paper, we address these shortcomings and extend the results to general spectral regularizers that include Tikhonov regularization.
△ Less
Submitted 8 August, 2023;
originally announced August 2023.
-
Kernel $ε$-Greedy for Contextual Bandits
Authors:
Sakshi Arya,
Bharath K. Sriperumbudur
Abstract:
We consider a kernelized version of the $ε$-greedy strategy for contextual bandits. More precisely, in a setting with finitely many arms, we consider that the mean reward functions lie in a reproducing kernel Hilbert space (RKHS). We propose an online weighted kernel ridge regression estimator for the reward functions. Under some conditions on the exploration probability sequence, $\{ε_t\}_t$, and…
▽ More
We consider a kernelized version of the $ε$-greedy strategy for contextual bandits. More precisely, in a setting with finitely many arms, we consider that the mean reward functions lie in a reproducing kernel Hilbert space (RKHS). We propose an online weighted kernel ridge regression estimator for the reward functions. Under some conditions on the exploration probability sequence, $\{ε_t\}_t$, and choice of the regularization parameter, $\{λ_t\}_t$, we show that the proposed estimator is consistent. We also show that for any choice of kernel and the corresponding RKHS, we achieve a sub-linear regret rate depending on the intrinsic dimensionality of the RKHS. Furthermore, we achieve the optimal regret rate of $\sqrt{T}$ under a margin condition for finite-dimensional RKHS.
△ Less
Submitted 29 June, 2023;
originally announced June 2023.
-
Spectral Regularized Kernel Two-Sample Tests
Authors:
Omar Hagrass,
Bharath K. Sriperumbudur,
Bing Li
Abstract:
Over the last decade, an approach that has gained a lot of popularity to tackle nonparametric testing problems on general (i.e., non-Euclidean) domains is based on the notion of reproducing kernel Hilbert space (RKHS) embedding of probability distributions. The main goal of our work is to understand the optimality of two-sample tests constructed based on this approach. First, we show the popular M…
▽ More
Over the last decade, an approach that has gained a lot of popularity to tackle nonparametric testing problems on general (i.e., non-Euclidean) domains is based on the notion of reproducing kernel Hilbert space (RKHS) embedding of probability distributions. The main goal of our work is to understand the optimality of two-sample tests constructed based on this approach. First, we show the popular MMD (maximum mean discrepancy) two-sample test to be not optimal in terms of the separation boundary measured in Hellinger distance. Second, we propose a modification to the MMD test based on spectral regularization by taking into account the covariance information (which is not captured by the MMD test) and prove the proposed test to be minimax optimal with a smaller separation boundary than that achieved by the MMD test. Third, we propose an adaptive version of the above test which involves a data-driven strategy to choose the regularization parameter and show the adaptive test to be almost minimax optimal up to a logarithmic factor. Moreover, our results hold for the permutation variant of the test where the test threshold is chosen elegantly through the permutation of the samples. Through numerical experiments on synthetic and real data, we demonstrate the superior performance of the proposed test in comparison to the MMD test and other popular tests in the literature.
△ Less
Submitted 1 May, 2024; v1 submitted 18 December, 2022;
originally announced December 2022.
-
Regularized Stein Variational Gradient Flow
Authors:
Ye He,
Krishnakumar Balasubramanian,
Bharath K. Sriperumbudur,
Jianfeng Lu
Abstract:
The Stein Variational Gradient Descent (SVGD) algorithm is a deterministic particle method for sampling. However, a mean-field analysis reveals that the gradient flow corresponding to the SVGD algorithm (i.e., the Stein Variational Gradient Flow) only provides a constant-order approximation to the Wasserstein Gradient Flow corresponding to the KL-divergence minimization. In this work, we propose t…
▽ More
The Stein Variational Gradient Descent (SVGD) algorithm is a deterministic particle method for sampling. However, a mean-field analysis reveals that the gradient flow corresponding to the SVGD algorithm (i.e., the Stein Variational Gradient Flow) only provides a constant-order approximation to the Wasserstein Gradient Flow corresponding to the KL-divergence minimization. In this work, we propose the Regularized Stein Variational Gradient Flow, which interpolates between the Stein Variational Gradient Flow and the Wasserstein Gradient Flow. We establish various theoretical properties of the Regularized Stein Variational Gradient Flow (and its time-discretization) including convergence to equilibrium, existence and uniqueness of weak solutions, and stability of the solutions. We provide preliminary numerical evidence of the improved performance offered by the regularization.
△ Less
Submitted 8 May, 2024; v1 submitted 14 November, 2022;
originally announced November 2022.
-
Shrinkage Estimation of Higher Order Bochner Integrals
Authors:
Saiteja Utpala,
Bharath K. Sriperumbudur
Abstract:
We consider shrinkage estimation of higher order Hilbert space valued Bochner integrals in a non-parametric setting. We propose estimators that shrink the $U$-statistic estimator of the Bochner integral towards a pre-specified target element in the Hilbert space. Depending on the degeneracy of the kernel of the $U$-statistic, we construct consistent shrinkage estimators with fast rates of converge…
▽ More
We consider shrinkage estimation of higher order Hilbert space valued Bochner integrals in a non-parametric setting. We propose estimators that shrink the $U$-statistic estimator of the Bochner integral towards a pre-specified target element in the Hilbert space. Depending on the degeneracy of the kernel of the $U$-statistic, we construct consistent shrinkage estimators with fast rates of convergence, and develop oracle inequalities comparing the risks of the the $U$-statistic estimator and its shrinkage version. Surprisingly, we show that the shrinkage estimator designed by assuming complete degeneracy of the kernel of the $U$-statistic is a consistent estimator even when the kernel is not complete degenerate. This work subsumes and improves upon Krikamol et al., 2016, JMLR and Zhou et al., 2019, JMVA, which only handle mean element and covariance operator estimation in a reproducing kernel Hilbert space. We also specialize our results to normal mean estimation and show that for $d\ge 3$, the proposed estimator strictly improves upon the sample mean in terms of the mean squared error.
△ Less
Submitted 21 July, 2022; v1 submitted 13 July, 2022;
originally announced July 2022.
-
Robust Topological Inference in the Presence of Outliers
Authors:
Siddharth Vishwanath,
Bharath K. Sriperumbudur,
Kenji Fukumizu,
Satoshi Kuriki
Abstract:
The distance function to a compact set plays a crucial role in the paradigm of topological data analysis. In particular, the sublevel sets of the distance function are used in the computation of persistent homology -- a backbone of the topological data analysis pipeline. Despite its stability to perturbations in the Hausdorff distance, persistent homology is highly sensitive to outliers. In this w…
▽ More
The distance function to a compact set plays a crucial role in the paradigm of topological data analysis. In particular, the sublevel sets of the distance function are used in the computation of persistent homology -- a backbone of the topological data analysis pipeline. Despite its stability to perturbations in the Hausdorff distance, persistent homology is highly sensitive to outliers. In this work, we develop a framework of statistical inference for persistent homology in the presence of outliers. Drawing inspiration from recent developments in robust statistics, we propose a $\textit{median-of-means}$ variant of the distance function ($\textsf{MoM Dist}$), and establish its statistical properties. In particular, we show that, even in the presence of outliers, the sublevel filtrations and weighted filtrations induced by $\textsf{MoM Dist}$ are both consistent estimators of the true underlying population counterpart, and their rates of convergence in the bottleneck metric are controlled by the fraction of outliers in the data. Finally, we demonstrate the advantages of the proposed methodology through simulations and applications.
△ Less
Submitted 3 June, 2022;
originally announced June 2022.
-
Cycle Consistent Probability Divergences Across Different Spaces
Authors:
Zhengxin Zhang,
Youssef Mroueh,
Ziv Goldfeld,
Bharath K. Sriperumbudur
Abstract:
Discrepancy measures between probability distributions are at the core of statistical inference and machine learning. In many applications, distributions of interest are supported on different spaces, and yet a meaningful correspondence between data points is desired. Motivated to explicitly encode consistent bidirectional maps into the discrepancy measure, this work proposes a novel unbalanced Mo…
▽ More
Discrepancy measures between probability distributions are at the core of statistical inference and machine learning. In many applications, distributions of interest are supported on different spaces, and yet a meaningful correspondence between data points is desired. Motivated to explicitly encode consistent bidirectional maps into the discrepancy measure, this work proposes a novel unbalanced Monge optimal transport formulation for matching, up to isometries, distributions on different spaces. Our formulation arises as a principled relaxation of the Gromov-Haussdroff distance between metric spaces, and employs two cycle-consistent maps that push forward each distribution onto the other. We study structural properties of the proposed discrepancy and, in particular, show that it captures the popular cycle-consistent generative adversarial network (GAN) framework as a special case, thereby providing the theory to explain it. Motivated by computational efficiency, we then kernelize the discrepancy and restrict the map**s to parametric function classes. The resulting kernelized version is coined the generalized maximum mean discrepancy (GMMD). Convergence rates for empirical estimation of GMMD are studied and experiments to support our theory are provided.
△ Less
Submitted 22 November, 2021;
originally announced November 2021.
-
Statistical Optimality and Computational Efficiency of Nyström Kernel PCA
Authors:
Nicholas Sterge,
Bharath Sriperumbudur
Abstract:
Kernel methods provide an elegant framework for develo** nonlinear learning algorithms from simple linear methods. Though these methods have superior empirical performance in several real data applications, their usefulness is inhibited by the significant computational burden incurred in large sample situations. Various approximation schemes have been proposed in the literature to alleviate thes…
▽ More
Kernel methods provide an elegant framework for develo** nonlinear learning algorithms from simple linear methods. Though these methods have superior empirical performance in several real data applications, their usefulness is inhibited by the significant computational burden incurred in large sample situations. Various approximation schemes have been proposed in the literature to alleviate these computational issues, and the approximate kernel machines are shown to retain the empirical performance. However, the theoretical properties of these approximate kernel machines are less well understood. In this work, we theoretically study the trade-off between computational complexity and statistical accuracy in Nyström approximate kernel principal component analysis (KPCA), wherein we show that the Nyström approximate KPCA matches the statistical performance of (non-approximate) KPCA while remaining computationally beneficial. Additionally, we show that Nyström approximate KPCA outperforms the statistical behavior of another popular approximation scheme, the random feature approximation, when applied to KPCA.
△ Less
Submitted 18 May, 2021;
originally announced May 2021.
-
Robust Persistence Diagrams using Reproducing Kernels
Authors:
Siddharth Vishwanath,
Kenji Fukumizu,
Satoshi Kuriki,
Bharath Sriperumbudur
Abstract:
Persistent homology has become an important tool for extracting geometric and topological features from data, whose multi-scale features are summarized in a persistence diagram. From a statistical perspective, however, persistence diagrams are very sensitive to perturbations in the input space. In this work, we develop a framework for constructing robust persistence diagrams from superlevel filtra…
▽ More
Persistent homology has become an important tool for extracting geometric and topological features from data, whose multi-scale features are summarized in a persistence diagram. From a statistical perspective, however, persistence diagrams are very sensitive to perturbations in the input space. In this work, we develop a framework for constructing robust persistence diagrams from superlevel filtrations of robust density estimators constructed using reproducing kernels. Using an analogue of the influence function on the space of persistence diagrams, we establish the proposed framework to be less sensitive to outliers. The robust persistence diagrams are shown to be consistent estimators in bottleneck distance, with the convergence rate controlled by the smoothness of the kernel. This, in turn, allows us to construct uniform confidence bands in the space of persistence diagrams. Finally, we demonstrate the superiority of the proposed approach on benchmark datasets.
△ Less
Submitted 3 June, 2022; v1 submitted 17 June, 2020;
originally announced June 2020.
-
On Distance and Kernel Measures of Conditional Independence
Authors:
Tianhong Sheng,
Bharath K. Sriperumbudur
Abstract:
Measuring conditional independence is one of the important tasks in statistical inference and is fundamental in causal discovery, feature selection, dimensionality reduction, Bayesian network learning, and others. In this work, we explore the connection between conditional independence measures induced by distances on a metric space and reproducing kernels associated with a reproducing kernel Hilb…
▽ More
Measuring conditional independence is one of the important tasks in statistical inference and is fundamental in causal discovery, feature selection, dimensionality reduction, Bayesian network learning, and others. In this work, we explore the connection between conditional independence measures induced by distances on a metric space and reproducing kernels associated with a reproducing kernel Hilbert space (RKHS). For certain distance and kernel pairs, we show the distance-based conditional independence measures to be equivalent to that of kernel-based measures. On the other hand, we also show that some popular---in machine learning---kernel conditional independence measures based on the Hilbert-Schmidt norm of a certain cross-conditional covariance operator, do not have a simple distance representation, except in some limiting cases. This paper, therefore, shows the distance and kernel measures of conditional independence to be not quite equivalent unlike in the case of joint independence as shown by Sejdinovic et al. (2013).
△ Less
Submitted 17 August, 2020; v1 submitted 2 December, 2019;
originally announced December 2019.
-
Gaussian Sketching yields a J-L Lemma in RKHS
Authors:
Samory Kpotufe,
Bharath K. Sriperumbudur
Abstract:
The main contribution of the paper is to show that Gaussian sketching of a kernel-Gram matrix $\boldsymbol K$ yields an operator whose counterpart in an RKHS $\mathcal H$, is a \emph{random projection} operator---in the spirit of Johnson-Lindenstrauss (J-L) lemma. To be precise, given a random matrix $Z$ with i.i.d. Gaussian entries, we show that a sketch $Z\boldsymbol{K}$ corresponds to a particu…
▽ More
The main contribution of the paper is to show that Gaussian sketching of a kernel-Gram matrix $\boldsymbol K$ yields an operator whose counterpart in an RKHS $\mathcal H$, is a \emph{random projection} operator---in the spirit of Johnson-Lindenstrauss (J-L) lemma. To be precise, given a random matrix $Z$ with i.i.d. Gaussian entries, we show that a sketch $Z\boldsymbol{K}$ corresponds to a particular random operator in (infinite-dimensional) Hilbert space $\mathcal H$ that maps functions $f \in \mathcal H$ to a low-dimensional space $\mathbb R^d$, while preserving a weighted RKHS inner-product of the form $\langle f, g \rangle_Σ \doteq \langle f, Σ^3 g \rangle_{\mathcal H}$, where $Σ$ is the \emph{covariance} operator induced by the data distribution. In particular, under similar assumptions as in kernel PCA (KPCA), or kernel $k$-means (K-$k$-means), well-separated subsets of feature-space $\{K(\cdot, x): x \in \cal X\}$ remain well-separated after such operation, which suggests similar benefits as in KPCA and/or K-$k$-means, albeit at the much cheaper cost of a random projection. In particular, our convergence rates suggest that, given a large dataset $\{X_i\}_{i=1}^N$ of size $N$, we can build the Gram matrix $\boldsymbol K$ on a much smaller subsample of size $n\ll N$, so that the sketch $Z\boldsymbol K$ is very cheap to obtain and subsequently apply as a projection operator on the original data $\{X_i\}_{i=1}^N$. We verify these insights empirically on synthetic data, and on real-world clustering applications.
△ Less
Submitted 11 March, 2020; v1 submitted 15 August, 2019;
originally announced August 2019.
-
Gain with no Pain: Efficient Kernel-PCA by Nyström Sampling
Authors:
Nicholas Sterge,
Bharath Sriperumbudur,
Lorenzo Rosasco,
Alessandro Rudi
Abstract:
In this paper, we propose and study a Nyström based approach to efficient large scale kernel principal component analysis (PCA). The latter is a natural nonlinear extension of classical PCA based on considering a nonlinear feature map or the corresponding kernel. Like other kernel approaches, kernel PCA enjoys good mathematical and statistical properties but, numerically, it scales poorly with the…
▽ More
In this paper, we propose and study a Nyström based approach to efficient large scale kernel principal component analysis (PCA). The latter is a natural nonlinear extension of classical PCA based on considering a nonlinear feature map or the corresponding kernel. Like other kernel approaches, kernel PCA enjoys good mathematical and statistical properties but, numerically, it scales poorly with the sample size. Our analysis shows that Nyström sampling greatly improves computational efficiency without incurring any loss of statistical accuracy. While similar effects have been observed in supervised learning, this is the first such result for PCA. Our theoretical findings, which are also illustrated by numerical results, are based on a combination of analytic and concentration of measure techniques. Our study is more broadly motivated by the question of understanding the interplay between statistical and computational requirements for learning.
△ Less
Submitted 11 July, 2019;
originally announced July 2019.
-
Local minimax rates for closeness testing of discrete distributions
Authors:
Joseph Lam-Weil,
Alexandra Carpentier,
Bharath K. Sriperumbudur
Abstract:
We consider the closeness testing problem for discrete distributions. The goal is to distinguish whether two samples are drawn from the same unspecified distribution, or whether their respective distributions are separated in $L_1$-norm. In this paper, we focus on adapting the rate to the shape of the underlying distributions, i.e. we consider \textit{a local minimax setting}. We provide, to the b…
▽ More
We consider the closeness testing problem for discrete distributions. The goal is to distinguish whether two samples are drawn from the same unspecified distribution, or whether their respective distributions are separated in $L_1$-norm. In this paper, we focus on adapting the rate to the shape of the underlying distributions, i.e. we consider \textit{a local minimax setting}. We provide, to the best of our knowledge, the first local minimax rate for the separation distance up to logarithmic factors, together with a test that achieves it. In view of the rate, closeness testing turns out to be substantially harder than the related one-sample testing problem over a wide range of cases.
△ Less
Submitted 19 January, 2021; v1 submitted 1 February, 2019;
originally announced February 2019.
-
On Kernel Derivative Approximation with Random Fourier Features
Authors:
Zoltan Szabo,
Bharath K. Sriperumbudur
Abstract:
Random Fourier features (RFF) represent one of the most popular and wide-spread techniques in machine learning to scale up kernel algorithms. Despite the numerous successful applications of RFFs, unfortunately, quite little is understood theoretically on their optimality and limitations of their performance. Only recently, precise statistical-computational trade-offs have been established for RFFs…
▽ More
Random Fourier features (RFF) represent one of the most popular and wide-spread techniques in machine learning to scale up kernel algorithms. Despite the numerous successful applications of RFFs, unfortunately, quite little is understood theoretically on their optimality and limitations of their performance. Only recently, precise statistical-computational trade-offs have been established for RFFs in the approximation of kernel values, kernel ridge regression, kernel PCA and SVM classification. Our goal is to spark the investigation of optimality of RFF-based approximations in tasks involving not only function values but derivatives, which naturally lead to optimization problems with kernel derivatives. Particularly, in this paper, we focus on the approximation quality of RFFs for kernel derivatives and prove that the existing finite-sample guarantees can be improved exponentially in terms of the domain where they hold, using recent tools from unbounded empirical process theory. Our result implies that the same approximation guarantee is attainable for kernel derivatives using RFF as achieved for kernel values.
△ Less
Submitted 9 February, 2019; v1 submitted 11 October, 2018;
originally announced October 2018.
-
Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences
Authors:
Motonobu Kanagawa,
Philipp Hennig,
Dino Sejdinovic,
Bharath K Sriperumbudur
Abstract:
This paper is an attempt to bridge the conceptual gaps between researchers working on the two widely used approaches based on positive definite kernels: Bayesian learning or inference using Gaussian processes on the one side, and frequentist kernel methods based on reproducing kernel Hilbert spaces on the other. It is widely known in machine learning that these two formalisms are closely related;…
▽ More
This paper is an attempt to bridge the conceptual gaps between researchers working on the two widely used approaches based on positive definite kernels: Bayesian learning or inference using Gaussian processes on the one side, and frequentist kernel methods based on reproducing kernel Hilbert spaces on the other. It is widely known in machine learning that these two formalisms are closely related; for instance, the estimator of kernel ridge regression is identical to the posterior mean of Gaussian process regression. However, they have been studied and developed almost independently by two essentially separate communities, and this makes it difficult to seamlessly transfer results between them. Our aim is to overcome this potential difficulty. To this end, we review several old and new results and concepts from either side, and juxtapose algorithmic quantities from each framework to highlight close similarities. We also provide discussions on subtle philosophical and theoretical differences between the two approaches.
△ Less
Submitted 6 July, 2018;
originally announced July 2018.
-
Minimax Estimation of Quadratic Fourier Functionals
Authors:
Shashank Singh,
Bharath K. Sriperumbudur,
Barnabás Póczos
Abstract:
We study estimation of (semi-)inner products between two nonparametric probability distributions, given IID samples from each distribution. These products include relatively well-studied classical $\mathcal{L}^2$ and Sobolev inner products, as well as those induced by translation-invariant reproducing kernels, for which we believe our results are the first. We first propose estimators for these qu…
▽ More
We study estimation of (semi-)inner products between two nonparametric probability distributions, given IID samples from each distribution. These products include relatively well-studied classical $\mathcal{L}^2$ and Sobolev inner products, as well as those induced by translation-invariant reproducing kernels, for which we believe our results are the first. We first propose estimators for these quantities, and the induced (semi)norms and (pseudo)metrics. We then prove non-asymptotic upper bounds on their mean squared error, in terms of weights both of the inner product and of the two distributions, in the Fourier basis. Finally, we prove minimax lower bounds that imply rate-optimality of the proposed estimators over Fourier ellipsoids.
△ Less
Submitted 1 September, 2018; v1 submitted 30 March, 2018;
originally announced March 2018.
-
Convergence Analysis of Deterministic Kernel-Based Quadrature Rules in Misspecified Settings
Authors:
Motonobu Kanagawa,
Bharath K. Sriperumbudur,
Kenji Fukumizu
Abstract:
This paper presents a convergence analysis of kernel-based quadrature rules in misspecified settings, focusing on deterministic quadrature in Sobolev spaces. In particular, we deal with misspecified settings where a test integrand is less smooth than a Sobolev RKHS based on which a quadrature rule is constructed. We provide convergence guarantees based on two different assumptions on a quadrature…
▽ More
This paper presents a convergence analysis of kernel-based quadrature rules in misspecified settings, focusing on deterministic quadrature in Sobolev spaces. In particular, we deal with misspecified settings where a test integrand is less smooth than a Sobolev RKHS based on which a quadrature rule is constructed. We provide convergence guarantees based on two different assumptions on a quadrature rule: one on quadrature weights, and the other on design points. More precisely, we show that convergence rates can be derived (i) if the sum of absolute weights remains constant (or does not increase quickly), or (ii) if the minimum distance between design points does not decrease very quickly. As a consequence of the latter result, we derive a rate of convergence for Bayesian quadrature in misspecified settings. We reveal a condition on design points to make Bayesian quadrature robust to misspecification, and show that, under this condition, it may adaptively achieve the optimal rate of convergence in the Sobolev space of a lesser order (i.e., of the unknown smoothness of a test integrand), under a slightly stronger regularity condition on the integrand.
△ Less
Submitted 30 October, 2018; v1 submitted 1 September, 2017;
originally announced September 2017.
-
Characteristic and Universal Tensor Product Kernels
Authors:
Zoltan Szabo,
Bharath K. Sriperumbudur
Abstract:
Maximum mean discrepancy (MMD), also called energy distance or N-distance in statistics and Hilbert-Schmidt independence criterion (HSIC), specifically distance covariance in statistics, are among the most popular and successful approaches to quantify the difference and independence of random variables, respectively. Thanks to their kernel-based foundations, MMD and HSIC are applicable on a wide v…
▽ More
Maximum mean discrepancy (MMD), also called energy distance or N-distance in statistics and Hilbert-Schmidt independence criterion (HSIC), specifically distance covariance in statistics, are among the most popular and successful approaches to quantify the difference and independence of random variables, respectively. Thanks to their kernel-based foundations, MMD and HSIC are applicable on a wide variety of domains. Despite their tremendous success, quite little is known about when HSIC characterizes independence and when MMD with tensor product kernel can discriminate probability distributions. In this paper, we answer these questions by studying various notions of characteristic property of the tensor product kernel.
△ Less
Submitted 2 August, 2018; v1 submitted 27 August, 2017;
originally announced August 2017.
-
Adaptive Clustering Using Kernel Density Estimators
Authors:
Ingo Steinwart,
Bharath K. Sriperumbudur,
Philipp Thomann
Abstract:
We derive and analyze a generic, recursive algorithm for estimating all splits in a finite cluster tree as well as the corresponding clusters. We further investigate statistical properties of this generic clustering algorithm when it receives level set estimates from a kernel density estimator. In particular, we derive finite sample guarantees, consistency, rates of convergence, and an adaptive da…
▽ More
We derive and analyze a generic, recursive algorithm for estimating all splits in a finite cluster tree as well as the corresponding clusters. We further investigate statistical properties of this generic clustering algorithm when it receives level set estimates from a kernel density estimator. In particular, we derive finite sample guarantees, consistency, rates of convergence, and an adaptive data-driven strategy for choosing the kernel bandwidth. For these results we do not need continuity assumptions on the density such as Hölder continuity, but only require intuitive geometric assumptions of non-parametric nature.
△ Less
Submitted 1 November, 2021; v1 submitted 17 August, 2017;
originally announced August 2017.
-
Approximate Kernel PCA Using Random Features: Computational vs. Statistical Trade-off
Authors:
Bharath Sriperumbudur,
Nicholas Sterge
Abstract:
Kernel methods are powerful learning methodologies that allow to perform non-linear data analysis. Despite their popularity, they suffer from poor scalability in big data scenarios. Various approximation methods, including random feature approximation, have been proposed to alleviate the problem. However, the statistical consistency of most of these approximate kernel methods is not well understoo…
▽ More
Kernel methods are powerful learning methodologies that allow to perform non-linear data analysis. Despite their popularity, they suffer from poor scalability in big data scenarios. Various approximation methods, including random feature approximation, have been proposed to alleviate the problem. However, the statistical consistency of most of these approximate kernel methods is not well understood except for kernel ridge regression wherein it has been shown that the random feature approximation is not only computationally efficient but also statistically consistent with a minimax optimal rate of convergence. In this paper, we investigate the efficacy of random feature approximation in the context of kernel principal component analysis (KPCA) by studying the trade-off between computational and statistical behaviors of approximate KPCA. We show that the approximate KPCA is both computationally and statistically efficient compared to KPCA in terms of the error associated with reconstructing a kernel function based on its projection onto the corresponding eigenspaces. The analysis hinges on Bernstein-type inequalities for the operator and Hilbert-Schmidt norms of a self-adjoint Hilbert-Schmidt operator-valued U-statistics, which are of independent interest.
△ Less
Submitted 11 June, 2022; v1 submitted 20 June, 2017;
originally announced June 2017.
-
Kernel Mean Embedding of Distributions: A Review and Beyond
Authors:
Krikamol Muandet,
Kenji Fukumizu,
Bharath Sriperumbudur,
Bernhard Schölkopf
Abstract:
A Hilbert space embedding of a distribution---in short, a kernel mean embedding---has recently emerged as a powerful tool for machine learning and inference. The basic idea behind this framework is to map distributions into a reproducing kernel Hilbert space (RKHS) in which the whole arsenal of kernel methods can be extended to probability measures. It can be viewed as a generalization of the orig…
▽ More
A Hilbert space embedding of a distribution---in short, a kernel mean embedding---has recently emerged as a powerful tool for machine learning and inference. The basic idea behind this framework is to map distributions into a reproducing kernel Hilbert space (RKHS) in which the whole arsenal of kernel methods can be extended to probability measures. It can be viewed as a generalization of the original "feature map" common to support vector machines (SVMs) and other kernel methods. While initially closely associated with the latter, it has meanwhile found application in fields ranging from kernel machines and probabilistic modeling to statistical inference, causal discovery, and deep learning. The goal of this survey is to give a comprehensive review of existing work and recent advances in this research area, and to discuss the most challenging issues and open problems that could lead to new research directions. The survey begins with a brief introduction to the RKHS and positive definite kernels which forms the backbone of this survey, followed by a thorough discussion of the Hilbert space embedding of marginal distributions, theoretical guarantees, and a review of its applications. The embedding of distributions enables us to apply RKHS methods to probability measures which prompts a wide range of applications such as kernel two-sample testing, independent testing, and learning on distributional data. Next, we discuss the Hilbert space embedding for conditional distributions, give theoretical insights, and review some applications. The conditional mean embedding enables us to perform sum, product, and Bayes' rules---which are ubiquitous in graphical model, probabilistic inference, and reinforcement learning---in a non-parametric way. We then discuss relationships between this framework and other related areas. Lastly, we give some suggestions on future research directions.
△ Less
Submitted 13 December, 2020; v1 submitted 31 May, 2016;
originally announced May 2016.
-
Convergence guarantees for kernel-based quadrature rules in misspecified settings
Authors:
Motonobu Kanagawa,
Bharath K. Sriperumbudur,
Kenji Fukumizu
Abstract:
Kernel-based quadrature rules are becoming important in machine learning and statistics, as they achieve super-$\sqrt{n}$ convergence rates in numerical integration, and thus provide alternatives to Monte Carlo integration in challenging settings where integrands are expensive to evaluate or where integrands are high dimensional. These rules are based on the assumption that the integrand has a cer…
▽ More
Kernel-based quadrature rules are becoming important in machine learning and statistics, as they achieve super-$\sqrt{n}$ convergence rates in numerical integration, and thus provide alternatives to Monte Carlo integration in challenging settings where integrands are expensive to evaluate or where integrands are high dimensional. These rules are based on the assumption that the integrand has a certain degree of smoothness, which is expressed as that the integrand belongs to a certain reproducing kernel Hilbert space (RKHS). However, this assumption can be violated in practice (e.g., when the integrand is a black box function), and no general theory has been established for the convergence of kernel quadratures in such misspecified settings. Our contribution is in proving that kernel quadratures can be consistent even when the integrand does not belong to the assumed RKHS, i.e., when the integrand is less smooth than assumed. Specifically, we derive convergence rates that depend on the (unknown) lesser smoothness of the integrand, where the degree of smoothness is expressed via powers of RKHSs or via Sobolev spaces.
△ Less
Submitted 28 October, 2016; v1 submitted 23 May, 2016;
originally announced May 2016.
-
Optimal Rates for Random Fourier Features
Authors:
Bharath K. Sriperumbudur,
Zoltan Szabo
Abstract:
Kernel methods represent one of the most powerful tools in machine learning to tackle problems expressed in terms of function values and derivatives due to their capability to represent and model complex relations. While these methods show good versatility, they are computationally intensive and have poor scalability to large data as they require operations on Gram matrices. In order to mitigate t…
▽ More
Kernel methods represent one of the most powerful tools in machine learning to tackle problems expressed in terms of function values and derivatives due to their capability to represent and model complex relations. While these methods show good versatility, they are computationally intensive and have poor scalability to large data as they require operations on Gram matrices. In order to mitigate this serious computational limitation, recently randomized constructions have been proposed in the literature, which allow the application of fast linear algorithms. Random Fourier features (RFF) are among the most popular and widely applied constructions: they provide an easily computable, low-dimensional feature representation for shift-invariant kernels. Despite the popularity of RFFs, very little is understood theoretically about their approximation quality. In this paper, we provide a detailed finite-sample theoretical analysis about the approximation quality of RFFs by (i) establishing optimal (in terms of the RFF dimension, and growing set size) performance guarantees in uniform norm, and (ii) presenting guarantees in $L^r$ ($1\le r<\infty$) norms. We also propose an RFF approximation to derivatives of a kernel with a theoretical study on its approximation quality.
△ Less
Submitted 4 November, 2015; v1 submitted 6 June, 2015;
originally announced June 2015.
-
Learning Theory for Distribution Regression
Authors:
Zoltan Szabo,
Bharath Sriperumbudur,
Barnabas Poczos,
Arthur Gretton
Abstract:
We focus on the distribution regression problem: regressing to vector-valued outputs from probability measures. Many important machine learning and statistical tasks fit into this framework, including multi-instance learning and point estimation problems without analytical solution (such as hyperparameter or entropy estimation). Despite the large number of available heuristics in the literature, t…
▽ More
We focus on the distribution regression problem: regressing to vector-valued outputs from probability measures. Many important machine learning and statistical tasks fit into this framework, including multi-instance learning and point estimation problems without analytical solution (such as hyperparameter or entropy estimation). Despite the large number of available heuristics in the literature, the inherent two-stage sampled nature of the problem makes the theoretical analysis quite challenging, since in practice only samples from sampled distributions are observable, and the estimates have to rely on similarities computed between sets of points. To the best of our knowledge, the only existing technique with consistency guarantees for distribution regression requires kernel density estimation as an intermediate step (which often performs poorly in practice), and the domain of the distributions to be compact Euclidean. In this paper, we study a simple, analytically computable, ridge regression-based alternative to distribution regression, where we embed the distributions to a reproducing kernel Hilbert space, and learn the regressor from the embeddings to the outputs. Our main contribution is to prove that this scheme is consistent in the two-stage sampled setup under mild conditions (on separable topological domains enriched with kernels): we present an exact computational-statistical efficiency trade-off analysis showing that our estimator is able to match the one-stage sampled minimax optimal rate [Caponnetto and De Vito, 2007; Steinwart et al., 2009]. This result answers a 17-year-old open question, establishing the consistency of the classical set kernel [Haussler, 1999; Gaertner et. al, 2002] in regression. We also cover consistency for more recent kernels on distributions, including those due to [Christmann and Steinwart, 2010].
△ Less
Submitted 21 October, 2016; v1 submitted 7 November, 2014;
originally announced November 2014.
-
Kernel Mean Estimation via Spectral Filtering
Authors:
Krikamol Muandet,
Bharath Sriperumbudur,
Bernhard Schölkopf
Abstract:
The problem of estimating the kernel mean in a reproducing kernel Hilbert space (RKHS) is central to kernel methods in that it is used by classical approaches (e.g., when centering a kernel PCA matrix), and it also forms the core inference step of modern kernel methods (e.g., kernel-based non-parametric tests) that rely on embedding probability distributions in RKHSs. Muandet et al. (2014) has sho…
▽ More
The problem of estimating the kernel mean in a reproducing kernel Hilbert space (RKHS) is central to kernel methods in that it is used by classical approaches (e.g., when centering a kernel PCA matrix), and it also forms the core inference step of modern kernel methods (e.g., kernel-based non-parametric tests) that rely on embedding probability distributions in RKHSs. Muandet et al. (2014) has shown that shrinkage can help in constructing "better" estimators of the kernel mean than the empirical estimator. The present paper studies the consistency and admissibility of the estimators in Muandet et al. (2014), and proposes a wider class of shrinkage estimators that improve upon the empirical estimator by considering appropriate basis functions. Using the kernel PCA basis, we show that some of these estimators can be constructed using spectral filtering algorithms which are shown to be consistent under some technical assumptions. Our theoretical analysis also reveals a fundamental connection to the kernel-based supervised learning framework. The proposed estimators are simple to implement and perform well in practice.
△ Less
Submitted 4 November, 2014;
originally announced November 2014.
-
Kernel Mean Shrinkage Estimators
Authors:
Krikamol Muandet,
Bharath Sriperumbudur,
Kenji Fukumizu,
Arthur Gretton,
Bernhard Schölkopf
Abstract:
A mean function in a reproducing kernel Hilbert space (RKHS), or a kernel mean, is central to kernel methods in that it is used by many classical algorithms such as kernel principal component analysis, and it also forms the core inference step of modern kernel methods that rely on embedding probability distributions in RKHSs. Given a finite sample, an empirical average has been used commonly as a…
▽ More
A mean function in a reproducing kernel Hilbert space (RKHS), or a kernel mean, is central to kernel methods in that it is used by many classical algorithms such as kernel principal component analysis, and it also forms the core inference step of modern kernel methods that rely on embedding probability distributions in RKHSs. Given a finite sample, an empirical average has been used commonly as a standard estimator of the true kernel mean. Despite a widespread use of this estimator, we show that it can be improved thanks to the well-known Stein phenomenon. We propose a new family of estimators called kernel mean shrinkage estimators (KMSEs), which benefit from both theoretical justifications and good empirical performance. The results demonstrate that the proposed estimators outperform the standard one, especially in a "large d, small n" paradigm.
△ Less
Submitted 25 February, 2016; v1 submitted 21 May, 2014;
originally announced May 2014.
-
Two-stage Sampled Learning Theory on Distributions
Authors:
Zoltan Szabo,
Arthur Gretton,
Barnabas Poczos,
Bharath Sriperumbudur
Abstract:
We focus on the distribution regression problem: regressing to a real-valued response from a probability distribution. Although there exist a large number of similarity measures between distributions, very little is known about their generalization performance in specific learning tasks. Learning problems formulated on distributions have an inherent two-stage sampled difficulty: in practice only s…
▽ More
We focus on the distribution regression problem: regressing to a real-valued response from a probability distribution. Although there exist a large number of similarity measures between distributions, very little is known about their generalization performance in specific learning tasks. Learning problems formulated on distributions have an inherent two-stage sampled difficulty: in practice only samples from sampled distributions are observable, and one has to build an estimate on similarities computed between sets of points. To the best of our knowledge, the only existing method with consistency guarantees for distribution regression requires kernel density estimation as an intermediate step (which suffers from slow convergence issues in high dimensions), and the domain of the distributions to be compact Euclidean. In this paper, we provide theoretical guarantees for a remarkably simple algorithmic alternative to solve the distribution regression problem: embed the distributions to a reproducing kernel Hilbert space, and learn a ridge regressor from the embeddings to the outputs. Our main contribution is to prove the consistency of this technique in the two-stage sampled setting under mild conditions (on separable, topological domains endowed with kernels). For a given total number of observations, we derive convergence rates as an explicit function of the problem difficulty. As a special case, we answer a 15-year-old open question: we establish the consistency of the classical set kernel [Haussler, 1999; Gartner et. al, 2002] in regression, and cover more recent kernels on distributions, including those due to [Christmann and Steinwart, 2010].
△ Less
Submitted 26 January, 2015; v1 submitted 7 February, 2014;
originally announced February 2014.
-
Density Estimation in Infinite Dimensional Exponential Families
Authors:
Bharath Sriperumbudur,
Kenji Fukumizu,
Arthur Gretton,
Aapo Hyvärinen,
Revant Kumar
Abstract:
In this paper, we consider an infinite dimensional exponential family, $\mathcal{P}$ of probability densities, which are parametrized by functions in a reproducing kernel Hilbert space, $H$ and show it to be quite rich in the sense that a broad class of densities on $\mathbb{R}^d$ can be approximated arbitrarily well in Kullback-Leibler (KL) divergence by elements in $\mathcal{P}$. The main goal o…
▽ More
In this paper, we consider an infinite dimensional exponential family, $\mathcal{P}$ of probability densities, which are parametrized by functions in a reproducing kernel Hilbert space, $H$ and show it to be quite rich in the sense that a broad class of densities on $\mathbb{R}^d$ can be approximated arbitrarily well in Kullback-Leibler (KL) divergence by elements in $\mathcal{P}$. The main goal of the paper is to estimate an unknown density, $p_0$ through an element in $\mathcal{P}$. Standard techniques like maximum likelihood estimation (MLE) or pseudo MLE (based on the method of sieves), which are based on minimizing the KL divergence between $p_0$ and $\mathcal{P}$, do not yield practically useful estimators because of their inability to efficiently handle the log-partition function. Instead, we propose an estimator, $\hat{p}_n$ based on minimizing the \emph{Fisher divergence}, $J(p_0\Vert p)$ between $p_0$ and $p\in \mathcal{P}$, which involves solving a simple finite-dimensional linear system. When $p_0\in\mathcal{P}$, we show that the proposed estimator is consistent, and provide a convergence rate of $n^{-\min\left\{\frac{2}{3},\frac{2β+1}{2β+2}\right\}}$ in Fisher divergence under the smoothness assumption that $\log p_0\in\mathcal{R}(C^β)$ for some $β\ge 0$, where $C$ is a certain Hilbert-Schmidt operator on $H$ and $\mathcal{R}(C^β)$ denotes the image of $C^β$. We also investigate the misspecified case of $p_0\notin\mathcal{P}$ and show that $J(p_0\Vert\hat{p}_n)\rightarrow \inf_{p\in\mathcal{P}}J(p_0\Vert p)$ as $n\rightarrow\infty$, and provide a rate for this convergence under a similar smoothness condition as above. Through numerical simulations we demonstrate that the proposed estimator outperforms the non-parametric kernel density estimator, and that the advantage with the proposed estimator grows as $d$ increases.
△ Less
Submitted 26 May, 2017; v1 submitted 12 December, 2013;
originally announced December 2013.
-
Kernel Mean Estimation and Stein's Effect
Authors:
Krikamol Muandet,
Kenji Fukumizu,
Bharath Sriperumbudur,
Arthur Gretton,
Bernhard Schölkopf
Abstract:
A mean function in reproducing kernel Hilbert space, or a kernel mean, is an important part of many applications ranging from kernel principal component analysis to Hilbert-space embedding of distributions. Given finite samples, an empirical average is the standard estimate for the true kernel mean. We show that this estimator can be improved via a well-known phenomenon in statistics called Stein'…
▽ More
A mean function in reproducing kernel Hilbert space, or a kernel mean, is an important part of many applications ranging from kernel principal component analysis to Hilbert-space embedding of distributions. Given finite samples, an empirical average is the standard estimate for the true kernel mean. We show that this estimator can be improved via a well-known phenomenon in statistics called Stein's phenomenon. After consideration, our theoretical analysis reveals the existence of a wide class of estimators that are better than the standard. Focusing on a subset of this class, we propose efficient shrinkage estimators for the kernel mean. Empirical evaluations on several benchmark applications clearly demonstrate that the proposed estimators outperform the standard kernel mean estimator.
△ Less
Submitted 6 June, 2013; v1 submitted 4 June, 2013;
originally announced June 2013.
-
On the Generalization Ability of Online Learning Algorithms for Pairwise Loss Functions
Authors:
Purushottam Kar,
Bharath K Sriperumbudur,
Prateek Jain,
Harish C Karnick
Abstract:
In this paper, we study the generalization properties of online learning based stochastic methods for supervised learning problems where the loss function is dependent on more than one training sample (e.g., metric learning, ranking). We present a generic decoupling technique that enables us to provide Rademacher complexity-based generalization error bounds. Our bounds are in general tighter than…
▽ More
In this paper, we study the generalization properties of online learning based stochastic methods for supervised learning problems where the loss function is dependent on more than one training sample (e.g., metric learning, ranking). We present a generic decoupling technique that enables us to provide Rademacher complexity-based generalization error bounds. Our bounds are in general tighter than those obtained by Wang et al (COLT 2012) for the same problem. Using our decoupling technique, we are further able to obtain fast convergence rates for strongly convex pairwise loss functions. We are also able to analyze a class of memory efficient online learning algorithms for pairwise learning problems that use only a bounded subset of past training samples to update the hypothesis at each step. Finally, in order to complement our generalization bounds, we propose a novel memory efficient online learning algorithm for higher order learning problems with bounded regret guarantees.
△ Less
Submitted 11 May, 2013;
originally announced May 2013.
-
Equivalence of distance-based and RKHS-based statistics in hypothesis testing
Authors:
Dino Sejdinovic,
Bharath Sriperumbudur,
Arthur Gretton,
Kenji Fukumizu
Abstract:
We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on the one hand, the energy distances and distance covariances from the statistics literature; on the other, maximum mean discrepancies (MMD), that is, distances between embeddings of distributions to reproducing kernel Hilbert spaces (RKHS), as established in machine learning. In the cas…
▽ More
We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on the one hand, the energy distances and distance covariances from the statistics literature; on the other, maximum mean discrepancies (MMD), that is, distances between embeddings of distributions to reproducing kernel Hilbert spaces (RKHS), as established in machine learning. In the case where the energy distance is computed with a semimetric of negative type, a positive definite kernel, termed distance kernel, may be defined such that the MMD corresponds exactly to the energy distance. Conversely, for any positive definite kernel, we can interpret the MMD as energy distance with respect to some negative-type semimetric. This equivalence readily extends to distance covariance using kernels on the product space. We determine the class of probability distributions for which the test statistics are consistent against all alternatives. Finally, we investigate the performance of the family of distance kernels in two-sample and independence tests: we show in particular that the energy distance most commonly employed in statistics is just one member of a parametric family of kernels, and that other choices from this family can yield more powerful tests.
△ Less
Submitted 12 November, 2013; v1 submitted 25 July, 2012;
originally announced July 2012.
-
Hypothesis testing using pairwise distances and associated kernels (with Appendix)
Authors:
Dino Sejdinovic,
Arthur Gretton,
Bharath Sriperumbudur,
Kenji Fukumizu
Abstract:
We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on the one hand, the energy distances and distance covariances from the statistics literature; on the other, distances between embeddings of distributions to reproducing kernel Hilbert spaces (RKHS), as established in machine learning. The equivalence holds when energy distances are compu…
▽ More
We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on the one hand, the energy distances and distance covariances from the statistics literature; on the other, distances between embeddings of distributions to reproducing kernel Hilbert spaces (RKHS), as established in machine learning. The equivalence holds when energy distances are computed with semimetrics of negative type, in which case a kernel may be defined such that the RKHS distance between distributions corresponds exactly to the energy distance. We determine the class of probability distributions for which kernels induced by semimetrics are characteristic (that is, for which embeddings of the distributions to an RKHS are injective). Finally, we investigate the performance of this family of kernels in two-sample and independence tests: we show in particular that the energy distance most commonly employed in statistics is just one member of a parametric family of kernels, and that other choices from this family can yield more powerful tests.
△ Less
Submitted 21 May, 2012; v1 submitted 2 May, 2012;
originally announced May 2012.
-
Discussion of: Brownian distance covariance
Authors:
Arthur Gretton,
Kenji Fukumizu,
Bharath K. Sriperumbudur
Abstract:
Discussion on "Brownian distance covariance" by Gábor J. Székely and Maria L. Rizzo [arXiv:1010.0297]
Discussion on "Brownian distance covariance" by Gábor J. Székely and Maria L. Rizzo [arXiv:1010.0297]
△ Less
Submitted 5 October, 2010;
originally announced October 2010.
-
Universality, Characteristic Kernels and RKHS Embedding of Measures
Authors:
Bharath K. Sriperumbudur,
Kenji Fukumizu,
Gert R. G. Lanckriet
Abstract:
A Hilbert space embedding for probability measures has recently been proposed, wherein any probability measure is represented as a mean element in a reproducing kernel Hilbert space (RKHS). Such an embedding has found applications in homogeneity testing, independence testing, dimensionality reduction, etc., with the requirement that the reproducing kernel is characteristic, i.e., the embedding i…
▽ More
A Hilbert space embedding for probability measures has recently been proposed, wherein any probability measure is represented as a mean element in a reproducing kernel Hilbert space (RKHS). Such an embedding has found applications in homogeneity testing, independence testing, dimensionality reduction, etc., with the requirement that the reproducing kernel is characteristic, i.e., the embedding is injective.
In this paper, we generalize this embedding to finite signed Borel measures, wherein any finite signed Borel measure is represented as a mean element in an RKHS. We show that the proposed embedding is injective if and only if the kernel is universal. This therefore, provides a novel characterization of universal kernels, which are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms. By exploiting this relation between universality and the embedding of finite signed Borel measures into an RKHS, we establish the relation between universal and characteristic kernels.
△ Less
Submitted 3 March, 2010;
originally announced March 2010.
-
Hilbert space embeddings and metrics on probability measures
Authors:
Bharath K. Sriperumbudur,
Arthur Gretton,
Kenji Fukumizu,
Bernhard Schölkopf,
Gert R. G. Lanckriet
Abstract:
A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution…
▽ More
A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as $γ_k$, indexed by the kernel function $k$ that defines the inner product in the RKHS.
We present three theoretical properties of $γ_k$. First, we consider the question of determining the conditions on the kernel $k$ for which $γ_k$ is a metric: such $k$ are denoted {\em characteristic kernels}. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g. on compact domains), and are difficult to check, our conditions are straightforward and intuitive: bounded continuous strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on $\bb{R}^d$, then it is characteristic if and only if the support of its Fourier transform is the entire $\bb{R}^d$. Second, we show that there exist distinct distributions that are arbitrarily close in $γ_k$. Third, to understand the nature of the topology induced by $γ_k$, we relate $γ_k$ to other popular metrics on probability measures, and present conditions on the kernel $k$ under which $γ_k$ metrizes the weak topology.
△ Less
Submitted 29 January, 2010; v1 submitted 30 July, 2009;
originally announced July 2009.
-
A D.C. Programming Approach to the Sparse Generalized Eigenvalue Problem
Authors:
Bharath Sriperumbudur,
David Torres,
Gert Lanckriet
Abstract:
In this paper, we consider the sparse eigenvalue problem wherein the goal is to obtain a sparse solution to the generalized eigenvalue problem. We achieve this by constraining the cardinality of the solution to the generalized eigenvalue problem and obtain sparse principal component analysis (PCA), sparse canonical correlation analysis (CCA) and sparse Fisher discriminant analysis (FDA) as speci…
▽ More
In this paper, we consider the sparse eigenvalue problem wherein the goal is to obtain a sparse solution to the generalized eigenvalue problem. We achieve this by constraining the cardinality of the solution to the generalized eigenvalue problem and obtain sparse principal component analysis (PCA), sparse canonical correlation analysis (CCA) and sparse Fisher discriminant analysis (FDA) as special cases. Unlike the $\ell_1$-norm approximation to the cardinality constraint, which previous methods have used in the context of sparse PCA, we propose a tighter approximation that is related to the negative log-likelihood of a Student's t-distribution. The problem is then framed as a d.c. (difference of convex functions) program and is solved as a sequence of convex programs by invoking the majorization-minimization method. The resulting algorithm is proved to exhibit \emph{global convergence} behavior, i.e., for any random initialization, the sequence (subsequence) of iterates generated by the algorithm converges to a stationary point of the d.c. program. The performance of the algorithm is empirically demonstrated on both sparse PCA (finding few relevant genes that explain as much variance as possible in a high-dimensional gene dataset) and sparse CCA (cross-language document retrieval and vocabulary selection for music retrieval) applications.
△ Less
Submitted 12 October, 2009; v1 submitted 12 January, 2009;
originally announced January 2009.
-
Metric Embedding for Nearest Neighbor Classification
Authors:
Bharath K. Sriperumbudur,
Gert R. G. Lanckriet
Abstract:
The distance metric plays an important role in nearest neighbor (NN) classification. Usually the Euclidean distance metric is assumed or a Mahalanobis distance metric is optimized to improve the NN performance. In this paper, we study the problem of embedding arbitrary metric spaces into a Euclidean space with the goal to improve the accuracy of the NN classifier. We propose a solution by appeal…
▽ More
The distance metric plays an important role in nearest neighbor (NN) classification. Usually the Euclidean distance metric is assumed or a Mahalanobis distance metric is optimized to improve the NN performance. In this paper, we study the problem of embedding arbitrary metric spaces into a Euclidean space with the goal to improve the accuracy of the NN classifier. We propose a solution by appealing to the framework of regularization in a reproducing kernel Hilbert space and prove a representer-like theorem for NN classification. The embedding function is then determined by solving a semidefinite program which has an interesting connection to the soft-margin linear binary support vector machine classifier. Although the main focus of this paper is to present a general, theoretical framework for metric embedding in a NN setting, we demonstrate the performance of the proposed method on some benchmark datasets and show that it performs better than the Mahalanobis metric learning algorithm in terms of leave-one-out and generalization errors.
△ Less
Submitted 24 June, 2007;
originally announced June 2007.