
Optimal Rates for Functional Linear Regression with General Regularization

Authors: Naveen Gupta, S. Sivananthan, Bharath K. Sriperumbudur

Abstract: Functional linear regression is one of the fundamental and well-studied methods in functional data analysis. In this work, we investigate the functional linear regression model within the context of reproducing kernel Hilbert space by employing general spectral regularization to approximate the slope function with certain smoothness assumptions. We establish optimal convergence rates for estimation and prediction errors associated with the proposed method under a Hölder type source condition, which generalizes and sharpens all the known results in the literature.

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.08401 [pdf, other]

Nyström Kernel Stein Discrepancy

Authors: Florian Kalinke, Zoltan Szabo, Bharath K. Sriperumbudur

Abstract: Kernel methods underpin many of the most successful approaches in data science and statistics, and they allow representing probability measures as elements of a reproducing kernel Hilbert space without loss of information. Recently, the kernel Stein discrepancy (KSD), which combines Stein's method with kernel techniques, gained considerable attention. Through the Stein operator, KSD allows the construction of powerful goodness-of-fit tests where it is sufficient to know the target distribution up to a multiplicative constant. However, the typical U- and V-statistic-based KSD estimators suffer from a quadratic runtime complexity, which hinders their application in large-scale settings. In this work, we propose a Nyström-based KSD acceleration -- with runtime $\mathcal O\!\left(mn+m^3\right)$ for $n$ samples and $m\ll n$ Nyström points -- , show its $\sqrt{n}$-consistency under the null with a classical sub-Gaussian assumption, and demonstrate its applicability for goodness-of-fit testing on a suite of benchmarks.

Submitted 12 June, 2024; originally announced June 2024.

MSC Class: 46E22 (Primary) 62G10 (Secondary) ACM Class: G.3; I.2.6

arXiv:2404.08278 [pdf, other]

Minimax Optimal Goodness-of-Fit Testing with Kernel Stein Discrepancy

Authors: Omar Hagrass, Bharath Sriperumbudur, Krishnakumar Balasubramanian

Abstract: We explore the minimax optimality of goodness-of-fit tests on general domains using the kernelized Stein discrepancy (KSD). The KSD framework offers a flexible approach for goodness-of-fit testing, avoiding strong distributional assumptions, accommodating diverse data structures beyond Euclidean spaces, and relying only on partial knowledge of the reference distribution, while maintaining computational efficiency. We establish a general framework and an operator-theoretic representation of the KSD, encompassing many existing KSD tests in the literature, which vary depending on the domain. We reveal the characteristics and limitations of KSD and demonstrate its non-optimality under a certain alternative space, defined over general domains when considering $χ^2$-divergence as the separation metric. To address this issue of non-optimality, we propose a modified, minimax optimal test by incorporating a spectral regularizer, thereby overcoming the shortcomings of standard KSD tests. Our results are established under a weak moment condition on the Stein kernel, which relaxes the bounded kernel assumption required by prior work in the analysis of kernel-based hypothesis testing. Additionally, we introduce an adaptive test capable of achieving minimax optimality up to a logarithmic factor by adapting to unknown parameters. Through numerical experiments, we illustrate the superior performance of our proposed tests across various domains compared to their unregularized counterparts.

Submitted 20 May, 2024; v1 submitted 12 April, 2024; originally announced April 2024.

Comments: 54 pages

MSC Class: Primary: 62G10; Secondary: 65J20; 65J22; 46E22; 47A52

arXiv:2310.02607 [pdf, ps, other]

Convergence Analysis of Kernel Conjugate Gradient for Functional Linear Regression

Authors: Naveen Gupta, S. Sivananthan, Bharath K. Sriperumbudur

Abstract: In this paper, we discuss the convergence analysis of the conjugate gradient-based algorithm for the functional linear model in the reproducing kernel Hilbert space framework, utilizing early stopping results in regularization against over-fitting. We establish the convergence rates depending on the regularity condition of the slope function and the decay rate of the eigenvalues of the operator composition of covariance and kernel operator. Our convergence rates match the minimax rate available from the literature.

Submitted 4 October, 2023; originally announced October 2023.

MSC Class: 62R10; 62G20; 65F22

arXiv:2308.04561 [pdf, other]

Spectral Regularized Kernel Goodness-of-Fit Tests

Authors: Omar Hagrass, Bharath K. Sriperumbudur, Bing Li

Abstract: Maximum mean discrepancy (MMD) has enjoyed a lot of success in many machine learning and statistical applications, including non-parametric hypothesis testing, because of its ability to handle non-Euclidean data. Recently, it has been demonstrated in Balasubramanian et al. (2021) that the goodness-of-fit test based on MMD is not minimax optimal while a Tikhonov regularized version of it is, for an appropriate choice of the regularization parameter. However, the results in Balasubramanian et al. (2021) are obtained under the restrictive assumptions of the mean element being zero, and the uniform boundedness condition on the eigenfunctions of the integral operator. Moreover, the test proposed in Balasubramanian et al. (2021) is not practical as it is not computable for many kernels. In this paper, we address these shortcomings and extend the results to general spectral regularizers that include Tikhonov regularization.

Submitted 8 August, 2023; originally announced August 2023.

Comments: 44 pages. arXiv admin note: text overlap with arXiv:2212.09201

MSC Class: 62G10 (Primary); 65J20; 65J22; 46E22; 47A52 (Secondary)

arXiv:2306.17329 [pdf, other]

Kernel $ε$-Greedy for Contextual Bandits

Authors: Sakshi Arya, Bharath K. Sriperumbudur

Abstract: We consider a kernelized version of the $ε$-greedy strategy for contextual bandits. More precisely, in a setting with finitely many arms, we consider that the mean reward functions lie in a reproducing kernel Hilbert space (RKHS). We propose an online weighted kernel ridge regression estimator for the reward functions. Under some conditions on the exploration probability sequence, $\{ε_t\}_t$, and choice of the regularization parameter, $\{λ_t\}_t$, we show that the proposed estimator is consistent. We also show that for any choice of kernel and the corresponding RKHS, we achieve a sub-linear regret rate depending on the intrinsic dimensionality of the RKHS. Furthermore, we achieve the optimal regret rate of $\sqrt{T}$ under a margin condition for finite-dimensional RKHS.

Submitted 29 June, 2023; originally announced June 2023.

MSC Class: 62L10; 62G05; 68T05

arXiv:2212.12848 [pdf, other]

Gromov-Wasserstein Distances: Entropic Regularization, Duality, and Sample Complexity

Authors: Zhengxin Zhang, Ziv Goldfeld, Youssef Mroueh, Bharath K. Sriperumbudur

Abstract: The Gromov-Wasserstein (GW) distance, rooted in optimal transport (OT) theory, quantifies dissimilarity between metric measure spaces and provides a framework for aligning heterogeneous datasets. While computational aspects of the GW problem have been widely studied, a duality theory and fundamental statistical questions concerning empirical convergence rates remained obscure. This work closes these gaps for the quadratic GW distance over Euclidean spaces of different dimensions $d_x$ and $d_y$. We treat both the standard and the entropically regularized GW distance, and derive dual forms that represent them in terms of the well-understood OT and entropic OT (EOT) problems, respectively. This enables employing proof techniques from statistical OT based on regularity analysis of dual potentials and empirical process theory, using which we establish the first GW empirical convergence rates. The derived two-sample rates are $n^{-2/\max\{\min\{d_x,d_y\},4\}}$ (up to a log factor when $\min\{d_x,d_y\}=4$) for standard GW and $n^{-1/2}$ for EGW, which matches the corresponding rates for standard and entropic OT. The parametric rate for EGW is evidently optimal, while for standard GW we provide matching lower bounds, which establish sharpness of the derived rates. We also study stability of EGW in the entropic regularization parameter and prove approximation and continuity results for the cost and optimal couplings.
Lastly, the duality is leveraged to shed new light on the open problem of the one-dimensional GW distance between uniform distributions on $n$ points, illuminating why the identity and anti-identity permutations may not be optimal. Our results serve as a first step towards a comprehensive statistical theory as well as computational advancements for GW distances, based on the discovered dual formulations.

Submitted 28 September, 2023; v1 submitted 24 December, 2022; originally announced December 2022.

Comments: 47 pages

arXiv:2212.09201 [pdf, other]

Spectral Regularized Kernel Two-Sample Tests

Authors: Omar Hagrass, Bharath K. Sriperumbudur, Bing Li

Abstract: Over the last decade, an approach that has gained a lot of popularity to tackle nonparametric testing problems on general (i.e., non-Euclidean) domains is based on the notion of reproducing kernel Hilbert space (RKHS) embedding of probability distributions. The main goal of our work is to understand the optimality of two-sample tests constructed based on this approach. First, we show the popular MMD (maximum mean discrepancy) two-sample test to be not optimal in terms of the separation boundary measured in Hellinger distance. Second, we propose a modification to the MMD test based on spectral regularization by taking into account the covariance information (which is not captured by the MMD test) and prove the proposed test to be minimax optimal with a smaller separation boundary than that achieved by the MMD test. Third, we propose an adaptive version of the above test which involves a data-driven strategy to choose the regularization parameter and show the adaptive test to be almost minimax optimal up to a logarithmic factor. Moreover, our results hold for the permutation variant of the test where the test threshold is chosen elegantly through the permutation of the samples. Through numerical experiments on synthetic and real data, we demonstrate the superior performance of the proposed test in comparison to the MMD test and other popular tests in the literature.

Submitted 1 May, 2024; v1 submitted 18 December, 2022; originally announced December 2022.

Comments: 75 pages, to be published in the Annals of Statistics

MSC Class: Primary: 62G10; Secondary: 65J20; 65J22; 46E22; 47A52

arXiv:2211.07861 [pdf, other]

Regularized Stein Variational Gradient Flow

Authors: Ye He, Krishnakumar Balasubramanian, Bharath K. Sriperumbudur, Jianfeng Lu

Abstract: The Stein Variational Gradient Descent (SVGD) algorithm is a deterministic particle method for sampling. However, a mean-field analysis reveals that the gradient flow corresponding to the SVGD algorithm (i.e., the Stein Variational Gradient Flow) only provides a constant-order approximation to the Wasserstein Gradient Flow corresponding to the KL-divergence minimization. In this work, we propose the Regularized Stein Variational Gradient Flow, which interpolates between the Stein Variational Gradient Flow and the Wasserstein Gradient Flow. We establish various theoretical properties of the Regularized Stein Variational Gradient Flow (and its time-discretization) including convergence to equilibrium, existence and uniqueness of weak solutions, and stability of the solutions. We provide preliminary numerical evidence of the improved performance offered by the regularization.

Submitted 8 May, 2024; v1 submitted 14 November, 2022; originally announced November 2022.

arXiv:2207.06357 [pdf, ps, other]

Shrinkage Estimation of Higher Order Bochner Integrals

Authors: Saiteja Utpala, Bharath K. Sriperumbudur

Abstract: We consider shrinkage estimation of higher order Hilbert space valued Bochner integrals in a non-parametric setting. We propose estimators that shrink the $U$-statistic estimator of the Bochner integral towards a pre-specified target element in the Hilbert space. Depending on the degeneracy of the kernel of the $U$-statistic, we construct consistent shrinkage estimators with fast rates of convergence, and develop oracle inequalities comparing the risks of the $U$-statistic estimator and its shrinkage version. Surprisingly, we show that the shrinkage estimator designed by assuming complete degeneracy of the kernel of the $U$-statistic is a consistent estimator even when the kernel is not completely degenerate. This work subsumes and improves upon Krikamol et al., 2016, JMLR and Zhou et al., 2019, JMVA, which only handle mean element and covariance operator estimation in a reproducing kernel Hilbert space. We also specialize our results to normal mean estimation and show that for $d\ge 3$, the proposed estimator strictly improves upon the sample mean in terms of the mean squared error.

Submitted 21 July, 2022; v1 submitted 13 July, 2022; originally announced July 2022.

Comments: 33 pages; Under Review

MSC Class: 62G05(Primary); 62F10; 62J07(Secondary)

arXiv:2206.03975 [pdf, other]

Functional linear and single-index models: A unified approach via Gaussian Stein identity

Authors: Krishnakumar Balasubramanian, Hans-Georg Müller, Bharath K. Sriperumbudur

Abstract: Functional linear and single-index models are core regression methods in functional data analysis and are widely used for performing regression in a wide range of applications when the covariates are random functions coupled with scalar responses. In the existing literature, however, the construction of associated estimators and the study of their theoretical properties is invariably carried out on a case-by-case basis for specific models under consideration. In this work, assuming the predictors are Gaussian processes, we provide a unified methodological and theoretical framework for estimating the index in functional linear, and its direction in single-index models. In the latter case, the proposed approach does not require the specification of the link function. In terms of methodology, we show that the reproducing kernel Hilbert space (RKHS) based functional linear least-squares estimator, when viewed through the lens of an infinite-dimensional Gaussian Stein's identity, also provides an estimator of the index of the single-index model. Theoretically, we characterize the convergence rates of the proposed estimators for both linear and single-index models.
Our analysis has several key advantages: (i) it does not require restrictive commutativity assumptions for the covariance operator of the random covariates and the integral operator associated with the reproducing kernel; and (ii) the true index parameter can lie outside of the chosen RKHS, thereby allowing for index misspecification as well as for quantifying the degree of such index misspecification. Several existing results emerge as special cases of our analysis.

Submitted 26 March, 2024; v1 submitted 8 June, 2022; originally announced June 2022.

Comments: To appear in Bernoulli Journal

arXiv:2206.01795 [pdf, other]

Robust Topological Inference in the Presence of Outliers

Authors: Siddharth Vishwanath, Bharath K. Sriperumbudur, Kenji Fukumizu, Satoshi Kuriki

Abstract: The distance function to a compact set plays a crucial role in the paradigm of topological data analysis. In particular, the sublevel sets of the distance function are used in the computation of persistent homology -- a backbone of the topological data analysis pipeline. Despite its stability to perturbations in the Hausdorff distance, persistent homology is highly sensitive to outliers. In this work, we develop a framework of statistical inference for persistent homology in the presence of outliers. Drawing inspiration from recent developments in robust statistics, we propose a $\textit{median-of-means}$ variant of the distance function ($\textsf{MoM Dist}$), and establish its statistical properties. In particular, we show that, even in the presence of outliers, the sublevel filtrations and weighted filtrations induced by $\textsf{MoM Dist}$ are both consistent estimators of the true underlying population counterpart, and their rates of convergence in the bottleneck metric are controlled by the fraction of outliers in the data. Finally, we demonstrate the advantages of the proposed methodology through simulations and applications.

Submitted 3 June, 2022; originally announced June 2022.

Comments: 50 pages, 10 figures

MSC Class: 62R40; 55N31; 68T09

arXiv:2105.08875 [pdf, ps, other]

Statistical Optimality and Computational Efficiency of Nyström Kernel PCA

Authors: Nicholas Sterge, Bharath Sriperumbudur

Abstract: Kernel methods provide an elegant framework for developing nonlinear learning algorithms from simple linear methods. Though these methods have superior empirical performance in several real data applications, their usefulness is inhibited by the significant computational burden incurred in large sample situations. Various approximation schemes have been proposed in the literature to alleviate these computational issues, and the approximate kernel machines are shown to retain the empirical performance. However, the theoretical properties of these approximate kernel machines are less well understood. In this work, we theoretically study the trade-off between computational complexity and statistical accuracy in Nyström approximate kernel principal component analysis (KPCA), wherein we show that the Nyström approximate KPCA matches the statistical performance of (non-approximate) KPCA while remaining computationally beneficial. Additionally, we show that Nyström approximate KPCA outperforms the statistical behavior of another popular approximation scheme, the random feature approximation, when applied to KPCA.

Submitted 18 May, 2021; originally announced May 2021.

Comments: 26 pages

MSC Class: Primary: 65R15; Secondary: 62H25; 46E22; 65F55

arXiv:2010.08071 [pdf, other]

Shrinkage Estimation for the Diagonal Multivariate Exponential Families

Authors: Nikolas Siapoutis, Donald Richards, Bharath K. Sriperumbudur

Abstract: We study shrinkage estimation of the mean parameters of a class of multivariate distributions for which the diagonal entries of the corresponding covariance matrix are certain quadratic functions of the mean parameter. This class of distributions includes the diagonal multivariate natural exponential families. We propose two classes of semi-parametric shrinkage estimators for the mean and construct unbiased estimators of the corresponding risk. We establish the asymptotic consistency and convergence rates for these shrinkage estimators under squared error loss as both $n$, the sample size, and $p$, the dimension, tend to infinity. Next, we specialize these results to the diagonal multivariate natural exponential families, which have been classified as consisting of the normal, Poisson, gamma, multinomial, negative multinomial, and hybrid classes of distributions. We establish the consistency of our estimators in the normal, gamma, and negative multinomial cases subject to the condition that $p n^{-1/3} (\log{n})^{4/3} \to 0$, and in the Poisson and multinomial cases if $p n^{-1/2} \to 0$, as $n,p \to \infty$. Simulation studies are provided to evaluate the performance of our estimators and we illustrate that, in the gamma and Poisson cases, our estimators achieve lower risk than the maximum likelihood estimator, thereby demonstrating the superiority of our estimators over the maximum likelihood estimator.

Submitted 1 July, 2022; v1 submitted 15 October, 2020; originally announced October 2020.

Comments: 36 pages, 2 figures

MSC Class: 62F12; 62H05 (Primary) 62J07; 62G05 (Secondary)

arXiv:2006.10012 [pdf, other]

Robust Persistence Diagrams using Reproducing Kernels

Authors: Siddharth Vishwanath, Kenji Fukumizu, Satoshi Kuriki, Bharath Sriperumbudur

Abstract: Persistent homology has become an important tool for extracting geometric and topological features from data, whose multi-scale features are summarized in a persistence diagram. From a statistical perspective, however, persistence diagrams are very sensitive to perturbations in the input space. In this work, we develop a framework for constructing robust persistence diagrams from superlevel filtrations of robust density estimators constructed using reproducing kernels. Using an analogue of the influence function on the space of persistence diagrams, we establish the proposed framework to be less sensitive to outliers. The robust persistence diagrams are shown to be consistent estimators in bottleneck distance, with the convergence rate controlled by the smoothness of the kernel. This, in turn, allows us to construct uniform confidence bands in the space of persistence diagrams. Finally, we demonstrate the superiority of the proposed approach on benchmark datasets.

Submitted 3 June, 2022; v1 submitted 17 June, 2020; originally announced June 2020.

MSC Class: 55N31; 62R40; 62G07; 46E22

arXiv:2001.00220 [pdf, other]

On the Limits of Topological Data Analysis for Statistical Inference

Authors: Siddharth Vishwanath, Kenji Fukumizu, Satoshi Kuriki, Bharath Sriperumbudur

Abstract: Topological data analysis has emerged as a powerful tool for extracting the metric, geometric and topological features underlying the data as a multi-resolution summary statistic, and has found applications in several areas where data arises from complex sources. In this paper, we examine the use of topological summary statistics through the lens of statistical inference. We investigate necessary and sufficient conditions under which valid statistical inference is possible using topological summary statistics. Additionally, we provide examples of models that demonstrate invariance with respect to topological summaries.

Submitted 15 February, 2024; v1 submitted 1 January, 2020; originally announced January 2020.

Comments: 36 pages, 9 figures

MSC Class: 62F30; 55N31; 62R40

arXiv:1912.01103 [pdf, ps, other]

On Distance and Kernel Measures of Conditional Independence

Authors: Tianhong Sheng, Bharath K. Sriperumbudur

Abstract: Measuring conditional independence is one of the important tasks in statistical inference and is fundamental in causal discovery, feature selection, dimensionality reduction, Bayesian network learning, and others. In this work, we explore the connection between conditional independence measures induced by distances on a metric space and reproducing kernels associated with a reproducing kernel Hilbert space (RKHS). For certain distance and kernel pairs, we show the distance-based conditional independence measures to be equivalent to that of kernel-based measures. On the other hand, we also show that some kernel conditional independence measures popular in machine learning, based on the Hilbert-Schmidt norm of a certain cross-conditional covariance operator, do not have a simple distance representation, except in some limiting cases. This paper, therefore, shows the distance and kernel measures of conditional independence to be not quite equivalent, unlike in the case of joint independence as shown by Sejdinovic et al. (2013).

Submitted 17 August, 2020; v1 submitted 2 December, 2019; originally announced December 2019.

arXiv:1908.05818 [pdf, other]

Gaussian Sketching yields a J-L Lemma in RKHS

Authors: Samory Kpotufe, Bharath K. Sriperumbudur

Abstract: The main contribution of the paper is to show that Gaussian sketching of a kernel-Gram matrix $\boldsymbol K$ yields an operator whose counterpart in an RKHS $\mathcal H$ is a random projection operator, in the spirit of the Johnson-Lindenstrauss (J-L) lemma. To be precise, given a random matrix $Z$ with i.i.d. Gaussian entries, we show that a sketch $Z\boldsymbol{K}$ corresponds to a particular random operator in (infinite-dimensional) Hilbert space $\mathcal H$ that maps functions $f \in \mathcal H$ to a low-dimensional space $\mathbb R^d$, while preserving a weighted RKHS inner-product of the form $\langle f, g \rangle_Σ \doteq \langle f, Σ^3 g \rangle_{\mathcal H}$, where $Σ$ is the covariance operator induced by the data distribution. In particular, under similar assumptions as in kernel PCA (KPCA), or kernel $k$-means (K-$k$-means), well-separated subsets of feature-space $\{K(\cdot, x): x \in \cal X\}$ remain well-separated after such operation, which suggests similar benefits as in KPCA and/or K-$k$-means, albeit at the much cheaper cost of a random projection. In particular, our convergence rates suggest that, given a large dataset $\{X_i\}_{i=1}^N$ of size $N$, we can build the Gram matrix $\boldsymbol K$ on a much smaller subsample of size $n\ll N$, so that the sketch $Z\boldsymbol K$ is very cheap to obtain and subsequently apply as a projection operator on the original data $\{X_i\}_{i=1}^N$. We verify these insights empirically on synthetic data, and on real-world clustering applications.

Submitted 11 March, 2020; v1 submitted 15 August, 2019; originally announced August 2019.

Comments: 16 pages

arXiv:1907.05226 [pdf, other]

Gain with no Pain: Efficient Kernel-PCA by Nyström Sampling

Authors: Nicholas Sterge, Bharath Sriperumbudur, Lorenzo Rosasco, Alessandro Rudi

Abstract: In this paper, we propose and study a Nyström based approach to efficient large scale kernel principal component analysis (PCA). The latter is a natural nonlinear extension of classical PCA based on considering a nonlinear feature map or the corresponding kernel. Like other kernel approaches, kernel PCA enjoys good mathematical and statistical properties but, numerically, it scales poorly with the sample size. Our analysis shows that Nyström sampling greatly improves computational efficiency without incurring any loss of statistical accuracy. While similar effects have been observed in supervised learning, this is the first such result for PCA. Our theoretical findings, which are also illustrated by numerical results, are based on a combination of analytic and concentration of measure techniques. Our study is more broadly motivated by the question of understanding the interplay between statistical and computational requirements for learning.

Submitted 11 July, 2019; originally announced July 2019.

Comments: 19 pages, 2 figures

MSC Class: 62H25; 62H12; 46E22

arXiv:1902.07284 [pdf, other]

Optimal Function-on-Scalar Regression over Complex Domains

Authors: Matthew Reimherr, Bharath Sriperumbudur, Hyun Bin Kang

Abstract: In this work we consider the problem of estimating function-on-scalar regression models when the functions are observed over multi-dimensional or manifold domains and with potentially multivariate output. We establish the minimax rates of convergence and present an estimator based on reproducing kernel Hilbert spaces that achieves the minimax rate. To better interpret the derived rates, we extend well-known links between RKHS and Sobolev spaces to the case where the domain is a compact Riemannian manifold. This is accomplished using an interesting connection to Weyl's Law from partial differential equations. We conclude with a numerical study and an application to 3D facial imaging.

Submitted 19 February, 2019; originally announced February 2019.

arXiv:1902.01219 [pdf, ps, other]

Local minimax rates for closeness testing of discrete distributions

Authors: Joseph Lam-Weil, Alexandra Carpentier, Bharath K. Sriperumbudur

Abstract: We consider the closeness testing problem for discrete distributions. The goal is to distinguish whether two samples are drawn from the same unspecified distribution, or whether their respective distributions are separated in $L_1$-norm. In this paper, we focus on adapting the rate to the shape of the underlying distributions, i.e. we consider \textit{a local minimax setting}. We provide, to the best of our knowledge, the first local minimax rate for the separation distance up to logarithmic factors, together with a test that achieves it. In view of the rate, closeness testing turns out to be substantially harder than the related one-sample testing problem over a wide range of cases.

Submitted 19 January, 2021; v1 submitted 1 February, 2019; originally announced February 2019.

MSC Class: 62F03; 62G10; 62F35 ACM Class: G.3; I.2.6

arXiv:1810.05207 [pdf, ps, other]

On Kernel Derivative Approximation with Random Fourier Features

Authors: Zoltan Szabo, Bharath K. Sriperumbudur

Abstract: Random Fourier features (RFF) represent one of the most popular and wide-spread techniques in machine learning to scale up kernel algorithms. Despite the numerous successful applications of RFFs, unfortunately, quite little is understood theoretically on their optimality and limitations of their performance. Only recently, precise statistical-computational trade-offs have been established for RFFs in the approximation of kernel values, kernel ridge regression, kernel PCA and SVM classification. Our goal is to spark the investigation of optimality of RFF-based approximations in tasks involving not only function values but derivatives, which naturally lead to optimization problems with kernel derivatives. Particularly, in this paper, we focus on the approximation quality of RFFs for kernel derivatives and prove that the existing finite-sample guarantees can be improved exponentially in terms of the domain where they hold, using recent tools from unbounded empirical process theory. Our result implies that the same approximation guarantee is attainable for kernel derivatives using RFF as achieved for kernel values.

Submitted 9 February, 2019; v1 submitted 11 October, 2018; originally announced October 2018.

Comments: AISTATS-2019

MSC Class: 60E10; 42Bxx; 46E22 ACM Class: G.3; I.2.6

arXiv:1803.11451 [pdf, ps, other]

Minimax Estimation of Quadratic Fourier Functionals

Authors: Shashank Singh, Bharath K. Sriperumbudur, Barnabás Póczos

Abstract: We study estimation of (semi-)inner products between two nonparametric probability distributions, given IID samples from each distribution. These products include relatively well-studied classical $\mathcal{L}^2$ and Sobolev inner products, as well as those induced by translation-invariant reproducing kernels, for which we believe our results are the first. We first propose estimators for these quantities, and the induced (semi)norms and (pseudo)metrics. We then prove non-asymptotic upper bounds on their mean squared error, in terms of weights both of the inner product and of the two distributions, in the Fourier basis. Finally, we prove minimax lower bounds that imply rate-optimality of the proposed estimators over Fourier ellipsoids.

Submitted 1 September, 2018; v1 submitted 30 March, 2018; originally announced March 2018.

arXiv:1709.00147 [pdf, other]

Convergence Analysis of Deterministic Kernel-Based Quadrature Rules in Misspecified Settings

Authors: Motonobu Kanagawa, Bharath K. Sriperumbudur, Kenji Fukumizu

Abstract: This paper presents a convergence analysis of kernel-based quadrature rules in misspecified settings, focusing on deterministic quadrature in Sobolev spaces. In particular, we deal with misspecified settings where a test integrand is less smooth than a Sobolev RKHS based on which a quadrature rule is constructed. We provide convergence guarantees based on two different assumptions on a quadrature rule: one on quadrature weights, and the other on design points. More precisely, we show that convergence rates can be derived (i) if the sum of absolute weights remains constant (or does not increase quickly), or (ii) if the minimum distance between design points does not decrease very quickly. As a consequence of the latter result, we derive a rate of convergence for Bayesian quadrature in misspecified settings. We reveal a condition on design points to make Bayesian quadrature robust to misspecification, and show that, under this condition, it may adaptively achieve the optimal rate of convergence in the Sobolev space of a lesser order (i.e., of the unknown smoothness of a test integrand), under a slightly stronger regularity condition on the integrand.

Submitted 30 October, 2018; v1 submitted 1 September, 2017; originally announced September 2017.

Comments: 36 pages

MSC Class: 65D30 (Primary); 65D32; 65D05; 46E35; 46E22 (Secondary)

arXiv:1708.03372 [pdf, other]

Optimal Prediction for Additive Function-on-Function Regression

Authors: Matthew Reimherr, Bharath Sriperumbudur, Bahaeddine Taoufik

Abstract: As with classic statistics, functional regression models are invaluable in the analysis of functional data. While there are now extensive tools with accompanying theory available for linear models, there is still a great deal of work to be done concerning nonlinear models for functional data. In this work we consider the Additive Function-on-Function Regression model, a type of nonlinear model that uses an additive relationship between the functional outcome and functional covariate. We present an estimation methodology built upon Reproducing Kernel Hilbert Spaces, and establish optimal rates of convergence for our estimates in terms of prediction error. We also discuss computational challenges that arise with such complex models, developing a representer theorem for our estimate as well as a more practical and computationally efficient approximation. Simulations and an application to Cumulative Intraday Returns around the 2008 financial crisis are also provided.

Submitted 22 June, 2018; v1 submitted 10 August, 2017; originally announced August 2017.

arXiv:1706.06296 [pdf, ps, other]

Approximate Kernel PCA Using Random Features: Computational vs. Statistical Trade-off

Authors: Bharath Sriperumbudur, Nicholas Sterge

Abstract: Kernel methods are powerful learning methodologies that allow to perform non-linear data analysis. Despite their popularity, they suffer from poor scalability in big data scenarios. Various approximation methods, including random feature approximation, have been proposed to alleviate the problem. However, the statistical consistency of most of these approximate kernel methods is not well understood except for kernel ridge regression wherein it has been shown that the random feature approximation is not only computationally efficient but also statistically consistent with a minimax optimal rate of convergence. In this paper, we investigate the efficacy of random feature approximation in the context of kernel principal component analysis (KPCA) by studying the trade-off between computational and statistical behaviors of approximate KPCA. We show that the approximate KPCA is both computationally and statistically efficient compared to KPCA in terms of the error associated with reconstructing a kernel function based on its projection onto the corresponding eigenspaces. The analysis hinges on Bernstein-type inequalities for the operator and Hilbert-Schmidt norms of a self-adjoint Hilbert-Schmidt operator-valued U-statistics, which are of independent interest.

Submitted 11 June, 2022; v1 submitted 20 June, 2017; originally announced June 2017.

Comments: 65 pages

MSC Class: 62H25; 62G05

arXiv:1602.04361 [pdf, ps, other]

Minimax Estimation of Kernel Mean Embeddings

Authors: Ilya Tolstikhin, Bharath Sriperumbudur, Krikamol Muandet

Abstract: In this paper, we study the minimax estimation of the Bochner integral $$μ_k(P):=\int_{\mathcal{X}} k(\cdot,x)\,dP(x),$$ also called as the kernel mean embedding, based on random samples drawn i.i.d.~from $P$, where $k:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ is a positive definite kernel. Various estimators (including the empirical estimator), $\hatθ_n$ of $μ_k(P)$ are studied in the literature wherein all of them satisfy $\bigl\| \hatθ_n-μ_k(P)\bigr\|_{\mathcal{H}_k}=O_P(n^{-1/2})$ with $\mathcal{H}_k$ being the reproducing kernel Hilbert space induced by $k$. The main contribution of the paper is in showing that the above mentioned rate of $n^{-1/2}$ is minimax in $\|\cdot\|_{\mathcal{H}_k}$ and $\|\cdot\|_{L^2(\mathbb{R}^d)}$-norms over the class of discrete measures and the class of measures that has an infinitely differentiable density, with $k$ being a continuous translation-invariant kernel on $\mathbb{R}^d$. The interesting aspect of this result is that the minimax rate is independent of the smoothness of the kernel and the density of $P$ (if it exists). This result has practical consequences in statistical applications as the mean embedding has been widely employed in non-parametric hypothesis testing, density estimation, causal inference and feature selection, through its relation to energy distance (and distance covariance).

Submitted 31 July, 2017; v1 submitted 13 February, 2016; originally announced February 2016.

MSC Class: 62G05; 62G07

arXiv:1506.02155 [pdf, ps, other]

Optimal Rates for Random Fourier Features

Authors: Bharath K. Sriperumbudur, Zoltan Szabo

Abstract: Kernel methods represent one of the most powerful tools in machine learning to tackle problems expressed in terms of function values and derivatives due to their capability to represent and model complex relations. While these methods show good versatility, they are computationally intensive and have poor scalability to large data as they require operations on Gram matrices. In order to mitigate this serious computational limitation, recently randomized constructions have been proposed in the literature, which allow the application of fast linear algorithms. Random Fourier features (RFF) are among the most popular and widely applied constructions: they provide an easily computable, low-dimensional feature representation for shift-invariant kernels. Despite the popularity of RFFs, very little is understood theoretically about their approximation quality. In this paper, we provide a detailed finite-sample theoretical analysis about the approximation quality of RFFs by (i) establishing optimal (in terms of the RFF dimension, and growing set size) performance guarantees in uniform norm, and (ii) presenting guarantees in $L^r$ ($1\le r<\infty$) norms. We also propose an RFF approximation to derivatives of a kernel with a theoretical study on its approximation quality.

Submitted 4 November, 2015; v1 submitted 6 June, 2015; originally announced June 2015.

Comments: To appear at NIPS-2015

MSC Class: 60E10; 62Gxx; 62Exx; 62H12; 42Bxx; 46E22 ACM Class: G.3; I.2.6; F.2
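
As an illustration of the construction this abstract refers to, the following is a minimal sketch of random Fourier features for a Gaussian kernel, assuming NumPy. The function names (`rff_features`, `gaussian_kernel`), the bandwidth `sigma`, and the sample sizes are hypothetical choices made for this example and are not taken from the paper. The sketch only measures the largest entrywise approximation error on a small sample; it does not reproduce the paper's bounds.

```python
import numpy as np

def rff_features(X, n_features, sigma, rng):
    """Random Fourier features for k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).

    Illustrative helper: names and defaults are assumptions of this sketch.
    """
    d = X.shape[1]
    # Frequencies drawn from the kernel's spectral measure, N(0, sigma^-2 I).
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))
    # Random phases, uniform on [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def gaussian_kernel(X, Y, sigma):
    """Exact Gaussian kernel matrix between the rows of X and Y."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Z = rff_features(X, n_features=5000, sigma=1.0, rng=rng)

# Inner products of the features approximate the kernel values.
max_err = float(np.max(np.abs(Z @ Z.T - gaussian_kernel(X, X, sigma=1.0))))
print(f"max entrywise error with 5000 features: {max_err:.4f}")
```

Increasing `n_features` should shrink `max_err`, at the cost of a wider feature matrix; that trade-off is the quantity the abstract's guarantees are about.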

arXiv:1411.2066 [pdf, ps, other]

Learning Theory for Distribution Regression

Authors: Zoltan Szabo, Bharath Sriperumbudur, Barnabas Poczos, Arthur Gretton

Abstract: We focus on the distribution regression problem: regressing to vector-valued outputs from probability measures. Many important machine learning and statistical tasks fit into this framework, including multi-instance learning and point estimation problems without analytical solution (such as hyperparameter or entropy estimation). Despite the large number of available heuristics in the literature, the inherent two-stage sampled nature of the problem makes the theoretical analysis quite challenging, since in practice only samples from sampled distributions are observable, and the estimates have to rely on similarities computed between sets of points. To the best of our knowledge, the only existing technique with consistency guarantees for distribution regression requires kernel density estimation as an intermediate step (which often performs poorly in practice), and the domain of the distributions to be compact Euclidean. In this paper, we study a simple, analytically computable, ridge regression-based alternative to distribution regression, where we embed the distributions to a reproducing kernel Hilbert space, and learn the regressor from the embeddings to the outputs. Our main contribution is to prove that this scheme is consistent in the two-stage sampled setup under mild conditions (on separable topological domains enriched with kernels): we present an exact computational-statistical efficiency trade-off analysis showing that our estimator is able to match the one-stage sampled minimax optimal rate [Caponnetto and De Vito, 2007; Steinwart et al., 2009]. This result answers a 17-year-old open question, establishing the consistency of the classical set kernel [Haussler, 1999; Gaertner et. al, 2002] in regression. We also cover consistency for more recent kernels on distributions, including those due to [Christmann and Steinwart, 2010].

Submitted 21 October, 2016; v1 submitted 7 November, 2014; originally announced November 2014.

Comments: Final version appeared at JMLR, with supplement. Code: https://bitbucket.org/szzoli/ite/. arXiv admin note: text overlap with arXiv:1402.1754

MSC Class: 62G08; 46E22; 47B32 ACM Class: G.3; I.2.6

Journal ref: Journal of Machine Learning Research, 17(152):1-40, 2016
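
The two-stage scheme described in this abstract (embed each sampled distribution, then fit a ridge regressor on the embeddings) can be sketched roughly as follows, assuming NumPy. This is not the authors' code (their implementation is linked in the Comments field above). The helper names, the Gaussian base kernel, the bandwidth, the regularization `lam`, and the toy task of predicting each bag's population mean are all assumptions made for this example.

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def set_kernel(bags_a, bags_b, sigma):
    """K[i, j] = inner product of the empirical mean embeddings of bag i and bag j,
    i.e. the average of k(x, y) over x in bag i and y in bag j."""
    return np.array([[gaussian_gram(a, b, sigma).mean() for b in bags_b] for a in bags_a])

def fit_ridge(bags, y, sigma, lam):
    """Kernel ridge regression coefficients on the embedded bags."""
    K = set_kernel(bags, bags, sigma)
    n = len(bags)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict(train_bags, alpha, test_bags, sigma):
    return set_kernel(test_bags, train_bags, sigma) @ alpha

rng = np.random.default_rng(0)

def make_bags(means, bag_size=100):
    # Second sampling stage: each distribution N(m, 1) is seen only through a bag.
    return [rng.normal(loc=m, size=(bag_size, 1)) for m in means]

# Toy task (an assumption of this sketch): recover the mean of each distribution.
train_means = rng.uniform(-2.0, 2.0, size=60)
test_means = rng.uniform(-2.0, 2.0, size=20)
train_bags, test_bags = make_bags(train_means), make_bags(test_means)

alpha = fit_ridge(train_bags, train_means, sigma=1.0, lam=1e-3)
pred = predict(train_bags, alpha, test_bags, sigma=1.0)

rmse = float(np.sqrt(np.mean((pred - test_means) ** 2)))
baseline = float(np.sqrt(np.mean((train_means.mean() - test_means) ** 2)))
print(f"ridge RMSE: {rmse:.3f}, constant-predictor RMSE: {baseline:.3f}")
```

The linear kernel on mean embeddings used here corresponds to the set kernel mentioned in the abstract; a nonlinear kernel on the embeddings would be a drop-in replacement for `set_kernel`.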

arXiv:1411.0900 [pdf, ps, other]

Kernel Mean Estimation via Spectral Filtering

Authors: Krikamol Muandet, Bharath Sriperumbudur, Bernhard Schölkopf

Abstract: The problem of estimating the kernel mean in a reproducing kernel Hilbert space (RKHS) is central to kernel methods in that it is used by classical approaches (e.g., when centering a kernel PCA matrix), and it also forms the core inference step of modern kernel methods (e.g., kernel-based non-parametric tests) that rely on embedding probability distributions in RKHSs. Muandet et al. (2014) has shown that shrinkage can help in constructing "better" estimators of the kernel mean than the empirical estimator. The present paper studies the consistency and admissibility of the estimators in Muandet et al. (2014), and proposes a wider class of shrinkage estimators that improve upon the empirical estimator by considering appropriate basis functions. Using the kernel PCA basis, we show that some of these estimators can be constructed using spectral filtering algorithms which are shown to be consistent under some technical assumptions. Our theoretical analysis also reveals a fundamental connection to the kernel-based supervised learning framework. The proposed estimators are simple to implement and perform well in practice.

Submitted 4 November, 2014; originally announced November 2014.

Comments: To appear at the 28th Annual Conference on Neural Information Processing Systems (NIPS 2014). 16 pages

arXiv:1402.1754 [pdf, ps, other]

Two-stage Sampled Learning Theory on Distributions

Authors: Zoltan Szabo, Arthur Gretton, Barnabas Poczos, Bharath Sriperumbudur

Abstract: We focus on the distribution regression problem: regressing to a real-valued response from a probability distribution. Although there exist a large number of similarity measures between distributions, very little is known about their generalization performance in specific learning tasks. Learning problems formulated on distributions have an inherent two-stage sampled difficulty: in practice only samples from sampled distributions are observable, and one has to build an estimate on similarities computed between sets of points. To the best of our knowledge, the only existing method with consistency guarantees for distribution regression requires kernel density estimation as an intermediate step (which suffers from slow convergence issues in high dimensions), and the domain of the distributions to be compact Euclidean. In this paper, we provide theoretical guarantees for a remarkably simple algorithmic alternative to solve the distribution regression problem: embed the distributions to a reproducing kernel Hilbert space, and learn a ridge regressor from the embeddings to the outputs. Our main contribution is to prove the consistency of this technique in the two-stage sampled setting under mild conditions (on separable, topological domains endowed with kernels). For a given total number of observations, we derive convergence rates as an explicit function of the problem difficulty. As a special case, we answer a 15-year-old open question: we establish the consistency of the classical set kernel [Haussler, 1999; Gartner et. al, 2002] in regression, and cover more recent kernels on distributions, including those due to [Christmann and Steinwart, 2010].

Submitted 26 January, 2015; v1 submitted 7 February, 2014; originally announced February 2014.

Comments: v6: accepted at AISTATS-2015 for oral presentation; final version; code: https://bitbucket.org/szzoli/ite/; extension to the misspecified and vector-valued case: http://arxiv.longhoe.net/abs/1411.2066

MSC Class: 62G08; 46E22; 47B32 ACM Class: G.3; I.2.6

arXiv:1312.3516 [pdf, ps, other]

Density Estimation in Infinite Dimensional Exponential Families

Authors: Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, Revant Kumar

Abstract: In this paper, we consider an infinite dimensional exponential family, $\mathcal{P}$ of probability densities, which are parametrized by functions in a reproducing kernel Hilbert space, $H$ and show it to be quite rich in the sense that a broad class of densities on $\mathbb{R}^d$ can be approximated arbitrarily well in Kullback-Leibler (KL) divergence by elements in $\mathcal{P}$. The main goal of the paper is to estimate an unknown density, $p_0$ through an element in $\mathcal{P}$. Standard techniques like maximum likelihood estimation (MLE) or pseudo MLE (based on the method of sieves), which are based on minimizing the KL divergence between $p_0$ and $\mathcal{P}$, do not yield practically useful estimators because of their inability to efficiently handle the log-partition function. Instead, we propose an estimator, $\hat{p}_n$ based on minimizing the \emph{Fisher divergence}, $J(p_0\Vert p)$ between $p_0$ and $p\in \mathcal{P}$, which involves solving a simple finite-dimensional linear system. When $p_0\in\mathcal{P}$, we show that the proposed estimator is consistent, and provide a convergence rate of $n^{-\min\left\{\frac{2}{3},\frac{2β+1}{2β+2}\right\}}$ in Fisher divergence under the smoothness assumption that $\log p_0\in\mathcal{R}(C^β)$ for some $β\ge 0$, where $C$ is a certain Hilbert-Schmidt operator on $H$ and $\mathcal{R}(C^β)$ denotes the image of $C^β$. We also investigate the misspecified case of $p_0\notin\mathcal{P}$ and show that $J(p_0\Vert\hat{p}_n)\rightarrow \inf_{p\in\mathcal{P}}J(p_0\Vert p)$ as $n\rightarrow\infty$, and provide a rate for this convergence under a similar smoothness condition as above. Through numerical simulations we demonstrate that the proposed estimator outperforms the non-parametric kernel density estimator, and that the advantage with the proposed estimator grows as $d$ increases.

Submitted 26 May, 2017; v1 submitted 12 December, 2013; originally announced December 2013.

Comments: 58 pages, 8 figures; Fixed some errors and typos

arXiv:1310.8240 [pdf, ps, other]

doi 10.3150/15-BEJ713

On the optimal estimation of probability measures in weak and strong topologies

Authors: Bharath Sriperumbudur

Abstract: Given random samples drawn i.i.d. from a probability measure $\mathbb{P}$ (defined on say, $\mathbb{R}^d$), it is well-known that the empirical estimator is an optimal estimator of $\mathbb{P}$ in weak topology but not even a consistent estimator of its density (if it exists) in the strong topology (induced by the total variation distance). On the other hand, various popular density estimators such as kernel and wavelet density estimators are optimal in the strong topology in the sense of achieving the minimax rate over all estimators for a Sobolev ball of densities. Recently, it has been shown in a series of papers by Giné and Nickl that these density estimators on $\mathbb{R}$ that are optimal in strong topology are also optimal in $\|\cdot\|_{\mathcal{F}}$ for certain choices of $\mathcal{F}$ such that $\|\cdot\|_{\mathcal{F}}$ metrizes the weak topology, where $\|\mathbb{P}\|_{\mathcal{F}}:=\sup\{\int f\,\mathrm{d}\mathbb{P}: f\in\mathcal{F}\}$. In this paper, we investigate this problem of optimal estimation in weak and strong topologies by choosing $\mathcal{F}$ to be a unit ball in a reproducing kernel Hilbert space (say $\mathcal{F}_H$ defined over $\mathbb{R}^d$), where this choice is both of theoretical and computational interest. Under some mild conditions on the reproducing kernel, we show that $\|\cdot\|_{\mathcal{F}_H}$ metrizes the weak topology and the kernel density estimator (with $L^1$ optimal bandwidth) estimates $\mathbb{P}$ at dimension independent optimal rate of $n^{-1/2}$ in $\|\cdot\|_{\mathcal{F}_H}$ along with providing a uniform central limit theorem for the kernel density estimator.

Submitted 30 March, 2016; v1 submitted 30 October, 2013; originally announced October 2013.

Comments: Published at http://dx.doi.org/10.3150/15-BEJ713 in the Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm)

Report number: IMS-BEJ-BEJ713

Journal ref: Bernoulli 2016, Vol. 22, No. 3, 1839-1893

arXiv:1306.0842 [pdf, ps, other]

Kernel Mean Estimation and Stein's Effect

Authors: Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Arthur Gretton, Bernhard Schölkopf

Abstract: A mean function in reproducing kernel Hilbert space, or a kernel mean, is an important part of many applications ranging from kernel principal component analysis to Hilbert-space embedding of distributions. Given finite samples, an empirical average is the standard estimate for the true kernel mean. We show that this estimator can be improved via a well-known phenomenon in statistics called Stein's phenomenon. After consideration, our theoretical analysis reveals the existence of a wide class of estimators that are better than the standard. Focusing on a subset of this class, we propose efficient shrinkage estimators for the kernel mean. Empirical evaluations on several benchmark applications clearly demonstrate that the proposed estimators outperform the standard kernel mean estimator.

Submitted 6 June, 2013; v1 submitted 4 June, 2013; originally announced June 2013.

Comments: first draft

arXiv:1207.6076 [pdf, ps, other]

doi 10.1214/13-AOS1140

Equivalence of distance-based and RKHS-based statistics in hypothesis testing

Authors: Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, Kenji Fukumizu

Abstract: We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on the one hand, the energy distances and distance covariances from the statistics literature; on the other, maximum mean discrepancies (MMD), that is, distances between embeddings of distributions to reproducing kernel Hilbert spaces (RKHS), as established in machine learning. In the case where the energy distance is computed with a semimetric of negative type, a positive definite kernel, termed distance kernel, may be defined such that the MMD corresponds exactly to the energy distance. Conversely, for any positive definite kernel, we can interpret the MMD as energy distance with respect to some negative-type semimetric. This equivalence readily extends to distance covariance using kernels on the product space. We determine the class of probability distributions for which the test statistics are consistent against all alternatives. Finally, we investigate the performance of the family of distance kernels in two-sample and independence tests: we show in particular that the energy distance most commonly employed in statistics is just one member of a parametric family of kernels, and that other choices from this family can yield more powerful tests.

Submitted 12 November, 2013; v1 submitted 25 July, 2012; originally announced July 2012.

Comments: Published at http://dx.doi.org/10.1214/13-AOS1140 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS1140

Journal ref: Annals of Statistics 2013, Vol. 41, No. 5, 2263-2291

arXiv:1003.0887 [pdf, ps, other]

Universality, Characteristic Kernels and RKHS Embedding of Measures

Authors: Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R. G. Lanckriet

Abstract: A Hilbert space embedding for probability measures has recently been proposed, wherein any probability measure is represented as a mean element in a reproducing kernel Hilbert space (RKHS). Such an embedding has found applications in homogeneity testing, independence testing, dimensionality reduction, etc., with the requirement that the reproducing kernel is characteristic, i.e., the embedding is injective. In this paper, we generalize this embedding to finite signed Borel measures, wherein any finite signed Borel measure is represented as a mean element in an RKHS. We show that the proposed embedding is injective if and only if the kernel is universal. This, therefore, provides a novel characterization of universal kernels, which are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms. By exploiting this relation between universality and the embedding of finite signed Borel measures into an RKHS, we establish the relation between universal and characteristic kernels.

Submitted 3 March, 2010; originally announced March 2010.

Comments: 30 pages, 1 figure

arXiv:0907.5309 [pdf, ps, other]

Hilbert space embeddings and metrics on probability measures

Authors: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R. G. Lanckriet

Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as $γ_k$, indexed by the kernel function $k$ that defines the inner product in the RKHS. We present three theoretical properties of $γ_k$. First, we consider the question of determining the conditions on the kernel $k$ for which $γ_k$ is a metric: such $k$ are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g. on compact domains), and are difficult to check, our conditions are straightforward and intuitive: bounded continuous strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on $\mathbb{R}^d$, then it is characteristic if and only if the support of its Fourier transform is the entire $\mathbb{R}^d$. Second, we show that there exist distinct distributions that are arbitrarily close in $γ_k$. Third, to understand the nature of the topology induced by $γ_k$, we relate $γ_k$ to other popular metrics on probability measures, and present conditions on the kernel $k$ under which $γ_k$ metrizes the weak topology.

Submitted 29 January, 2010; v1 submitted 30 July, 2009; originally announced July 2009.

Comments: 48 pages

Showing 1–37 of 37 results for author: Sriperumbudur, B