
High-probability minimax lower bounds

Authors: Tianyi Ma, Kabir A. Verchand, Richard J. Samworth

Abstract: The minimax risk is often considered as a gold standard against which we can compare specific statistical procedures. Nevertheless, as has been observed recently in robust and heavy-tailed estimation problems, the inherent reduction of the (random) loss to its expectation may entail a significant loss of information regarding its tail behaviour. In an attempt to avoid such a loss, we introduce the notion of a minimax quantile, and seek to articulate its dependence on the quantile level. To this end, we develop high-probability variants of the classical Le Cam and Fano methods, as well as a technique to convert local minimax risk lower bounds to lower bounds on minimax quantiles. To illustrate the power of our framework, we deploy our techniques on several examples, recovering recent results in robust mean estimation and stochastic convex optimisation, as well as obtaining several new results in covariance matrix estimation, sparse linear regression, nonparametric density estimation and isotonic regression. Our overall goal is to argue that minimax quantiles can provide a finer-grained understanding of the difficulty of statistical problems, and that, in wide generality, lower bounds on these quantities can be obtained via user-friendly tools.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 37 pages, 3 figures

MSC Class: 62C20; 62B10

arXiv:2403.16688 [pdf, other]

Optimal convex $M$-estimation via score matching

Authors: Oliver Y. Feng, Yu-Chun Kao, Min Xu, Richard J. Samworth

Abstract: In the context of linear regression, we construct a data-driven convex loss function with respect to which empirical risk minimisation yields optimal asymptotic variance in the downstream estimation of the regression coefficients. Our semiparametric approach targets the best decreasing approximation of the derivative of the log-density of the noise distribution. At the population level, this fitting process is a nonparametric extension of score matching, corresponding to a log-concave projection of the noise distribution with respect to the Fisher divergence. The procedure is computationally efficient, and we prove that our procedure attains the minimal asymptotic covariance among all convex $M$-estimators. As an example of a non-log-concave setting, for Cauchy errors, the optimal convex loss function is Huber-like, and our procedure yields an asymptotic efficiency greater than 0.87 relative to the oracle maximum likelihood estimator of the regression coefficients that uses knowledge of this error distribution; in this sense, we obtain robustness without sacrificing much efficiency. Numerical experiments confirm the practical merits of our proposal.

Submitted 25 March, 2024; originally announced March 2024.

Comments: 69 pages, 12 figures and 4 tables

arXiv:2305.04852 [pdf, other]

Isotonic subgroup selection

Authors: Manuel M. Müller, Henry W. J. Reeve, Timothy I. Cannings, Richard J. Samworth

Abstract: Given a sample of covariate-response pairs, we consider the subgroup selection problem of identifying a subset of the covariate domain where the regression function exceeds a pre-determined threshold. We introduce a computationally-feasible approach for subgroup selection in the context of multivariate isotonic regression based on martingale tests and multiple testing procedures for logically-structured hypotheses. Our proposed procedure satisfies a non-asymptotic, uniform Type I error rate guarantee with power that attains the minimax optimal rate up to poly-logarithmic factors. Extensions cover classification, isotonic quantile regression and heterogeneous treatment effect settings. Numerical studies on both simulated and real data confirm the practical effectiveness of our proposal, which is implemented in the R package ISS.

Submitted 28 June, 2023; v1 submitted 8 May, 2023; originally announced May 2023.

Comments: 69 pages, 20 figures

MSC Class: 62G08; 62H15

arXiv:2304.09154 [pdf, other]

Sharp-SSL: Selective high-dimensional axis-aligned random projections for semi-supervised learning

Authors: Tengyao Wang, Edgar Dobriban, Milana Gataric, Richard J. Samworth

Abstract: We propose a new method for high-dimensional semi-supervised learning problems based on the careful aggregation of the results of a low-dimensional procedure applied to many axis-aligned random projections of the data. Our primary goal is to identify important variables for distinguishing between the classes; existing low-dimensional methods can then be applied for final class assignment. Motivated by a generalized Rayleigh quotient, we score projections according to the traces of the estimated whitened between-class covariance matrices on the projected data. This enables us to assign an importance weight to each variable for a given projection, and to select our signal variables by aggregating these weights over high-scoring projections. Our theory shows that the resulting Sharp-SSL algorithm is able to recover the signal coordinates with high probability when we aggregate over sufficiently many random projections and when the base procedure estimates the whitened between-class covariance matrix sufficiently well. The Gaussian EM algorithm is a natural choice as a base procedure, and we provide a new analysis of its performance in semi-supervised settings that controls the parameter estimation error in terms of the proportion of labeled data in the sample. Numerical results on both simulated data and a real colon tumor dataset support the excellent empirical performance of the method.

Submitted 18 April, 2023; originally announced April 2023.

Comments: 49 pages, 4 figures

MSC Class: 62H30

arXiv:2211.02039 [pdf, other]

The Projected Covariance Measure for assumption-lean variable significance testing

Authors: Anton Rask Lundborg, Ilmun Kim, Rajen D. Shah, Richard J. Samworth

Abstract: Testing the significance of a variable or group of variables $X$ for predicting a response $Y$, given additional covariates $Z$, is a ubiquitous task in statistics. A simple but common approach is to specify a linear model, and then test whether the regression coefficient for $X$ is non-zero. However, when the model is misspecified, the test may have poor power, for example when $X$ is involved in complex interactions, or lead to many false rejections. In this work we study the problem of testing the model-free null of conditional mean independence, i.e. that the conditional mean of $Y$ given $X$ and $Z$ does not depend on $X$. We propose a simple and general framework that can leverage flexible nonparametric or machine learning methods, such as additive models or random forests, to yield both robust error control and high power. The procedure involves using these methods to perform regressions, first to estimate a form of projection of $Y$ on $X$ and $Z$ using one half of the data, and then to estimate the expected conditional covariance between this projection and $Y$ on the remaining half of the data. While the approach is general, we show that a version of our procedure using spline regression achieves what we show is the minimax optimal rate in this nonparametric testing problem. Numerical experiments demonstrate the effectiveness of our approach both in terms of maintaining Type I error control, and power, compared to several existing approaches.

Submitted 7 May, 2024; v1 submitted 3 November, 2022; originally announced November 2022.

Comments: 97 pages, 5 figures

MSC Class: 62G10

arXiv:2205.08627 [pdf, other]

Optimal nonparametric testing of Missing Completely At Random, and its connections to compatibility

Authors: Thomas B Berrett, Richard J Samworth

Abstract: Given a set of incomplete observations, we study the nonparametric problem of testing whether data are Missing Completely At Random (MCAR). Our first contribution is to characterise precisely the set of alternatives that can be distinguished from the MCAR null hypothesis. This reveals interesting and novel links to the theory of Fréchet classes (in particular, compatible distributions) and linear programming, that allow us to propose MCAR tests that are consistent against all detectable alternatives. We define an incompatibility index as a natural measure of ease of detectability, establish its key properties, and show how it can be computed exactly in some cases and bounded in others. Moreover, we prove that our tests can attain the minimax separation rate according to this measure, up to logarithmic factors. Our methodology does not require any complete cases to be effective, and is available in the R package MCARtest.

Submitted 17 May, 2022; originally announced May 2022.

Comments: 66 pages, 4 figures

arXiv:2111.01640 [pdf, other]

Inference in high-dimensional online changepoint detection

Authors: Yudong Chen, Tengyao Wang, Richard J. Samworth

Abstract: We introduce and study two new inferential challenges associated with the sequential detection of change in a high-dimensional mean vector. First, we seek a confidence interval for the changepoint, and second, we estimate the set of indices of coordinates in which the mean changes. We propose an online algorithm that produces an interval with guaranteed nominal coverage, and whose length is, with high probability, of the same order as the average detection delay, up to a logarithmic factor. The corresponding support estimate enjoys control of both false negatives and false positives. Simulations confirm the effectiveness of our methodology, and we also illustrate its applicability on the US excess deaths data from 2017--2020.

Submitted 2 March, 2023; v1 submitted 2 November, 2021; originally announced November 2021.

Comments: 40 pages, 3 figures

arXiv:2109.01077 [pdf, ps, other]

Optimal subgroup selection

Authors: Henry W. J. Reeve, Timothy I. Cannings, Richard J. Samworth

Abstract: In clinical trials and other applications, we often see regions of the feature space that appear to exhibit interesting behaviour, but it is unclear whether these observed phenomena are reflected at the population level. Focusing on a regression setting, we consider the subgroup selection challenge of identifying a region of the feature space on which the regression function exceeds a pre-determined threshold. We formulate the problem as one of constrained optimisation, where we seek a low-complexity, data-dependent selection set on which, with a guaranteed probability, the regression function is uniformly at least as large as the threshold; subject to this constraint, we would like the region to contain as much mass under the marginal feature distribution as possible. This leads to a natural notion of regret, and our main contribution is to determine the minimax optimal rate for this regret in both the sample size and the Type I error probability. The rate involves a delicate interplay between parameters that control the smoothness of the regression function, as well as exponents that quantify the extent to which the optimal selection set at the population level can be approximated by families of well-behaved subsets. Finally, we expand the scope of our previous results by illustrating how they may be generalised to a treatment and control setting, where interest lies in the heterogeneous treatment effect.

Submitted 20 September, 2023; v1 submitted 2 September, 2021; originally announced September 2021.

Comments: 65 pages, 2 figures, to appear in the Annals of Statistics

MSC Class: 62-XX; 62G08; 62Gxx; 62C20

arXiv:2108.01525 [pdf, other]

High-dimensional changepoint estimation with heterogeneous missingness

Authors: Bertille Follain, Tengyao Wang, Richard J. Samworth

Abstract: We propose a new method for changepoint estimation in partially-observed, high-dimensional time series that undergo a simultaneous change in mean in a sparse subset of coordinates. Our first methodological contribution is to introduce a 'MissCUSUM' transformation (a generalisation of the popular Cumulative Sum statistics), that captures the interaction between the signal strength and the level of missingness in each coordinate. In order to borrow strength across the coordinates, we propose to project these MissCUSUM statistics along a direction found as the solution to a penalised optimisation problem tailored to the specific sparsity structure. The changepoint can then be estimated as the location of the peak of the absolute value of the projected univariate series. In a model that allows different missingness probabilities in different component series, we identify that the key interaction between the missingness and the signal is a weighted sum of squares of the signal change in each coordinate, with weights given by the observation probabilities. More specifically, we prove that the angle between the estimated and oracle projection directions, as well as the changepoint location error, are controlled with high probability by the sum of two terms, both involving this weighted sum of squares, and representing the error incurred due to noise and the error due to missingness respectively. A lower bound confirms that our changepoint estimator, which we call 'MissInspect', is optimal up to a logarithmic factor.
The striking effectiveness of the MissInspect methodology is further demonstrated both on simulated data, and on an oceanographic data set covering the Neogene period.

Submitted 3 August, 2021; originally announced August 2021.

Comments: 36 pages, 4 figures

arXiv:2107.07257 [pdf, other]

Nonparametric, tuning-free estimation of S-shaped functions

Authors: Oliver Y. Feng, Yining Chen, Qiyang Han, Raymond J. Carroll, Richard J. Samworth

Abstract: We consider the nonparametric estimation of an S-shaped regression function. The least squares estimator provides a very natural, tuning-free approach, but results in a non-convex optimisation problem, since the inflection point is unknown. We show that the estimator may nevertheless be regarded as a projection onto a finite union of convex cones, which allows us to propose a mixed primal-dual bases algorithm for its efficient, sequential computation. After developing a projection framework that demonstrates the consistency and robustness to misspecification of the estimator, our main theoretical results provide sharp oracle inequalities that yield worst-case and adaptive risk bounds for the estimation of the regression function, as well as a rate of convergence for the estimation of the inflection point. These results reveal not only that the estimator achieves the minimax optimal rate of convergence for both the estimation of the regression function and its inflection point (up to a logarithmic factor in the latter case), but also that it is able to achieve an almost-parametric rate when the true regression function is piecewise affine with not too many affine pieces. Simulations and a real data application to air pollution modelling also confirm the desirable finite-sample properties of the estimator, and our algorithm is implemented in the R package Sshaped.

Submitted 15 July, 2021; originally announced July 2021.

Comments: 79 pages, 10 figures

arXiv:2106.04455 [pdf, other]

Adaptive transfer learning

Authors: Henry W. J. Reeve, Timothy I. Cannings, Richard J. Samworth

Abstract: In transfer learning, we wish to make inference about a target population when we have access to data both from the distribution itself, and from a different but related source distribution. We introduce a flexible framework for transfer learning in the context of binary classification, allowing for covariate-dependent relationships between the source and target distributions that are not required to preserve the Bayes decision boundary. Our main contributions are to derive the minimax optimal rates of convergence (up to poly-logarithmic factors) in this problem, and show that the optimal rate can be achieved by an algorithm that adapts to key aspects of the unknown transfer relationship, as well as the smoothness and tail parameters of our distributional classes. This optimal rate turns out to have several regimes, depending on the interplay between the relative sample sizes and the strength of the transfer relationship, and our algorithm achieves optimality by careful, decision tree-based calibration of local nearest-neighbour procedures.

Submitted 8 June, 2021; originally announced June 2021.

MSC Class: 62G05

arXiv:2105.11387 [pdf, other]

A new computational framework for log-concave density estimation

Authors: Wenyu Chen, Rahul Mazumder, Richard J. Samworth

Abstract: In Statistics, log-concave density estimation is a central problem within the field of nonparametric inference under shape constraints. Despite great progress in recent years on the statistical theory of the canonical estimator, namely the log-concave maximum likelihood estimator, adoption of this method has been hampered by the complexities of the non-smooth convex optimization problem that underpins its computation. We provide enhanced understanding of the structural properties of this optimization problem, which motivates the proposal of new algorithms, based on both randomized and Nesterov smoothing, combined with an appropriate integral discretization of increasing accuracy. We prove that these methods enjoy, both with high probability and in expectation, a convergence rate of order $1/T$ up to logarithmic factors on the objective function scale, where $T$ denotes the number of iterations. The benefits of our new computational framework are demonstrated on both synthetic and real data, and our implementation is available in a github repository \texttt{LogConcComp} (Log-Concave Computation).

Submitted 28 February, 2023; v1 submitted 24 May, 2021; originally announced May 2021.

arXiv:2105.02180 [pdf, other]

A unifying tutorial on Approximate Message Passing

Authors: Oliver Y. Feng, Ramji Venkataramanan, Cynthia Rush, Richard J. Samworth

Abstract: Over the last decade or so, Approximate Message Passing (AMP) algorithms have become extremely popular in various structured high-dimensional statistical problems. The fact that the origins of these techniques can be traced back to notions of belief propagation in the statistical physics literature lends a certain mystique to the area for many statisticians. Our goal in this work is to present the main ideas of AMP from a statistical perspective, to illustrate the power and flexibility of the AMP framework. Along the way, we strengthen and unify many of the results in the existing literature.

Submitted 5 May, 2021; originally announced May 2021.

Comments: 99 pages, 2 figures

arXiv:2101.10880 [pdf, other]

doi 10.1098/rspa.2021.0549

USP: an independence test that improves on Pearson's chi-squared and the $G$-test

Authors: Thomas B. Berrett, Richard J. Samworth

Abstract: We present the $U$-Statistic Permutation (USP) test of independence in the context of discrete data displayed in a contingency table. Either Pearson's chi-squared test of independence, or the $G$-test, are typically used for this task, but we argue that these tests have serious deficiencies, both in terms of their inability to control the size of the test, and their power properties. By contrast, the USP test is guaranteed to control the size of the test at the nominal level for all sample sizes, has no issues with small (or zero) cell counts, and is able to detect distributions that violate independence in only a minimal way. The test statistic is derived from a $U$-statistic estimator of a natural population measure of dependence, and we prove that this is the unique minimum variance unbiased estimator of this population quantity. The practical utility of the USP test is demonstrated on both simulated data, where its power can be dramatically greater than those of Pearson's test and the $G$-test, and on real data. The USP test is implemented in the R package USP.

Submitted 26 January, 2021; originally announced January 2021.

Comments: 27 pages, 7 figures

MSC Class: 62H17; 62H20; 62F03; 62F05; 62E20

arXiv:2009.02609 [pdf, ps, other]

Isotonic regression with unknown permutations: Statistics, computation, and adaptation

Authors: Ashwin Pananjady, Richard J. Samworth

Abstract: Motivated by models for multiway comparison data, we consider the problem of estimating a coordinate-wise isotonic function on the domain $[0, 1]^d$ from noisy observations collected on a uniform lattice, but where the design points have been permuted along each dimension. While the univariate and bivariate versions of this problem have received significant attention, our focus is on the multivariate case $d \geq 3$. We study both the minimax risk of estimation (in empirical $L_2$ loss) and the fundamental limits of adaptation (quantified by the adaptivity index) to a family of piecewise constant functions. We provide a computationally efficient Mirsky partition estimator that is minimax optimal while also achieving the smallest adaptivity index possible for polynomial time procedures. Thus, from a worst-case perspective and in sharp contrast to the bivariate case, the latent permutations in the model do not introduce significant computational difficulties over and above vanilla isotonic regression. On the other hand, the fundamental limits of adaptation are significantly different with and without unknown permutations: Assuming a hardness conjecture from average-case complexity theory, a statistical-computational gap manifests in the former case. In a complementary direction, we show that natural modifications of existing estimators fail to satisfy at least one of the desiderata of optimal worst-case statistical performance, computational efficiency, and fast adaptation.
Along the way to showing our results, we improve adaptation results in the special case $d = 2$ and establish some properties of estimators for vanilla isotonic regression, both of which may be of independent interest.

Submitted 24 June, 2021; v1 submitted 5 September, 2020; originally announced September 2020.

Comments: Version v2 contains reorganized material, one figure, and expanded discussions

arXiv:2003.03668 [pdf, other]

High-dimensional, multiscale online changepoint detection

Authors: Yudong Chen, Tengyao Wang, Richard J. Samworth

Abstract: We introduce a new method for high-dimensional, online changepoint detection in settings where a $p$-variate Gaussian data stream may undergo a change in mean. The procedure works by performing likelihood ratio tests against simple alternatives of different scales in each coordinate, and then aggregating test statistics across scales and coordinates. The algorithm is online in the sense that both its storage requirements and worst-case computational complexity per new observation are independent of the number of previous observations; in practice, it may even be significantly faster than this. We prove that the patience, or average run length under the null, of our procedure is at least at the desired nominal level, and provide guarantees on its response delay under the alternative that depend on the sparsity of the vector of mean change. Simulations confirm the practical effectiveness of our proposal, which is implemented in the R package 'ocd', and we also demonstrate its utility on a seismology data set.

Submitted 10 October, 2020; v1 submitted 7 March, 2020; originally announced March 2020.

Comments: 40 pages, 3 figures

MSC Class: 62H99; 62L99

arXiv:2002.06117 [pdf, ps, other]

Local continuity of log-concave projection, with applications to estimation under model misspecification

Authors: Rina Foygel Barber, Richard J. Samworth

Abstract: The log-concave projection is an operator that maps a d-dimensional distribution P to an approximating log-concave density. Prior work by Dümbgen et al. (2011) establishes that, with suitable metrics on the underlying spaces, this projection is continuous, but not uniformly continuous. In this work we prove a local uniform continuity result for log-concave projection -- in particular, establishing that this map is locally Hölder-(1/4) continuous. A matching lower bound verifies that this exponent cannot be improved. We also examine the implications of this continuity result for the empirical setting -- given a sample drawn from a distribution P, we bound the squared Hellinger distance between the log-concave projection of the empirical distribution of the sample, and the log-concave projection of P. In particular, this yields interesting statistical results for the misspecified setting, where P is not itself log-concave.

Submitted 18 December, 2020; v1 submitted 14 February, 2020; originally announced February 2020.

arXiv:2001.05513 [pdf, other]

Optimal rates for independence testing via $U$-statistic permutation tests

Authors: Thomas B. Berrett, Ioannis Kontoyiannis, Richard J. Samworth

Abstract: We study the problem of independence testing given independent and identically distributed pairs taking values in a $σ$-finite, separable measure space. Defining a natural measure of dependence $D(f)$ as the squared $L^2$-distance between a joint density $f$ and the product of its marginals, we first show that there is no valid test of independence that is uniformly consistent against alternatives of the form $\{f: D(f) \geq ρ^2 \}$. We therefore restrict attention to alternatives that impose additional Sobolev-type smoothness constraints, and define a permutation test based on a basis expansion and a $U$-statistic estimator of $D(f)$ that we prove is minimax optimal in terms of its separation rates in many instances. Finally, for the case of a Fourier basis on $[0,1]^2$, we provide an approximation to the power function that offers several additional insights. Our methodology is implemented in the R package USP.

Submitted 6 November, 2020; v1 submitted 15 January, 2020; originally announced January 2020.

Comments: 58 pages, 4 figures

MSC Class: 62C20; 62G10; 62H20

arXiv:1908.03606 [pdf, other]

Goodness-of-fit testing in high-dimensional generalized linear models

Authors: Jana Janková, Rajen D. Shah, Peter Bühlmann, Richard J. Samworth

Abstract: We propose a family of tests to assess the goodness-of-fit of a high-dimensional generalized linear model. Our framework is flexible and may be used to construct an omnibus test or directed against testing specific non-linearities and interaction effects, or for testing the significance of groups of variables. The methodology is based on extracting left-over signal in the residuals from an initial fit of a generalized linear model. This can be achieved by predicting this signal from the residuals using modern flexible regression or machine learning methods such as random forests or boosted trees. Under the null hypothesis that the generalized linear model is correct, no signal is left in the residuals and our test statistic has a Gaussian limiting distribution, translating to asymptotic control of type I error. Under a local alternative, we establish a guarantee on the power of the test. We illustrate the effectiveness of the methodology on simulated and real data examples by testing goodness-of-fit in logistic regression models. Software implementing the methodology is available in the R package `GRPtests'.

Submitted 12 November, 2019; v1 submitted 9 August, 2019; originally announced August 2019.

Comments: 40 pages, 4 figures

arXiv:1907.10012 [pdf, other]

Minimax rates in sparse, high-dimensional changepoint detection

Authors: Haoyang Liu, Chao Gao, Richard J. Samworth

Abstract: We study the detection of a sparse change in a high-dimensional mean vector as a minimax testing problem. Our first main contribution is to derive the exact minimax testing rate across all parameter regimes for $n$ independent, $p$-variate Gaussian observations. This rate exhibits a phase transition when the sparsity level is of order $\sqrt{p \log \log (8n)}$ and has a very delicate dependence on the sample size: in a certain sparsity regime it involves a triple iterated logarithmic factor in $n$. Further, in a dense asymptotic regime, we identify the sharp leading constant, while in the corresponding sparse asymptotic regime, this constant is determined to within a factor of $\sqrt{2}$. Extensions that cover spatial and temporal dependence, primarily in the dense case, are also provided.

Submitted 17 November, 2020; v1 submitted 23 July, 2019; originally announced July 2019.

arXiv:1906.12125 [pdf, other]

High-dimensional principal component analysis with heterogeneous missingness

Authors: Ziwei Zhu, Tengyao Wang, Richard J. Samworth

Abstract: We study the problem of high-dimensional Principal Component Analysis (PCA) with missing observations. In simple, homogeneous missingness settings with a noise level of constant order, we show that an existing inverse-probability weighted (IPW) estimator of the leading principal components can (nearly) attain the minimax optimal rate of convergence. However, deeper investigation reveals both that, particularly in more realistic settings where the missingness mechanism is heterogeneous, the empirical performance of the IPW estimator can be unsatisfactory, and moreover that, in the noiseless case, it fails to provide exact recovery of the principal components. Our main contribution, then, is to introduce a new method for high-dimensional PCA, called `primePCA', that is designed to cope with situations where observations may be missing in a heterogeneous manner. Starting from the IPW estimator, primePCA iteratively projects the observed entries of the data matrix onto the column space of our current estimate to impute the missing entries, and then updates our estimate by computing the leading right singular space of the imputed data matrix. It turns out that the interaction between the heterogeneity of missingness and the low-dimensional structure is crucial in determining the feasibility of the problem. We therefore introduce an incoherence condition on the principal components and prove that in the noiseless case, the error of primePCA converges to zero at a geometric rate when the signal strength is not too small. An important feature of our theoretical guarantees is that they depend on average, as opposed to worst-case, properties of the missingness mechanism. Our numerical studies on both simulated and real data reveal that primePCA exhibits very encouraging performance across a wide range of scenarios.

Submitted 28 June, 2019; originally announced June 2019.

Comments: 42 pages, 4 figures

MSC Class: 62H25

arXiv:1904.09347 [pdf, ps, other]

Efficient functional estimation and the super-oracle phenomenon

Authors: Thomas B. Berrett, Richard J. Samworth

Abstract: We consider the estimation of two-sample integral functionals, of the type that occur naturally, for example, when the object of interest is a divergence between unknown probability densities. Our first main result is that, in wide generality, a weighted nearest neighbour estimator is efficient, in the sense of achieving the local asymptotic minimax lower bound. Moreover, we also prove a corresponding central limit theorem, which facilitates the construction of asymptotically valid confidence intervals for the functional, having asymptotically minimal width. One interesting consequence of our results is the discovery that, for certain functionals, the worst-case performance of our estimator may improve on that of the natural `oracle' estimator, which is given access to the values of the unknown densities at the observations.

Submitted 30 January, 2023; v1 submitted 18 April, 2019; originally announced April 2019.

Comments: 76 pages

MSC Class: 62G05; 62G20

arXiv:1903.06092 [pdf, other]

High-dimensional nonparametric density estimation via symmetry and shape constraints

Authors: Min Xu, Richard J. Samworth

Abstract: We tackle the problem of high-dimensional nonparametric density estimation by taking the class of log-concave densities on $\mathbb{R}^p$ and incorporating within it symmetry assumptions, which facilitate scalable estimation algorithms and can mitigate the curse of dimensionality. Our main symmetry assumption is that the super-level sets of the density are $K$-homothetic (i.e. scalar multiples of a convex body $K \subseteq \mathbb{R}^p$). When $K$ is known, we prove that the $K$-homothetic log-concave maximum likelihood estimator based on $n$ independent observations from such a density has a worst-case risk bound with respect to, e.g., squared Hellinger loss, of $O(n^{-4/5})$, independent of $p$. Moreover, we show that the estimator is adaptive in the sense that if the data generating density admits a special form, then a nearly parametric rate may be attained. We also provide worst-case and adaptive risk bounds in cases where $K$ is only known up to a positive definite transformation, and where it is completely unknown and must be estimated nonparametrically. Our estimation algorithms are fast even when $n$ and $p$ are on the order of hundreds of thousands, and we illustrate the strong finite-sample performance of our methods on simulated data.

Submitted 14 March, 2019; originally announced March 2019.

Comments: 93 pages; 5 figures

MSC Class: 62G07

arXiv:1812.11634 [pdf, other]

Adaptation in multivariate log-concave density estimation

Authors: Oliver Y. Feng, Adityanand Guntuboyina, Arlene K. H. Kim, Richard J. Samworth

Abstract: We study the adaptation properties of the multivariate log-concave maximum likelihood estimator over three subclasses of log-concave densities. The first consists of densities with polyhedral support whose logarithms are piecewise affine. The complexity of such densities $f$ can be measured in terms of the sum $Γ(f)$ of the numbers of facets of the subdomains in the polyhedral subdivision of the support induced by $f$. Given $n$ independent observations from a $d$-dimensional log-concave density with $d \in \{2,3\}$, we prove a sharp oracle inequality, which in particular implies that the Kullback--Leibler risk of the log-concave maximum likelihood estimator for such densities is bounded above by $Γ(f)/n$, up to a polylogarithmic factor. Thus, the rate can be essentially parametric, even in this multivariate setting. For the second type of adaptation, we consider densities that are bounded away from zero on a polytopal support; we show that up to polylogarithmic factors, the log-concave maximum likelihood estimator attains the rate $n^{-4/7}$ when $d=3$, which is faster than the worst-case rate of $n^{-1/2}$. Finally, our third type of subclass consists of densities whose contours are well-separated; these new classes are constructed to be affine invariant and turn out to contain a wide variety of densities, including those that satisfy Hölder regularity conditions. Here, we prove another sharp oracle inequality, which reveals in particular that the log-concave maximum likelihood estimator attains a risk bound of order $n^{-\min\bigl(\frac{β+3}{β+7},\frac{4}{7}\bigr)}$ when $d=3$ over the class of $β$-Hölder log-concave densities with $β\in (1,3]$, again up to a polylogarithmic factor.

Submitted 18 October, 2019; v1 submitted 30 December, 2018; originally announced December 2018.

Comments: 97 pages, 6 figures

MSC Class: 62G07; 62G20

arXiv:1807.05405 [pdf, other]

The conditional permutation test for independence while controlling for confounders

Authors: Thomas B. Berrett, Yi Wang, Rina Foygel Barber, Richard J. Samworth

Abstract: We propose a general new method, the conditional permutation test, for testing the conditional independence of variables $X$ and $Y$ given a potentially high-dimensional random vector $Z$ that may contain confounding factors. The proposed test permutes entries of $X$ non-uniformly, so as to respect the existing dependence between $X$ and $Z$ and thus account for the presence of these confounders. Like the conditional randomization test of Candès et al. (2018), our test relies on the availability of an approximation to the distribution of $X \mid Z$. While Candès et al. (2018)'s test uses this estimate to draw new $X$ values, for our test we use this approximation to design an appropriate non-uniform distribution on permutations of the $X$ values already seen in the true data. We provide an efficient Markov Chain Monte Carlo sampler for the implementation of our method, and establish bounds on the Type I error in terms of the error in the approximation of the conditional distribution of $X\mid Z$, finding that, for the worst case test statistic, the inflation in Type I error of the conditional permutation test is no larger than that of the conditional randomization test. We validate these theoretical results with experiments on simulated data and on the Capital Bikeshare data set.

Submitted 7 May, 2019; v1 submitted 14 July, 2018; originally announced July 2018.

Comments: 31 pages, 4 figures

arXiv:1805.11505 [pdf, ps, other]

Classification with imperfect training labels

Authors: Timothy I. Cannings, Yingying Fan, Richard J. Samworth

Abstract: We study the effect of imperfect training data labels on the performance of classification methods. In a general setting, where the probability that an observation in the training dataset is mislabelled may depend on both the feature vector and the true label, we bound the excess risk of an arbitrary classifier trained with imperfect labels in terms of its excess risk for predicting a noisy label. This reveals conditions under which a classifier trained with imperfect labels remains consistent for classifying uncorrupted test data points. Furthermore, under stronger conditions, we derive detailed asymptotic properties for the popular $k$-nearest neighbour ($k$nn), support vector machine (SVM) and linear discriminant analysis (LDA) classifiers. One consequence of these results is that the $k$nn and SVM classifiers are robust to imperfect training labels, in the sense that the rate of convergence of the excess risks of these classifiers remains unchanged; in fact, our theoretical and empirical results even show that in some cases, imperfect labels may improve the performance of these methods. On the other hand, the LDA classifier is shown to be typically inconsistent in the presence of label noise unless the prior probabilities of each class are equal. Our theoretical results are supported by a simulation study.

Submitted 6 May, 2019; v1 submitted 29 May, 2018; originally announced May 2018.

Comments: 44 pages, 7 figures

MSC Class: 62H30

arXiv:1803.01150 [pdf, other]

Confidence intervals for high-dimensional Cox models

Authors: Yi Yu, Jelena Bradic, Richard J. Samworth

Abstract: The purpose of this paper is to construct confidence intervals for the regression coefficients in high-dimensional Cox proportional hazards regression models where the number of covariates may be larger than the sample size. Our debiased estimator construction is similar to those in Zhang and Zhang (2014) and van de Geer et al. (2014), but the time-dependent covariates and censored risk sets introduce considerable additional challenges. Our theoretical results, which provide conditions under which our confidence intervals are asymptotically valid, are supported by extensive numerical experiments.

Submitted 3 March, 2018; originally announced March 2018.

Comments: 36 pages, 1 figure

MSC Class: 62N02; 62N03

arXiv:1712.05630 [pdf, other]

Sparse principal component analysis via axis-aligned random projections

Authors: Milana Gataric, Tengyao Wang, Richard J. Samworth

Abstract: We introduce a new method for sparse principal component analysis, based on the aggregation of eigenvector information from carefully-selected axis-aligned random projections of the sample covariance matrix. Unlike most alternative approaches, our algorithm is non-iterative, so is not vulnerable to a bad choice of initialisation. We provide theoretical guarantees under which our principal subspace estimator can attain the minimax optimal rate of convergence in polynomial time. In addition, our theory provides a more refined understanding of the statistical and computational trade-off in the problem of sparse principal component estimation, revealing a subtle interplay between the effective sample size and the number of random projections that are required to achieve the minimax optimal rate. Numerical studies provide further insight into the procedure and confirm its highly competitive finite-sample performance.

Submitted 6 May, 2019; v1 submitted 15 December, 2017; originally announced December 2017.

Comments: 32 pages

MSC Class: 62H25

arXiv:1711.06642 [pdf, other]

Nonparametric independence testing via mutual information

Authors: Thomas B. Berrett, Richard J. Samworth

Abstract: We propose a test of independence of two multivariate random vectors, given a sample from the underlying population. Our approach, which we call MINT, is based on the estimation of mutual information, whose decomposition into joint and marginal entropies facilitates the use of recently-developed efficient entropy estimators derived from nearest neighbour distances. The proposed critical values, which may be obtained from simulation (in the case where one marginal is known) or resampling, guarantee that the test has nominal size, and we provide local power analyses, uniformly over classes of densities whose mutual information satisfies a lower bound. Our ideas may be extended to provide new goodness-of-fit tests of normal linear models based on assessing the independence of our vector of covariates and an appropriately-defined notion of an error vector. The theory is supported by numerical studies on both simulated and real data.

Submitted 17 November, 2017; originally announced November 2017.

Comments: 46 pages, 2 figures

MSC Class: 62G10

arXiv:1709.03154 [pdf, other]

Recent progress in log-concave density estimation

Authors: Richard J. Samworth

Abstract: In recent years, log-concave density estimation via maximum likelihood estimation has emerged as a fascinating alternative to traditional nonparametric smoothing techniques, such as kernel density estimation, which require the choice of one or more bandwidths. The purpose of this article is to describe some of the properties of the class of log-concave densities on $\mathbb{R}^d$ which make it so attractive from a statistical perspective, and to outline the latest methodological, theoretical and computational advances in the area.

Submitted 10 September, 2017; originally announced September 2017.

Comments: 25 pages, 8 figures

MSC Class: 62G05; 62G07

arXiv:1708.09468 [pdf, ps, other]

Isotonic regression in general dimensions

Authors: Qiyang Han, Tengyao Wang, Sabyasachi Chatterjee, Richard J. Samworth

Abstract: We study the least squares regression function estimator over the class of real-valued functions on $[0,1]^d$ that are increasing in each coordinate. For uniformly bounded signals and with a fixed, cubic lattice design, we establish that the estimator achieves the minimax rate of order $n^{-\min\{2/(d+2),1/d\}}$ in the empirical $L_2$ loss, up to poly-logarithmic factors. Further, we prove a sharp oracle inequality, which reveals in particular that when the true regression function is piecewise constant on $k$ hyperrectangles, the least squares estimator enjoys a faster, adaptive rate of convergence of $(k/n)^{\min(1,2/d)}$, again up to poly-logarithmic factors. Previous results are confined to the case $d \leq 2$. Finally, we establish corresponding bounds (which are new even in the case $d=2$) in the more challenging random design setting. There are two surprising features of these results: first, they demonstrate that it is possible for a global empirical risk minimisation procedure to be rate optimal up to poly-logarithmic factors even when the corresponding entropy integral for the function class diverges rapidly; second, they indicate that the adaptation rate for shape-constrained estimators can be strictly worse than the parametric rate.

Submitted 30 August, 2017; originally announced August 2017.

Comments: 36 pages

MSC Class: 62G08; 62G05

arXiv:1704.00642 [pdf, ps, other]

Local nearest neighbour classification with applications to semi-supervised learning

Authors: Timothy I. Cannings, Thomas B. Berrett, Richard J. Samworth

Abstract: We derive a new asymptotic expansion for the global excess risk of a local-$k$-nearest neighbour classifier, where the choice of $k$ may depend upon the test point. This expansion elucidates conditions under which the dominant contribution to the excess risk comes from the decision boundary of the optimal Bayes classifier, but we also show that if these conditions are not satisfied, then the dominant contribution may arise from the tails of the marginal distribution of the features. Moreover, we prove that, provided the $d$-dimensional marginal distribution of the features has a finite $ρ$th moment for some $ρ> 4$ (as well as other regularity conditions), a local choice of $k$ can yield a rate of convergence of the excess risk of $O(n^{-4/(d+4)})$, where $n$ is the sample size, whereas for the standard $k$-nearest neighbour classifier, our theory would require $d \geq 5$ and $ρ> 4d/(d-4)$ finite moments to achieve this rate. These results motivate a new $k$-nearest neighbour classifier for semi-supervised learning problems, where the unlabelled data are used to obtain an estimate of the marginal feature density, and fewer neighbours are used for classification when this density estimate is small. Our worst-case rates are complemented by a minimax lower bound, which reveals that the local, semi-supervised $k$-nearest neighbour classifier attains the minimax optimal rate over our classes for the excess risk, up to a subpolynomial factor in $n$. These theoretical improvements over the standard $k$-nearest neighbour classifier are also illustrated through a simulation study.

Submitted 18 May, 2019; v1 submitted 3 April, 2017; originally announced April 2017.

Comments: 60 pages

MSC Class: 62G20

arXiv:1703.10143 [pdf, ps, other]

Comments on `High-dimensional simultaneous inference with the bootstrap'

Authors: Richard A. Lockhart, Richard J. Samworth

Abstract: We provide some comments on the article `High-dimensional simultaneous inference with the bootstrap' by Ruben Dezeure, Peter Bühlmann and Cun-Hui Zhang.

Submitted 29 March, 2017; originally announced March 2017.

Comments: 5 pages

arXiv:1609.00861 [pdf, ps, other]

Adaptation in log-concave density estimation

Authors: Arlene K. H. Kim, Adityanand Guntuboyina, Richard J. Samworth

Abstract: The log-concave maximum likelihood estimator of a density on the real line based on a sample of size $n$ is known to attain the minimax optimal rate of convergence of $O(n^{-4/5})$ with respect to, e.g., squared Hellinger distance. In this paper, we show that it also enjoys attractive adaptation properties, in the sense that it achieves a faster rate of convergence when the logarithm of the true density is $k$-affine (i.e. made up of $k$ affine pieces), provided $k$ is not too large. Our results use two different techniques: the first relies on a new Marshall's inequality for log-concave density estimation, and reveals that when the true density is close to log-linear on its support, the log-concave maximum likelihood estimator can achieve the parametric rate of convergence in total variation distance. Our second approach depends on local bracketing entropy methods, and allows us to prove a sharp oracle inequality, which implies in particular that the rate of convergence with respect to various global loss functions, including Kullback--Leibler divergence, is $O\bigl(\frac{k}{n}\log^{5/4} n\bigr)$ when the true density is log-concave and its logarithm is close to $k$-affine.

Submitted 3 September, 2016; originally announced September 2016.

Comments: 38 pages

MSC Class: 62G05; 62G07

arXiv:1606.06246 [pdf, other]

High-dimensional changepoint estimation via sparse projection

Authors: Tengyao Wang, Richard J. Samworth

Abstract: Changepoints are a very common feature of Big Data that arrive in the form of a data stream. In this paper, we study high-dimensional time series in which, at certain time points, the mean structure changes in a sparse subset of the coordinates. The challenge is to borrow strength across the coordinates in order to detect smaller changes than could be observed in any individual component series. We propose a two-stage procedure called `inspect' for estimation of the changepoints: first, we argue that a good projection direction can be obtained as the leading left singular vector of the matrix that solves a convex optimisation problem derived from the CUSUM transformation of the time series. We then apply an existing univariate changepoint estimation algorithm to the projected series. Our theory provides strong guarantees on both the number of estimated changepoints and the rates of convergence of their locations, and our numerical studies validate its highly competitive empirical performance for a wide range of data generating mechanisms. Software implementing the methodology is available in the R package `InspectChangepoint'.

Submitted 17 March, 2017; v1 submitted 20 June, 2016; originally announced June 2016.

Comments: 59 pages, 6 figures

MSC Class: 62H99
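
Illustrative note on the entry above: the abstract refers to the CUSUM transformation of a time series followed by a univariate changepoint step. The sketch below is a minimal, standard-library Python illustration of a univariate CUSUM statistic and its maximiser, assuming the common definition $T_t = \sqrt{t(n-t)/n}\,\bigl(\bar{x}_{(t+1):n} - \bar{x}_{1:t}\bigr)$. It is not the `inspect' procedure: the sparse projection step is omitted entirely, and the function names `cusum` and `estimate_changepoint` are hypothetical.

```python
import math


def cusum(x):
    """Return the CUSUM statistics [T_1, ..., T_{n-1}] of a univariate series.

    Assumed definition (not taken from the abstract):
    T_t = sqrt(t (n - t) / n) * (mean of x[t:] - mean of x[:t]).
    """
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(x)
    left = 0.0
    stats = []
    for t in range(1, n):
        left += x[t - 1]        # running sum of the first t observations
        right = total - left    # sum of the remaining n - t observations
        scale = math.sqrt(t * (n - t) / n)
        stats.append(scale * (right / (n - t) - left / t))
    return stats


def estimate_changepoint(x):
    """Return the t in {1, ..., n-1} that maximises |T_t|."""
    stats = cusum(x)
    best = max(range(len(stats)), key=lambda i: abs(stats[i]))
    return best + 1  # number of observations before the estimated change


# Noiseless toy series whose mean jumps after the fifth observation.
series = [0.0] * 5 + [1.0] * 5
print(estimate_changepoint(series))
```

On this noiseless toy series the statistic is maximised at the true change location; with noisy data the maximiser is only an estimate.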

arXiv:1606.00304 [pdf, ps, other]

Efficient multivariate entropy estimation via $k$-nearest neighbour distances

Authors: Thomas B. Berrett, Richard J. Samworth, Ming Yuan

Abstract: Many statistical procedures, including goodness-of-fit tests and methods for independent component analysis, rely critically on the estimation of the entropy of a distribution. In this paper, we seek entropy estimators that are efficient and achieve the local asymptotic minimax lower bound with respect to squared error loss. To this end, we study weighted averages of the estimators originally proposed by Kozachenko and Leonenko (1987), based on the $k$-nearest neighbour distances of a sample of $n$ independent and identically distributed random vectors in $\mathbb{R}^d$. A careful choice of weights enables us to obtain an efficient estimator in arbitrary dimensions, given sufficient smoothness, while the original unweighted estimator is typically only efficient when $d \leq 3$. In addition to the new estimator proposed and theoretical understanding provided, our results facilitate the construction of asymptotically valid confidence intervals for the entropy of asymptotically minimal width.

Submitted 22 June, 2017; v1 submitted 1 June, 2016; originally announced June 2016.

Comments: 69 pages, 0 figures

MSC Class: 62G05; 62G20

arXiv:1408.5369 [pdf, ps, other]

doi 10.1214/15-AOS1369

Statistical and computational trade-offs in estimation of sparse principal components

Authors: Tengyao Wang, Quentin Berthet, Richard J. Samworth

Abstract: In recent years, sparse principal component analysis has emerged as an extremely popular dimension reduction technique for high-dimensional data. The theoretical challenge, in the simplest case, is to estimate the leading eigenvector of a population covariance matrix under the assumption that this eigenvector is sparse. An impressive range of estimators have been proposed; some of these are fast to compute, while others are known to achieve the minimax optimal rate over certain Gaussian or sub-Gaussian classes. In this paper, we show that, under a widely-believed assumption from computational complexity theory, there is a fundamental trade-off between statistical and computational performance in this problem. More precisely, working with new, larger classes satisfying a restricted covariance concentration condition, we show that there is an effective sample size regime in which no randomised polynomial time algorithm can achieve the minimax optimal rate. We also study the theoretical performance of a (polynomial time) variant of the well-known semidefinite relaxation estimator, revealing a subtle interplay between statistical and computational efficiency.

Submitted 28 September, 2016; v1 submitted 22 August, 2014; originally announced August 2014.

Comments: Published at http://dx.doi.org/10.1214/15-AOS1369 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS1369

Journal ref: Annals of Statistics 2016, Vol. 44, No. 5, 1896-1930

arXiv:1405.0680 [pdf, ps, other]

A useful variant of the Davis--Kahan theorem for statisticians

Authors: Yi Yu, Tengyao Wang, Richard J. Samworth

Abstract: The Davis--Kahan theorem is used in the analysis of many statistical procedures to bound the distance between subspaces spanned by population eigenvectors and their sample versions. It relies on an eigenvalue separation condition between certain relevant population and sample eigenvalues. We present a variant of this result that depends only on a population eigenvalue separation condition, making it more natural and convenient for direct application in statistical contexts, and improving the bounds in some cases. We also provide an extension to situations where the matrices under study may be asymmetric or even non-square, and where interest is in the distance between subspaces spanned by corresponding singular vectors.

Submitted 4 May, 2014; originally announced May 2014.

Comments: 12 pages

MSC Class: 62H25

arXiv:1404.2957 [pdf, ps, other]

Generalised additive and index models with shape constraints

Authors: Yining Chen, Richard J. Samworth

Abstract: We study generalised additive models, with shape restrictions (e.g. monotonicity, convexity, concavity) imposed on each component of the additive prediction function. We show that this framework facilitates a nonparametric estimator of each additive component, obtained by maximising the likelihood. The procedure is free of tuning parameters and under mild conditions is proved to be uniformly consistent on compact intervals. More generally, our methodology can be applied to generalised additive index models. Here again, the procedure can be justified on theoretical grounds and, like the original algorithm, possesses highly competitive finite-sample performance. Practical utility is illustrated through the use of these methods in the analysis of two real datasets. Our algorithms are publicly available in the \texttt{R} package \textbf{scar}, short for \textbf{s}hape-\textbf{c}onstrained \textbf{a}dditive \textbf{r}egression.

Submitted 10 April, 2014; originally announced April 2014.

Comments: 50 pages

MSC Class: 62G08

arXiv:1404.2298 [pdf, other]

Global rates of convergence in log-concave density estimation

Authors: Arlene K. H. Kim, Richard J. Samworth

Abstract: The estimation of a log-concave density on $\mathbb{R}^d$ represents a central problem in the area of nonparametric inference under shape constraints. In this paper, we study the performance of log-concave density estimators with respect to global loss functions, and adopt a minimax approach. We first show that no statistical procedure based on a sample of size $n$ can estimate a log-concave density with respect to the squared Hellinger loss function with supremum risk smaller than order $n^{-4/5}$, when $d=1$, and order $n^{-2/(d+1)}$ when $d \geq 2$. In particular, this reveals a sense in which, when $d \geq 3$, log-concave density estimation is fundamentally more challenging than the estimation of a density with two bounded derivatives (a problem to which it has been compared). Second, we show that for $d \leq 3$, the Hellinger $ε$-bracketing entropy of a class of log-concave densities with small mean and covariance matrix close to the identity grows like $\max\{ε^{-d/2},ε^{-(d-1)}\}$ (up to a logarithmic factor when $d=2$). This enables us to prove that when $d \leq 3$ the log-concave maximum likelihood estimator achieves the minimax optimal rate (up to logarithmic factors when $d = 2,3$) with respect to squared Hellinger loss.

Submitted 26 September, 2015; v1 submitted 8 April, 2014; originally announced April 2014.

Comments: 58 pages, 2 figures

arXiv:1206.0457 [pdf, ps, other]

Independent component analysis via nonparametric maximum likelihood estimation

Authors: Richard J. Samworth, Ming Yuan

Abstract: Independent Component Analysis (ICA) models are very popular semiparametric models in which we observe independent copies of a random vector $X = AS$, where $A$ is a non-singular matrix and $S$ has independent components. We propose a new way of estimating the unmixing matrix $W = A^{-1}$ and the marginal distributions of the components of $S$ using nonparametric maximum likelihood. Specifically, we study the projection of the empirical distribution onto the subset of ICA distributions having log-concave marginals. We show that, from the point of view of estimating the unmixing matrix, it makes no difference whether or not the log-concavity is correctly specified. The approach is further justified by both theoretical results and a simulation study.

Submitted 3 June, 2012; originally announced June 2012.

Comments: 28 pages, 6 figures

MSC Class: 62G07; 62G20

arXiv:1105.5578 [pdf, ps, other]

doi 10.1111/j.1467-9868.2011.01034.x

Variable selection with error control: Another look at Stability Selection

Authors: Rajen D. Shah, Richard J. Samworth

Abstract: Stability Selection was recently introduced by Meinshausen and Buhlmann (2010) as a very general technique designed to improve the performance of a variable selection algorithm. It is based on aggregating the results of applying a selection procedure to subsamples of the data. We introduce a variant, called Complementary Pairs Stability Selection (CPSS), and derive bounds both on the expected number of variables included by CPSS that have low selection probability under the original procedure, and on the expected number of high selection probability variables that are excluded. These results require no (e.g. exchangeability) assumptions on the underlying model or on the quality of the original selection procedure. Under reasonable shape restrictions, the bounds can be further tightened, yielding improved error control, and therefore increasing the applicability of the methodology.

Submitted 5 October, 2011; v1 submitted 27 May, 2011; originally announced May 2011.

Comments: 25 pages, 9 figures

arXiv:1102.1191 [pdf, ps, other]

doi 10.5705/ss.2011.224

Smoothed log-concave maximum likelihood estimation with applications

Authors: Yining Chen, Richard J. Samworth

Abstract: We study the smoothed log-concave maximum likelihood estimator of a probability distribution on $\mathbb{R}^d$. This is a fully automatic nonparametric density estimator, obtained as a canonical smoothing of the log-concave maximum likelihood estimator. We demonstrate its attractive features both through an analysis of its theoretical properties and a simulation study. Moreover, we use our methodology to develop a new test of log-concavity, and show how the estimator can be used as an intermediate stage of more involved procedures, such as constructing a classifier or estimating a functional of the density. Here again, the use of these procedures can be justified both on theoretical grounds and through its finite sample performance, and we illustrate its use in a breast cancer diagnosis (classification) problem.

Submitted 10 June, 2012; v1 submitted 6 February, 2011; originally announced February 2011.

Comments: 29 pages, 3 figures

MSC Class: 62G07; 62E17; 62P10

Journal ref: Statist. Sinica. 23 (2013), 1373-1398

arXiv:1101.5783 [pdf, ps, other]

doi 10.1214/12-AOS1049

Optimal weighted nearest neighbour classifiers

Authors: Richard J. Samworth

Abstract: We derive an asymptotic expansion for the excess risk (regret) of a weighted nearest-neighbour classifier. This allows us to find the asymptotically optimal vector of nonnegative weights, which has a rather simple form. We show that the ratio of the regret of this classifier to that of an unweighted k-nearest neighbour classifier depends asymptotically only on the dimension d of the feature vectors, and not on the underlying populations. The improvement is greatest when d=4, but thereafter decreases as $d\rightarrow\infty$. The popular bagged nearest neighbour classifier can also be regarded as a weighted nearest neighbour classifier, and we show that its corresponding weights are somewhat suboptimal when d is small (in particular, worse than those of the unweighted k-nearest neighbour classifier when d=1), but are close to optimal when d is large. Finally, we argue that improvements in the rate of convergence are possible under stronger smoothness assumptions, provided we allow negative weights. Our findings are supported by an empirical performance comparison on both simulated and real data sets.

Submitted 18 February, 2013; v1 submitted 30 January, 2011; originally announced January 2011.

Comments: Published at http://dx.doi.org/10.1214/12-AOS1049 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS1049

Journal ref: Annals of Statistics 2012, Vol. 40, No. 5, 2733-2763

arXiv:1010.0591 [pdf, ps, other]

doi 10.1214/09-AOS766

Asymptotics and optimal bandwidth selection for highest density region estimation

Authors: R. J. Samworth, M. P. Wand

Abstract: We study kernel estimation of highest-density regions (HDR). Our main contributions are two-fold. First, we derive a uniform-in-bandwidth asymptotic approximation to a risk that is appropriate for HDR estimation. This approximation is then used to derive a bandwidth selection rule for HDR estimation possessing attractive asymptotic properties. We also present the results of numerical studies that illustrate the benefits of our theory and methodology.

Submitted 4 October, 2010; originally announced October 2010.

Comments: Published at http://dx.doi.org/10.1214/09-AOS766 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS766

Journal ref: Annals of Statistics 2010, Vol. 38, No. 3, 1767-1792

arXiv:0810.5276 [pdf, ps, other]

doi 10.1214/07-AOS537

Choice of neighbor order in nearest-neighbor classification

Authors: Peter Hall, Byeong U. Park, Richard J. Samworth

Abstract: The $k$th-nearest neighbor rule is arguably the simplest and most intuitively appealing nonparametric classification procedure. However, application of this method is inhibited by lack of knowledge about its properties, in particular, about the manner in which it is influenced by the value of $k$; and by the absence of techniques for empirical choice of $k$. In the present paper we detail the way in which the value of $k$ determines the misclassification error. We consider two models, Poisson and Binomial, for the training samples. Under the first model, data are recorded in a Poisson stream and are "assigned" to one or other of the two populations in accordance with the prior probabilities. In particular, the total number of data in both training samples is a Poisson-distributed random variable. Under the Binomial model, however, the total number of data in the training samples is fixed, although again each data value is assigned in a random way. Although the values of risk and regret associated with the Poisson and Binomial models are different, they are asymptotically equivalent to first order, and also to the risks associated with kernel-based classifiers that are tailored to the case of two derivatives. These properties motivate new methods for choosing the value of $k$.

Submitted 29 October, 2008; originally announced October 2008.

Comments: Published at http://dx.doi.org/10.1214/07-AOS537 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS537

MSC Class: 62H30 (Primary); 62G20 (Secondary)

Journal ref: Annals of Statistics 2008, Vol. 36, No. 5, 2135-2152

Showing 1–46 of 46 results for author: Samworth, R J