Search | arXiv e-print repository

arXiv:1912.07879

Nonparametric density estimation for intentionally corrupted functional data

Authors: Aurore Delaigle, Alexander Meister

Abstract: We consider statistical models where functional data are artificially contaminated by independent Wiener processes in order to satisfy privacy constraints. We show that the corrupted observations have a Wiener density which determines the distribution of the original functional random variables, masked near the origin, uniquely, and we construct a nonparametric estimator of that density. We derive an upper bound for its mean integrated squared error which has a polynomial convergence rate, and we establish an asymptotic lower bound on the minimax convergence rates which is close to the rate attained by our estimator. Our estimator requires the choice of a basis and of two smoothing parameters. We propose data-driven ways of choosing them and prove that the asymptotic quality of our estimator is not significantly affected by the empirical parameter selection. We examine the numerical performance of our method via simulated examples.

Submitted 17 December, 2019; originally announced December 2019.

MSC Class: 62G07; 62M99; 62H30

arXiv:1809.06038

doi 10.3847/1538-4357/aae20a

Nonparametric estimation of the size and waiting time distributions of pulsar glitches

Authors: G. Howitt, A. Melatos, A. Delaigle

Abstract: Glitch size and waiting time probability density functions (PDFs) are estimated for the five pulsars that have glitched most using the nonparametric kernel density estimator. Two objects exhibit decreasing size and waiting time PDFs. Their activity is Poisson-like, and their size statistics are approximately scale-invariant. Three objects exhibit a statistically significant local maximum in the PDFs, including one (PSR J1341$-$6220) which was classified as Poisson-like in previous analyses. Their activity is quasiperiodic, although the dispersion in waiting times is relatively broad. The classification is robust: it is preserved across a wide range of bandwidth choices. There is no compelling evidence for multimodality, but this issue should be revisited when more data become available. The implications for superfluid vortex avalanche models of pulsar glitches are explored briefly.

Submitted 17 September, 2018; originally announced September 2018.

Comments: ApJ accepted. 20 pages, 4 figures

arXiv:1801.06669

doi 10.1093/biomet/asy006

A frequency domain analysis of the error distribution from noisy high-frequency data

Authors: Jinyuan Chang, Aurore Delaigle, Peter Hall, Cheng Yong Tang

Abstract: Data observed at high sampling frequency are typically assumed to be an additive composite of a relatively slow-varying continuous-time component, a latent stochastic process or a smooth random function, and measurement error. Supposing that the latent component is an Itô diffusion process, we propose to estimate the measurement error density function by applying a deconvolution technique with appropriate localization. Our estimator, which does not require equally-spaced observed times, is consistent and minimax rate optimal. We also investigate estimators of the moments of the error distribution and their properties, propose a frequency domain estimator for the integrated volatility of the underlying stochastic process, and show that it achieves the optimal convergence rate. Simulations and a real data analysis validate our analysis.

Submitted 20 January, 2018; originally announced January 2018.

Journal ref: Biometrika 2018, Vol. 105, No. 2, 353-369

arXiv:1601.02739

Nonparametric covariate-adjusted regression

Authors: Aurore Delaigle, Peter Hall, Wen-Xin Zhou

Abstract: We consider nonparametric estimation of a regression curve when the data are observed with multiplicative distortion which depends on an observed confounding variable. We suggest several estimators, ranging from a relatively simple one that relies on restrictive assumptions usually made in the literature, to a sophisticated piecewise approach that involves reconstructing a smooth curve from an estimator of a constant multiple of its absolute value, and which can be applied in much more general scenarios. We show that, although our nonparametric estimators are constructed from predictors of the unobserved undistorted data, they have the same first order asymptotic properties as the standard estimators that could be computed if the undistorted data were available. We illustrate the good numerical performance of our methods on both simulated and real datasets.

Submitted 12 January, 2016; originally announced January 2016.

Comments: 32 pages, 4 figures

arXiv:1312.5082

doi 10.1214/13-AOS1158

Unexpected properties of bandwidth choice when smoothing discrete data for constructing a functional data classifier

Authors: Raymond J. Carroll, Aurore Delaigle, Peter Hall

Abstract: The data functions that are studied in the course of functional data analysis are assembled from discrete data, and the level of smoothing that is used is generally that which is appropriate for accurate approximation of the conceptually smooth functions that were not actually observed. Existing literature shows that this approach is effective, and even optimal, when using functional data methods for prediction or hypothesis testing. However, in the present paper we show that this approach is not effective in classification problems. There a useful rule of thumb is that undersmoothing is often desirable, but there are several surprising qualifications to that approach. First, the effect of smoothing the training data can be more significant than that of smoothing the new data set to be classified; second, undersmoothing is not always the right approach, and in fact in some cases using a relatively large bandwidth can be more effective; and third, these perverse results are the consequence of very unusual properties of error rates, expressed as functions of smoothing parameters. For example, the orders of magnitude of optimal smoothing parameter choices depend on the signs and sizes of terms in an expansion of error rate, and those signs and sizes can vary dramatically from one setting to another, even for the same classifier.

Submitted 18 December, 2013; originally announced December 2013.

Comments: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/13-AOS1158

Report number: IMS-AOS-AOS1158

Journal ref: Annals of Statistics 2013, Vol. 41, No. 6, 2739-2767

arXiv:1302.2635

doi 10.1088/0004-637X/766/2/99

Reanalysis of F-statistic gravitational-wave searches with the higher criticism statistic

Authors: M. F. Bennett, A. Melatos, A. Delaigle, P. Hall

Abstract: We propose a new method of gravitational wave detection using a modified form of higher criticism, a statistical technique introduced by Donoho & Jin (2004). Higher criticism is designed to detect a group of sparse, weak sources, none of which are strong enough to be reliably estimated or detected individually. We apply higher criticism as a second-pass method to synthetic F-statistic and C-statistic data for a monochromatic periodic source in a binary system and quantify the improvement relative to the first-pass methods. We find that higher criticism on C-statistic data is more sensitive by ~6% than the C-statistic alone under optimal conditions (i.e. binary orbit known exactly) and the relative advantage increases as the error in the orbital parameters increases. Higher criticism is robust even when the source is not monochromatic (e.g. phase wandering in an accreting system). Applying higher criticism to a phase-wandering source over multiple time intervals gives a >30% increase in detectability with few assumptions about the frequency evolution. By contrast, in all-sky searches for unknown periodic sources, which are dominated by the brightest source, second-pass higher criticism does not provide any benefits over a first pass search.

Submitted 11 February, 2013; originally announced February 2013.

Comments: 28 pages, 9 figures, accepted for publication in ApJ

arXiv:1205.6367

doi 10.1214/11-AOS958

Methodology and theory for partial least squares applied to functional data

Authors: Aurore Delaigle, Peter Hall

Abstract: The partial least squares procedure was originally developed to estimate the slope parameter in multivariate parametric models. More recently it has gained popularity in the functional data literature. There, the partial least squares estimator of slope is either used to construct linear predictive models, or as a tool to project the data onto a one-dimensional quantity that is employed for further statistical analysis. Although the partial least squares approach is often viewed as an attractive alternative to projections onto the principal component basis, its properties are less well known than those of the latter, mainly because of its iterative nature. We develop an explicit formulation of partial least squares for functional data, which leads to insightful results and motivates new theory, demonstrating consistency and establishing convergence rates.

Submitted 29 May, 2012; originally announced May 2012.

Comments: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/11-AOS958

Report number: IMS-AOS-AOS958

Journal ref: Annals of Statistics 2012, Vol. 40, No. 1, 322-352

arXiv:1205.6102

doi 10.1214/11-AOS952

Nonparametric regression with homogeneous group testing data

Authors: Aurore Delaigle, Peter Hall

Abstract: We introduce new nonparametric predictors for homogeneous pooled data in the context of group testing for rare abnormalities and show that they achieve optimal rates of convergence. In particular, when the level of pooling is moderate, then despite the cost savings, the method enjoys the same convergence rate as in the case of no pooling. In the setting of "over-pooling" the convergence rate differs from that of an optimal estimator by no more than a logarithmic factor. Our approach improves on the random-pooling nonparametric predictor, which is currently the only nonparametric method available, unless there is no pooling, in which case the two approaches are identical.

Submitted 28 May, 2012; originally announced May 2012.

Comments: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/11-AOS952

Report number: IMS-AOS-AOS952

Journal ref: Annals of Statistics 2012, Vol. 40, No. 1, 131-158

arXiv:1003.0315

Kernel methods and minimum contrast estimators for empirical deconvolution

Authors: Aurore Delaigle, Peter Hall

Abstract: We survey classical kernel methods for providing nonparametric solutions to problems involving measurement error. In particular we outline kernel-based methodology in this setting, and discuss its basic properties. Then we point to close connections that exist between kernel methods and much newer approaches based on minimum contrast techniques. The connections are through use of the sinc kernel for kernel-based inference. This "infinite order" kernel is not often used explicitly for kernel-based deconvolution, although it has received attention in more conventional problems where measurement error is not an issue. We show that in a comparison between kernel methods for density deconvolution, and their counterparts based on minimum contrast, the two approaches give identical results on a grid which becomes increasingly fine as the bandwidth decreases. In consequence, the main numerical differences between these two techniques are arguably the result of different approaches to choosing smoothing parameters.

Submitted 1 March, 2010; originally announced March 2010.

Comments: To appear in: Bingham, N. H., and Goldie, C. M. (eds), Probability and Mathematical Genetics: Papers in Honour of Sir John Kingman. London Math. Soc. Lecture Note Ser. Cambridge: Cambridge Univ. Press, 2010

arXiv:1002.4931

doi 10.1214/09-AOS741

Defining probability density for a distribution of random functions

Authors: Aurore Delaigle, Peter Hall

Abstract: The notion of probability density for a random function is not as straightforward as in finite-dimensional cases. While a probability density function generally does not exist for functional data, we show that it is possible to develop the notion of density when functional data are considered in the space determined by the eigenfunctions of principal component analysis. This leads to a transparent and meaningful surrogate for density defined in terms of the average value of the logarithms of the densities of the distributions of principal components for a given dimension. This density approximation is estimable readily from data. It accurately represents, in a monotone way, key features of small-ball approximations to density. Our results on estimators of the densities of principal component scores are also of independent interest; they reveal interesting shape differences that have not previously been considered. The statistical implications of these results and properties are identified and discussed, and practical ramifications are illustrated in numerical work.

Submitted 1 March, 2010; v1 submitted 26 February, 2010; originally announced February 2010.

Comments: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/09-AOS741

Report number: IMS-AOS-AOS741

MSC Class: 62G05 (Primary) 62G07 (Secondary)

Journal ref: Annals of Statistics 2010, Vol. 38, No. 2, 1171-1193

arXiv:1001.3886

Robustness and accuracy of methods for high dimensional data analysis based on Student's t statistic

Authors: Aurore Delaigle, Peter Hall, Jiashun Jin

Abstract: Student's $t$ statistic is finding applications today that were never envisaged when it was introduced more than a century ago. Many of these applications rely on properties, for example robustness against heavy tailed sampling distributions, that were not explicitly considered until relatively recently. In this paper we explore these features of the $t$ statistic in the context of its application to very high dimensional problems, including feature selection and ranking, highly multiple hypothesis testing, and sparse, high dimensional signal detection. Robustness properties of the $t$-ratio are highlighted, and it is established that those properties are preserved under applications of the bootstrap. In particular, bootstrap methods correct for skewness, and therefore lead to second-order accuracy, even in the extreme tails. Indeed, it is shown that the bootstrap, and also the more popular but less accurate $t$-distribution and normal approximations, are more effective in the tails than towards the middle of the distribution. These properties motivate new methods, for example bootstrap-based techniques for signal detection, that confine attention to the significant tail of a statistic.

Submitted 21 January, 2010; originally announced January 2010.

Comments: 37 pages, 5 figures

arXiv:0902.3319

Weighted least squares methods for prediction in the functional data linear model

Authors: Aurore Delaigle, Peter Hall, Tatiyana V. Apanasovich

Abstract: The problem of prediction in functional linear regression is conventionally addressed by reducing dimension via the standard principal component basis. In this paper we show that an alternative basis chosen through weighted least-squares, or weighted least-squares itself, can be more effective when the experimental errors are heteroscedastic. We give a concise theoretical result which demonstrates the effectiveness of this approach, even when the model for the variance is inaccurate, and we explore the numerical properties of the method. We show too that the advantages of the suggested adaptive techniques are not found only in low-dimensional aspects of the problem; rather, they accrue almost equally among all dimensions.

Submitted 19 February, 2009; originally announced February 2009.

Comments: Submitted to the Electronic Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-EJS-EJS_2009_379

arXiv:0805.2216

doi 10.3150/08-BEJ121

Density estimation with heteroscedastic error

Authors: Aurore Delaigle, Alexander Meister

Abstract: It is common, in deconvolution problems, to assume that the measurement errors are identically distributed. In many real-life applications, however, this condition is not satisfied and the deconvolution estimators developed for homoscedastic errors become inconsistent. In this paper, we introduce a kernel estimator of a density in the case of heteroscedastic contamination. We establish consistency of the estimator and show that it achieves optimal rates of convergence under quite general conditions. We study the limits of application of the procedure in some extreme situations, where we show that, in some cases, our estimator is consistent, even when the scaling parameter of the error is unbounded. We suggest a modified estimator for the problem where the distribution of the errors is unknown, but replicated observations are available. Finally, an adaptive procedure for selecting the smoothing parameter is proposed and its finite-sample properties are investigated on simulated examples.

Submitted 15 May, 2008; originally announced May 2008.

Comments: Published in Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm) at http://dx.doi.org/10.3150/08-BEJ121

Report number: IMS-BEJ-BEJ121

Journal ref: Bernoulli 2008, Vol. 14, No. 2, 562-579

arXiv:0804.0713

doi 10.1214/009053607000000884

On deconvolution with repeated measurements

Authors: Aurore Delaigle, Peter Hall, Alexander Meister

Abstract: In a large class of statistical inverse problems it is necessary to suppose that the transformation that is inverted is known. Although, in many applications, it is unrealistic to make this assumption, the problem is often insoluble without it. However, if additional data are available, then it is possible to estimate consistently the unknown error density. Data are seldom available directly on the transformation, but repeated, or replicated, measurements increasingly are becoming available. Such data consist of "intrinsic" values that are measured several times, with errors that are generally independent. Working in this setting we treat the nonparametric deconvolution problems of density estimation with observation errors, and regression with errors in variables. We show that, even if the number of repeated measurements is quite small, it is possible for modified kernel estimators to achieve the same level of performance they would if the error distribution were known. Indeed, density and regression estimators can be constructed from replicated data so that they have the same first-order properties as conventional estimators in the known-error case, without any replication, but with sample size equal to the sum of the numbers of replicates. Practical methods for constructing estimators with these properties are suggested, involving empirical rules for smoothing-parameter choice.

Submitted 4 April, 2008; originally announced April 2008.

Comments: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/009053607000000884

Report number: IMS-AOS-AOS0326

MSC Class: 62G07; 62G08 (Primary) 65R32 (Secondary)

Journal ref: Annals of Statistics 2008, Vol. 36, No. 2, 665-685

arXiv:0803.3017

doi 10.1214/009053607000000497

Accelerated convergence for nonparametric regression with coarsened predictors

Authors: Aurore Delaigle, Peter Hall, Hans-Georg Müller

Abstract: We consider nonparametric estimation of a regression function for a situation where precisely measured predictors are used to estimate the regression curve for coarsened, that is, less precise or contaminated predictors. Specifically, while one has available a sample $(W_1,Y_1),\ldots,(W_n,Y_n)$ of independent and identically distributed data, representing observations with precisely measured predictors, where $\mathrm{E}(Y_i|W_i)=g(W_i)$, instead of the smooth regression function $g$, the target of interest is another smooth regression function $m$ that pertains to predictors $X_i$ that are noisy versions of the $W_i$. Our target is then the regression function $m(x)=\mathrm{E}(Y|X=x)$, where $X$ is a contaminated version of $W$, that is, $X=W+\delta$. It is assumed that either the density of the errors is known, or replicated data are available resembling, but not necessarily the same as, the variables $X$. In either case, and under suitable conditions, we obtain $\sqrt{n}$-rates of convergence of the proposed estimator and its derivatives, and establish a functional limit theorem. Weak convergence to a Gaussian limit process implies pointwise and uniform confidence intervals and $\sqrt{n}$-consistent estimators of extrema and zeros of $m$. It is shown that these results are preserved under more general models in which $X$ is determined by an explanatory variable. Finite sample performance is investigated in simulations and illustrated by a real data example.

Submitted 20 March, 2008; originally announced March 2008.

Comments: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/009053607000000497

Report number: IMS-AOS-AOS0282

MSC Class: 62G08; 62G05 (Primary)

Journal ref: Annals of Statistics 2007, Vol. 35, No. 6, 2639-2653

Showing 1–15 of 15 results for author: Delaigle, A