
Beta-trees: Multivariate histograms with confidence statements

Abstract: Multivariate histograms are difficult to construct due to the curse of dimensionality. Motivated by $k$-d trees in computer science, we show how to construct an efficient data-adaptive partition of Euclidean space that possesses the following two properties: With high confidence the distribution from which the data are generated is close to uniform on each rectangle of the partition; and despite the data-dependent construction we can give guaranteed finite sample simultaneous confidence intervals for the probabilities (and hence for the average densities) of each rectangle in the partition. This partition will automatically adapt to the sizes of the regions where the distribution is close to uniform. The methodology produces confidence intervals whose widths depend only on the probability content of the rectangles and not on the dimensionality of the space, thus avoiding the curse of dimensionality. Moreover, the widths essentially match the optimal widths in the univariate setting. The simultaneous validity of the confidence intervals allows this construction, which we call {\sl Beta-trees}, to be used for various data-analytic purposes. We illustrate this by using Beta-trees for visualizing data and for multivariate mode-hunting.

Submitted 2 August, 2023; originally announced August 2023.

MSC Class: 62G15

arXiv:2109.06371 [pdf, ps, other]

Tail bounds for empirically standardized sums

Authors: Guenther Walther

Abstract: Exponential tail bounds for sums play an important role in statistics, but the example of the $t$-statistic shows that the exponential tail decay may be lost when population parameters need to be estimated from the data. However, it turns out that if Studentizing is accompanied by estimating the location parameter in a suitable way, then the $t$-statistic regains the exponential tail behavior. Motivated by this example, the paper analyzes other ways of empirically standardizing sums and establishes tail bounds that are sub-Gaussian or even closer to normal for the following settings: Standardization with Studentized contrasts for normal observations, standardization with the log likelihood ratio statistic for observations from an exponential family, and standardization via self-normalization for observations from a symmetric distribution with unknown center of symmetry. The latter standardization gives rise to a novel scan statistic for heteroscedastic data whose asymptotic power is analyzed in the case where the observations have a log-concave distribution.

Submitted 19 March, 2022; v1 submitted 13 September, 2021; originally announced September 2021.

MSC Class: 62G32; 60F10

arXiv:2107.08296 [pdf, ps, other]

doi 10.1007/978-1-4614-8414-1_59-1

Calibrating the scan statistic with size-dependent critical values: heuristics, methodology and computation

Authors: Guenther Walther

Abstract: It is known that the scan statistic with variable window size favors the detection of signals with small spatial extent and there is a corresponding loss of power for signals with large spatial extent. Recent results have shown that this loss is not inevitable: Using critical values that depend on the size of the window allows optimal detection for all signal sizes simultaneously, so there is no substantial price to pay for not knowing the correct window size and for scanning with a variable window size. This paper gives a review of the heuristics and methodology for such size-dependent critical values, their applications to various settings including the multivariate case, and recent results about fast algorithms for computing scan statistics.

Submitted 14 February, 2022; v1 submitted 17 July, 2021; originally announced July 2021.

Journal ref: In: Glaz, J, Koutras M.V. (eds) Handbook of Scan Statistics. 2022. Springer, New York, NY

arXiv:2011.03668 [pdf, other]

Confidence bands for a log-concave density

Authors: Guenther Walther, Alnur Ali, Xinyue Shen, Stephen Boyd

Abstract: We present a new approach for inference about a log-concave distribution: Instead of using the method of maximum likelihood, we propose to incorporate the log-concavity constraint in an appropriate nonparametric confidence set for the cdf $F$. This approach has the advantage that it automatically provides a measure of statistical uncertainty and it thus overcomes a marked limitation of the maximum likelihood estimate. In particular, we show how to construct confidence bands for the density that have a finite sample guaranteed confidence level. The nonparametric confidence set for $F$ which we introduce here has attractive computational and statistical properties: It allows modern tools from optimization to be brought to bear on this problem via difference of convex programming, and it results in optimal statistical inference. We show that the width of the resulting confidence bands converges at nearly the parametric $n^{-\frac{1}{2}}$ rate when the log density is $k$-affine.

Submitted 6 May, 2022; v1 submitted 6 November, 2020; originally announced November 2020.

Comments: Added a discussion section, minor changes

arXiv:1907.00085 [pdf, other]

Large-scale inference with block structure

Authors: Jiyao Kou, Guenther Walther

Abstract: The detection of weak and rare effects in large amounts of data arises in a number of modern data analysis problems. Known results show that in this situation the potential of statistical inference is severely limited by the large-scale multiple testing that is inherent in these problems. Here we show that fundamentally more powerful statistical inference is possible when there is some structure in the signal that can be exploited, e.g. if the signal is clustered in many small blocks, as is the case in some relevant applications. We derive the detection boundary in such a situation where we allow both the number of blocks and the block length to grow polynomially with sample size. We derive these results both for the univariate and the multivariate settings as well as for the problem of detecting clusters in a network. These results recover as special cases the sparse mixture detection problem (Donoho and Jin, 2004) where there is no structure in the signal, as well as the scan problem (Chan and Walther, 2013) where the signal comprises a single interval. We develop methodology that allows optimal adaptive detection in the general setting, thus exploiting the structure if it is present without incurring a relevant penalty in the case where there is no structure. The advantage of this methodology can be considerable, as in the case of no structure the means need to increase at the rate $\sqrt{\log n}$ to ensure detection, while the presence of structure allows detection even if the means $decrease$ at a polynomial rate.

Submitted 7 May, 2022; v1 submitted 28 June, 2019; originally announced July 2019.

MSC Class: 62G10; 62G32

arXiv:1612.07216 [pdf, other]

doi 10.1093/biomet/asz081

The Essential Histogram

Authors: Housen Li, Axel Munk, Hannes Sieling, Guenther Walther

Abstract: The histogram is widely used as a simple, exploratory display of data, but it is usually not clear how to choose the number and size of bins. We construct a confidence set of distribution functions that optimally address the two main tasks of the histogram: estimating probabilities and detecting features such as increases and modes in the distribution. We define the essential histogram as the histogram in the confidence set with the fewest bins. Thus the essential histogram is the simplest visualization of the data that optimally achieves the main tasks of the histogram. The only assumption we make is that the data are independent and identically distributed. We provide a fast algorithm for the essential histogram, and illustrate our methodology with examples. An R-package is available on CRAN.

Submitted 28 May, 2019; v1 submitted 21 December, 2016; originally announced December 2016.

Comments: Extension to discrete data is included. An R-package "essHist" is available from https://CRAN.R-project.org/package=essHist

MSC Class: 62G10; 62H30

Journal ref: Biometrika, 2020

arXiv:1503.06388 [pdf, other]

Adaptive Concentration of Regression Trees, with Application to Random Forests

Authors: Stefan Wager, Guenther Walther

Abstract: We study the convergence of the predictive surface of regression trees and forests. To support our analysis we introduce a notion of adaptive concentration for regression trees. This approach breaks tree training into a model selection phase in which we pick the tree splits, followed by a model fitting phase where we find the best regression model consistent with these splits. We then show that the fitted regression tree concentrates around the optimal predictor with the same splits: as d and n get large, the discrepancy is with high probability bounded on the order of sqrt(log(d) log(n)/k) uniformly over the whole regression surface, where d is the dimension of the feature space, n is the number of training examples, and k is the minimum leaf size for each tree. We also provide rate-matching lower bounds for this adaptive concentration statement. From a practical perspective, our result enables us to prove consistency results for adaptively grown forests in high dimensions, and to carry out valid post-selection inference in the sense of Berk et al. [2013] for subgroups defined by tree leaves.

Submitted 30 April, 2016; v1 submitted 22 March, 2015; originally announced March 2015.

arXiv:1410.3853 [pdf, other]

Peer assessment enhances student learning

Authors: Dennis L. Sun, Naftali Harris, Guenther Walther, Michael Baiocchi

Abstract: Feedback has a powerful influence on learning, but it is also expensive to provide. In large classes, it may even be impossible for instructors to provide individualized feedback. Peer assessment has received attention lately as a way of providing personalized feedback that scales to large classes. Besides these obvious benefits, some researchers have also conjectured that students learn by peer assessing, although no studies have ever conclusively demonstrated this effect. By conducting a randomized controlled trial in an introductory statistics class, we provide evidence that peer assessment causes significant gains in student achievement. The strength of our conclusions depends critically on the careful design of the experiment, which was made possible by a web-based platform that we developed. Hence, our study is also a proof of concept of the high-quality experiments that are possible with online tools.

Submitted 14 October, 2014; originally announced October 2014.

arXiv:1211.2859 [pdf, ps, other]

Optimal detection of a jump in the intensity of a Poisson process or in a density with likelihood ratio statistics

Authors: Camilo Rivera, Guenther Walther

Abstract: We consider the problem of detecting a `bump' in the intensity of a Poisson process or in a density. We analyze two types of likelihood ratio based statistics which allow for exact finite sample inference and asymptotically optimal detection: The maximum of the penalized square root of log likelihood ratios (`penalized scan') evaluated over a certain sparse set of intervals, and a certain average of log likelihood ratios (`condensed average likelihood ratio'). We show that penalizing the {\sl square root} of the log likelihood ratio - rather than the log likelihood ratio itself - leads to a simple penalty term that yields optimal power. The thus derived penalty may prove useful for other problems that involve a Brownian bridge in the limit. The second key tool is an approximating set of intervals that is rich enough to allow for optimal detection but which is also sparse enough to allow justifying the validity of the penalization scheme simply via the union bound. This results in a considerable simplification in the theoretical treatment compared to the usual approach for this type of penalization technique, which requires establishing an exponential inequality for the variation of the test statistic. Another advantage of using the sparse approximating set is that it allows fast computation in nearly linear time. We present a simulation study that illustrates the superior performance of the penalized scan and of the condensed average likelihood ratio compared to the standard scan statistic.

Submitted 25 February, 2014; v1 submitted 12 November, 2012; originally announced November 2012.

Journal ref: Scandinavian Journal of Statistics 40 (2013), 752-769

arXiv:1111.0328 [pdf, other]

The Average Likelihood Ratio for Large-scale Multiple Testing and Detecting Sparse Mixtures

Authors: Guenther Walther

Abstract: Large-scale multiple testing problems require the simultaneous assessment of many p-values. This paper compares several methods to assess the evidence in multiple binomial counts of p-values: the maximum of the binomial counts after standardization (the `higher-criticism statistic'), the maximum of the binomial counts after a log-likelihood ratio transformation (the `Berk-Jones statistic'), and a newly introduced average of the binomial counts after a likelihood ratio transformation. Simulations show that the higher criticism statistic has a superior performance to the Berk-Jones statistic in the case of very sparse alternatives (sparsity coefficient $β\gtrapprox 0.75$), while the situation is reversed for $β\lessapprox 0.75$. The average likelihood ratio is found to combine the favorable performance of higher criticism in the very sparse case with that of the Berk-Jones statistic in the less sparse case and thus appears to dominate both statistics. Some asymptotic optimality theory is considered but found to set in too slowly to illuminate the above findings, at least for sample sizes up to one million. In contrast, asymptotic approximations to the critical values of the Berk-Jones statistic that have been developed by Wellner and Koltchinskii (2003) and Jager and Wellner (2007) are found to give surprisingly accurate approximations even for quite small sample sizes.

Submitted 1 November, 2011; originally announced November 2011.

Journal ref: From Probability to Statistics and Back: High-Dimensional Models and Processes - A Festschrift in Honor of Jon A. Wellner. M. Bannerjee, F. Bunea, J. Huang, V. Koltchinskii, M.H. Maathuis (eds.), Inst. Math. Statistics (2013), 317-326

arXiv:1107.4344 [pdf, other]

Detection with the scan and the average likelihood ratio

Authors: Hock Peng Chan, Guenther Walther

Abstract: We investigate the performance of the scan (maximum likelihood ratio statistic) and of the average likelihood ratio statistic in the problem of detecting a deterministic signal with unknown spatial extent in the prototypical univariate sampled data model with white Gaussian noise. Our results show that the scan statistic, a popular tool for detection problems, is optimal only for the detection of signals with the smallest spatial extent. For signals with larger spatial extent the scan is suboptimal, and the power loss can be considerable. In contrast, the average likelihood ratio statistic is optimal for the detection of signals on all scales except the smallest ones, where its performance is only slightly suboptimal. We give rigorous mathematical statements of these results as well as heuristic explanations which suggest that the essence of these findings applies to detection problems quite generally, such as the detection of clusters in models involving densities or intensities or the detection of multivariate signals. We present a modification of the average likelihood ratio that yields optimal detection of signals with arbitrary spatial extent and which has the additional benefit of allowing for a fast computation of the statistic. In contrast, optimal detection with the scan seems to require the use of scale-dependent critical values.

Submitted 25 February, 2014; v1 submitted 21 July, 2011; originally announced July 2011.

Journal ref: Statistica Sinica 23 (2013), 409-428

arXiv:1010.0305 [pdf, ps, other]

doi 10.1214/09-STS303

Inference and Modeling with Log-concave Distributions

Authors: Guenther Walther

Abstract: Log-concave distributions are an attractive choice for modeling and inference, for several reasons: The class of log-concave distributions contains most of the commonly used parametric distributions and thus is a rich and flexible nonparametric class of distributions. Further, the MLE exists and can be computed with readily available algorithms. Thus, no tuning parameter, such as a bandwidth, is necessary for estimation. Due to these attractive properties, there has been considerable recent research activity concerning the theory and applications of log-concave distributions. This article gives a review of these results.

Submitted 2 October, 2010; originally announced October 2010.

Comments: Published at http://dx.doi.org/10.1214/09-STS303 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-STS-STS303

Journal ref: Statistical Science 2009, Vol. 24, No. 3, 319-327

Showing 1–12 of 12 results for author: Walther, G