-
TurboRVB: a many-body toolkit for {\it ab initio} electronic simulations by quantum Monte Carlo
Authors:
Kousuke Nakano,
Claudio Attaccalite,
Matteo Barborini,
Luca Capriotti,
Michele Casula,
Emanuele Coccia,
Mario Dagrada,
Claudio Genovese,
Ye Luo,
Guglielmo Mazzola,
Andrea Zen,
Sandro Sorella
Abstract:
TurboRVB is a computational package for {\it ab initio} Quantum Monte Carlo (QMC) simulations of both molecular and bulk electronic systems. The code implements two types of well-established QMC algorithms: Variational Monte Carlo (VMC), and Diffusion Monte Carlo in its robust and efficient lattice regularized variant. A key feature of the code is the possibility of using strongly correlated many-body wave functions. The electronic wave function (WF) is obtained by applying a Jastrow factor, which takes into account dynamical correlations, to the most general mean-field ground state, written either as an antisymmetrized geminal product with spin-singlet pairing, or as a Pfaffian, including both singlet and triplet correlations. This wave function can be viewed as an efficient implementation of the so-called resonating valence bond (RVB) ansatz, first proposed by L. Pauling and P. W. Anderson in quantum chemistry and condensed matter physics, respectively. The RVB ansatz implemented in TurboRVB has a large variational freedom, including the Jastrow correlated Slater determinant as its simplest, but nontrivial, case. Moreover, it has the remarkable advantage of retaining an affordable computational cost, proportional to that of evaluating a single Slater determinant. The code implements the adjoint algorithmic differentiation that enables a very efficient evaluation of energy derivatives, comprising the ionic forces. Thus, one can perform structural optimizations and molecular dynamics in the canonical NVT ensemble at the VMC level. For the electronic part, a full WF optimization is made possible thanks to state-of-the-art stochastic algorithms for energy minimization. The code has been efficiently parallelized by using a hybrid MPI-OpenMP protocol, which is also an ideal environment for exploiting the computational power of modern GPU accelerators.
Submitted 1 June, 2020; v1 submitted 18 February, 2020;
originally announced February 2020.
-
General correlated geminal ansatz for electronic structure calculations: exploiting Pfaffians in place of determinants
Authors:
Claudio Genovese,
Tomonori Shirakawa,
Kousuke Nakano,
Sandro Sorella
Abstract:
We propose here a single Pfaffian correlated variational ansatz that dramatically improves the accuracy with respect to the single determinant one, while remaining at a similar computational cost. A much larger correlation energy is indeed determined by the most general two-electron pairing function, including both singlet and triplet channels, combined with a many-body Jastrow factor, including all possible spin-spin, spin-density, and density-density terms. The main technical ingredient to exploit this accuracy is the use of the Pfaffian for antisymmetrizing a highly correlated pairing function, thus recovering the Fermi statistics for electrons with an affordable computational cost. Moreover, the application of the Diffusion Monte Carlo, within the fixed node approximation, allows us to obtain very accurate binding energies for the first preliminary calculations reported in this study: C$_2$, N$_2$, O$_2$, and the benzene molecule. This is promising and remarkable, considering that they represent extremely difficult molecules even for computationally demanding multi-determinant approaches, and therefore opens the way for realistic and accurate electronic simulations with an algorithm scaling at most as the fourth power of the number of electrons.
Submitted 26 July, 2020; v1 submitted 9 February, 2020;
originally announced February 2020.
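The Pfaffian that antisymmetrizes the pairing function obeys $\mathrm{Pf}(A)^2 = \det(A)$ for any antisymmetric matrix $A$ of even size, which is why it can replace a determinant. The following is a minimal sketch for checking that identity numerically; it is not taken from the paper or from any QMC code. The function name `pfaffian` is hypothetical, and the recursive expansion along the first row is chosen for readability only (its cost grows factorially, unlike the polynomial algorithms a production code would use).

```python
import numpy as np

def pfaffian(a):
    """Pfaffian of an antisymmetric matrix by expansion along the first row.

    Illustrative only: the cost grows factorially with the matrix size.
    """
    a = np.asarray(a, dtype=float)
    n = a.shape[0]
    if n % 2 == 1:
        return 0.0          # odd-sized antisymmetric matrices have Pf = 0
    if n == 0:
        return 1.0
    if n == 2:
        return float(a[0, 1])
    total = 0.0
    for j in range(1, n):
        # Delete rows and columns 0 and j, then recurse on what is left.
        keep = [k for k in range(n) if k not in (0, j)]
        minor = a[np.ix_(keep, keep)]
        total += (-1) ** (j - 1) * a[0, j] * pfaffian(minor)
    return total

# Build a random 6x6 antisymmetric matrix and compare Pf^2 with det.
rng = np.random.default_rng(0)
m = rng.normal(size=(6, 6))
a6 = m - m.T
pf6 = pfaffian(a6)
det6 = np.linalg.det(a6)
```

For a 4x4 matrix the expansion reduces to the familiar closed form $a_{12}a_{34} - a_{13}a_{24} + a_{14}a_{23}$.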
-
Expanding the scope of statistical computing: Training statisticians to be software engineers
Authors:
Alex Reinhart,
Christopher R. Genovese
Abstract:
Traditionally, statistical computing courses have taught the syntax of a particular programming language or specific statistical computation methods. Since the publication of Nolan and Temple Lang (2010), we have seen a greater emphasis on data wrangling, reproducible research, and visualization. This shift better prepares students for careers working with complex datasets and producing analyses for multiple audiences. But, we argue, statisticians are now often called upon to develop statistical software, not just analyses, such as R packages implementing new analysis methods or machine learning systems integrated into commercial products. This demands different skills.
We describe a graduate course that we developed to meet this need by focusing on four themes: programming practices; software design; important algorithms and data structures; and essential tools and methods. Through code review and revision, and a semester-long software project, students practice all the skills of software engineering. The course allows students to expand their understanding of computing as applied to statistical problems while building expertise in the kind of software development that is increasingly the province of the working statistician. We see this as a model for the future evolution of the computing curriculum in statistics and data science.
Submitted 28 October, 2020; v1 submitted 30 December, 2019;
originally announced December 2019.
-
The nature of the chemical bond in the dicarbon molecule
Authors:
Claudio Genovese,
Sandro Sorella
Abstract:
The molecular dissociation energy has often been explained and discussed in terms of singlet bonds, formed by bound pairs of valence electrons. In this work we use a highly correlated resonating valence bond ansatz, providing a consistent paradigm for the chemical bond, where spin fluctuations are shown to play a crucial role. Spin fluctuations are known to be important in magnetic systems and correspond to the zero point motion of the spin waves emerging from a magnetic broken symmetry state. Recently, in order to explain the excitation spectrum of the carbon dimer, an unusual quadruple bond has been proposed. Within our ansatz, a satisfactory description of the carbon dimer is determined by the magnetic interaction of two carbon atoms with antiferromagnetically ordered S = 1 magnetic moments. This is a first step that, thanks to the highly scalable and efficient quantum Monte Carlo technique, may open the way for understanding challenging complex systems containing atoms with large spins (e.g. transition metals).
Submitted 25 September, 2020; v1 submitted 21 November, 2019;
originally announced November 2019.
-
Ground-state properties of the hydrogen chain: insulator-to-metal transition, dimerization, and magnetic phases
Authors:
Mario Motta,
Claudio Genovese,
Fengjie Ma,
Zhi-Hao Cui,
Randy Sawaya,
Garnet Kin-Lic Chan,
Natalia Chepiga,
Phillip Helms,
Carlos Jimenez-Hoyos,
Andrew J. Millis,
Ushnish Ray,
Enrico Ronca,
Hao Shi,
Sandro Sorella,
Edwin M. Stoudenmire,
Steven R. White,
Shiwei Zhang
Abstract:
Accurate and predictive computations of the quantum-mechanical behavior of many interacting electrons in realistic atomic environments are critical for the theoretical design of materials with desired properties, and require solving the grand-challenge problem of the many-electron Schrödinger equation. An infinite chain of equispaced hydrogen atoms is perhaps the simplest realistic model for a bulk material, embodying several central themes of modern condensed matter physics and chemistry, while retaining a connection to the paradigmatic Hubbard model. Here we report a combined application of cutting-edge computational methods to determine the properties of the hydrogen chain in its quantum-mechanical ground state. Varying the separation between the nuclei leads to a rich phase diagram, including a Mott phase with quasi long-range antiferromagnetic order, electron density dimerization with power-law correlations, an insulator-to-metal transition, and an intricate set of intertwined magnetic orders.
Submitted 13 July, 2020; v1 submitted 4 November, 2019;
originally announced November 2019.
-
Assessing the Accuracy of the Jastrow Antisymmetrized Geminal Power in the H4 Model System
Authors:
Claudio Genovese,
Antonella Meninno,
Sandro Sorella
Abstract:
We report a quantum Monte Carlo (QMC) study on a very simple but nevertheless very instructive model system of four hydrogen atoms, recently proposed in Ref. 1. We find that the Jastrow correlated Antisymmetrized Geminal Power (JAGP) is able to recover most of the correlation energy even when the geometry is symmetric and the hydrogens lie on the edges of a perfect square. Under such conditions the diradical character of the molecule ground state prevents a single determinant ansatz from achieving an acceptable accuracy, whereas the JAGP performs very well for all geometries. Remarkably, this is obtained with a similar computational effort. Moreover, we find that the Jastrow factor is fundamental in promoting the correct resonances among several configurations in the JAGP, which cannot show up in the pure Antisymmetrized Geminal Power (AGP). We also show the extremely fast convergence of this approach in the extension of the basis set. Remarkably, only the simultaneous optimization of the Jastrow and the AGP part of our variational ansatz is able to recover an almost perfect nodal surface, therefore yielding state-of-the-art energies, almost converged in the complete basis set (CBS) limit, when the so-called Diffusion Monte Carlo is applied.
Submitted 29 January, 2019;
originally announced May 2019.
-
Finding Singular Features
Authors:
Christopher Genovese,
Marco Perone-Pacifico,
Isabella Verdinelli,
Larry Wasserman
Abstract:
We present a method for finding high-density, low-dimensional structures in noisy point clouds. These structures are sets with zero Lebesgue measure with respect to the $D$-dimensional ambient space and belong to a $d<D$ dimensional space. We call them "singular features." Hunting for singular features corresponds to finding unexpected or unknown structures hidden in point clouds belonging to $\mathbb{R}^D$. Our method outputs well-defined sets of dimensions $d<D$. Unlike spectral clustering, the method works well in the presence of noise. We show how to find singular features by first finding ridges in the estimated density, followed by a filtering step based on the eigenvalues of the Hessian of the density.
Submitted 1 June, 2016;
originally announced June 2016.
-
Nonparametric Clustering of Functional Data Using Pseudo-Densities
Authors:
Mattia Ciollaro,
Christopher R. Genovese,
Daren Wang
Abstract:
We study nonparametric clustering of smooth random curves on the basis of the L2 gradient flow associated to a pseudo-density functional and we show that the clustering is well-defined both at the population and at the sample level. We provide an algorithm to mark significant local modes, which are associated to informative sample clusters, and we derive its consistency properties. Our theory is developed under weak assumptions, which essentially reduce to the integrability of the random curves, and does not require projecting the random curves onto a finite-dimensional subspace. However, if the underlying probability distribution is supported on a finite-dimensional subspace, we show that the pseudo-density and the expectation of a kernel density estimator induce the same gradient flow, and therefore the same clustering. Although our theory is developed for smooth curves that belong to an infinite-dimensional functional space, we also provide consistent procedures that can be used with real data (discretized and noisy observations).
Submitted 28 January, 2016;
originally announced January 2016.
-
Cosmic Web Reconstruction through Density Ridges: Catalogue
Authors:
Yen-Chi Chen,
Shirley Ho,
Jon Brinkmann,
Peter E. Freeman,
Christopher R. Genovese,
Donald P. Schneider,
Larry Wasserman
Abstract:
We construct a catalogue for filaments using a novel approach called SCMS (subspace constrained mean shift; Ozertem & Erdogmus 2011; Chen et al. 2015). SCMS is a gradient-based method that detects filaments through density ridges (smooth curves tracing high-density regions). A great advantage of SCMS is its uncertainty measure, which allows an evaluation of the errors for the detected filaments. To detect filaments, we use data from the Sloan Digital Sky Survey, which consist of three galaxy samples: the NYU main galaxy sample (MGS), the LOWZ sample and the CMASS sample. Each of the three datasets covers a different redshift region, so that the combined sample allows detection of filaments up to z = 0.7. Our filament catalogue consists of a sequence of two-dimensional filament maps at different redshifts that provide several useful statistics on the evolution of the cosmic web. To construct the maps, we select spectroscopically confirmed galaxies within 0.050 < z < 0.700 and partition them into 130 bins. For each bin, we ignore the redshift, treating the galaxy observations as 2-D data, and detect filaments using SCMS. The filament catalogue consists of 130 individual 2-D filament maps, and each map comprises points on the detected filaments that describe the filamentary structures at a particular redshift. We also apply our filament catalogue to investigate galaxy luminosity and its relation with distance to filament. Using a volume-limited sample, we find strong evidence ($6.1\sigma$ to $12.3\sigma$) that galaxies close to filaments are generally brighter than those at significant distance from filaments.
Submitted 21 September, 2015;
originally announced September 2015.
-
Detecting Effects of Filaments on Galaxy Properties in the Sloan Digital Sky Survey III
Authors:
Yen-Chi Chen,
Shirley Ho,
Rachel Mandelbaum,
Neta A. Bahcall,
Joel R. Brownstein,
Peter E. Freeman,
Christopher R. Genovese,
Donald P. Schneider,
Larry Wasserman
Abstract:
We study the effects of filaments on galaxy properties in the Sloan Digital Sky Survey (SDSS) Data Release 12 using filaments from the `Cosmic Web Reconstruction' catalogue (Chen et al. 2016), a publicly available filament catalogue for SDSS. Since filaments are tracers of medium-to-high density regions, we expect that galaxy properties associated with the environment are dependent on the distance to the nearest filament. Our analysis demonstrates that a red galaxy or a high-mass galaxy tends to reside closer to filaments than a blue or low-mass galaxy. After adjusting for the effect of stellar mass, on average, early-forming galaxies or large galaxies have a shorter distance to filaments than late-forming galaxies or small galaxies. For the Main galaxy sample (MGS), all signals are very significant ($>6\sigma$). For the LOWZ and CMASS samples, the stellar mass and size are significant ($>2\sigma$). The filament effects we observe persist until $z = 0.7$ (the edge of the CMASS sample). Comparing our results to those using the galaxy distances from redMaPPer galaxy clusters as a reference, we find a similar result between filaments and clusters. Moreover, we find that the effect of clusters on the stellar mass of nearby galaxies depends on the galaxy's filamentary environment. Our findings illustrate the strong correlation of galaxy properties with proximity to density ridges, strongly supporting the claim that density ridges are good tracers of filaments.
Submitted 12 January, 2017; v1 submitted 21 September, 2015;
originally announced September 2015.
-
Investigating Galaxy-Filament Alignments in Hydrodynamic Simulations using Density Ridges
Authors:
Yen-Chi Chen,
Shirley Ho,
Ananth Tenneti,
Rachel Mandelbaum,
Rupert Croft,
Tiziana DiMatteo,
Peter E. Freeman,
Christopher R. Genovese,
Larry Wasserman
Abstract:
In this paper, we study the filamentary structures and the galaxy alignment along filaments at redshift $z=0.06$ in the MassiveBlack-II simulation, a state-of-the-art, high-resolution hydrodynamical cosmological simulation which includes stellar and AGN feedback in a volume of (100 Mpc$/h$)$^3$. The filaments are constructed using the subspace constrained mean shift (SCMS; Ozertem & Erdogmus 2011; Chen et al. 2015a). First, we show that reconstructed filaments using galaxies and reconstructed filaments using dark matter particles are similar to each other; over $50\%$ of the points on the galaxy filaments have a corresponding point on the dark matter filaments within distance $0.13$ Mpc$/h$ (and vice versa), and this distance is even smaller in high-density regions. Second, we observe the alignment of the major principal axis of a galaxy with respect to the orientation of its nearest filament and detect a $2.5$ Mpc$/h$ critical radius for the filament's influence on the alignment when the subhalo mass of this galaxy is between $10^9M_\odot/h$ and $10^{12}M_\odot/h$. Moreover, we find that the alignment signal increases significantly with the subhalo mass. Third, when a galaxy is close to filaments (less than $0.25$ Mpc$/h$), the galaxy alignment toward the nearest galaxy group depends on the galaxy subhalo mass. Finally, we find that galaxies close to filaments or groups tend to be rounder than those away from filaments or groups.
Submitted 17 August, 2015;
originally announced August 2015.
-
Statistical Inference using the Morse-Smale Complex
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Larry Wasserman
Abstract:
The Morse-Smale complex of a function $f$ decomposes the sample space into cells where $f$ is increasing or decreasing. When applied to nonparametric density estimation and regression, it provides a way to represent, visualize, and compare multivariate functions. In this paper, we present some statistical results on estimating Morse-Smale complexes. This allows us to derive new results for two existing methods: mode clustering and Morse-Smale regression. We also develop two new methods based on the Morse-Smale complex: a visualization technique for multivariate functions and a two-sample, multivariate hypothesis test.
Submitted 3 April, 2017; v1 submitted 29 June, 2015;
originally announced June 2015.
-
Optimal Ridge Detection using Coverage Risk
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Shirley Ho,
Larry Wasserman
Abstract:
We introduce the concept of coverage risk as an error measure for density ridge estimation. The coverage risk generalizes the mean integrated square error to set estimation. We propose two risk estimators for the coverage risk and we show that we can select tuning parameters by minimizing the estimated risk. We study the rate of convergence for coverage risk and prove consistency of the risk estimators. We apply our method to three simulated datasets and to cosmology data. In all the examples, the proposed method successfully recovers the underlying density structure.
Submitted 7 June, 2015;
originally announced June 2015.
-
Density Level Sets: Asymptotics, Inference, and Visualization
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Larry Wasserman
Abstract:
We derive asymptotic theory for the plug-in estimate for density level sets under Hausdorff loss. Based on the asymptotic theory, we propose two bootstrap confidence regions for level sets. The confidence regions can be used to perform tests for anomaly detection and clustering. We also introduce a technique to visualize high-dimensional density level sets by combining mode clustering and multidimensional scaling.
Submitted 5 September, 2016; v1 submitted 21 April, 2015;
originally announced April 2015.
-
Cosmic Web Reconstruction through Density Ridges: Method and Algorithm
Authors:
Yen-Chi Chen,
Shirley Ho,
Peter E. Freeman,
Christopher R. Genovese,
Larry Wasserman
Abstract:
The detection and characterization of filamentary structures in the cosmic web allow cosmologists to constrain parameters that dictate the evolution of the Universe. While many filament estimators have been proposed, they generally lack estimates of uncertainty, reducing their inferential power. In this paper, we demonstrate how one may apply the Subspace Constrained Mean Shift (SCMS) algorithm (Ozertem and Erdogmus 2011; Genovese et al. 2012) to uncover filamentary structure in galaxy data. The SCMS algorithm is a gradient ascent method that models filaments as density ridges, one-dimensional smooth curves that trace high-density regions within the point cloud. We also demonstrate how augmenting the SCMS algorithm with bootstrap-based methods of uncertainty estimation allows one to place uncertainty bands around putative filaments. We apply the SCMS method to datasets sampled from the P3M N-body simulation, with galaxy number densities consistent with SDSS and WFIRST-AFTA, and to LOWZ and CMASS data from the Baryon Oscillation Spectroscopic Survey (BOSS). To further assess the efficacy of SCMS, we compare the relative locations of BOSS filaments with galaxy clusters in the redMaPPer catalog, and find that redMaPPer clusters are significantly closer (with p-values $< 10^{-9}$) to SCMS-detected filaments than to randomly selected galaxies.
Submitted 27 August, 2015; v1 submitted 21 January, 2015;
originally announced January 2015.
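The SCMS update described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a Gaussian kernel with a single bandwidth `h`, uses the Hessian of the kernel density estimate itself (some formulations use the Hessian of the log density instead), and runs a fixed number of iterations rather than testing convergence. The names `scms_step` and `scms` are hypothetical.

```python
import numpy as np

def scms_step(x, data, h, ridge_dim=1):
    """One subspace constrained mean shift update of a single point x."""
    diff = data - x                                      # (n, D) offsets to data
    w = np.exp(-0.5 * np.sum(diff ** 2, axis=1) / h ** 2)  # Gaussian weights
    shift = w @ diff / w.sum()                           # plain mean shift vector
    dim = data.shape[1]
    # Hessian of the Gaussian KDE at x, up to a positive constant factor.
    hess = (diff * w[:, None]).T @ diff / h ** 4 - w.sum() * np.eye(dim) / h ** 2
    # eigh returns eigenvalues in ascending order; the first D - d eigenvectors
    # span the directions of strongest negative curvature (normal to the ridge).
    _, vecs = np.linalg.eigh(hess)
    normal = vecs[:, : dim - ridge_dim]
    # Move only along the normal directions, so the point climbs onto the ridge.
    return x + normal @ (normal.T @ shift)

def scms(points, data, h, n_iter=50, ridge_dim=1):
    """Apply scms_step repeatedly to every starting point."""
    out = np.array(points, dtype=float)
    for _ in range(n_iter):
        out = np.array([scms_step(p, data, h, ridge_dim) for p in out])
    return out

# Toy data: a noisy horizontal filament along y = 0.
rng = np.random.default_rng(0)
data = np.column_stack([rng.uniform(-1, 1, 400), rng.normal(0, 0.05, 400)])
start = np.column_stack([np.linspace(-0.5, 0.5, 5), np.full(5, 0.1)])
ridge = scms(start, data, h=0.3)
```

In this toy setup the starting points sit slightly above the filament and are pulled onto it, while their position along the filament is left nearly unchanged. Uncertainty bands of the kind the abstract mentions would come from repeating the procedure on bootstrap resamples of `data`, which is not shown here.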
-
Nonparametric modal regression
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Ryan J. Tibshirani,
Larry Wasserman
Abstract:
Modal regression estimates the local modes of the distribution of $Y$ given $X=x$, instead of the mean, as in the usual regression sense, and can hence reveal important structure missed by usual regression methods. We study a simple nonparametric method for modal regression, based on a kernel density estimate (KDE) of the joint distribution of $Y$ and $X$. We derive asymptotic error bounds for this method, and propose techniques for constructing confidence sets and prediction sets. The latter is used to select the smoothing bandwidth of the underlying KDE. The idea behind modal regression is connected to many others, such as mixture regression and density ridge estimation, and we discuss these ties as well.
Submitted 30 March, 2016; v1 submitted 4 December, 2014;
originally announced December 2014.
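One common way to locate conditional modes from a joint kernel density estimate is a partial mean shift: hold $x$ fixed and run mean shift in the $y$ direction only. The sketch below illustrates that idea under stated assumptions; the abstract does not spell out the algorithm, so this should not be read as the authors' procedure. It assumes Gaussian kernels, a single bandwidth `h` shared by $X$ and $Y$, and a fixed iteration count. The name `partial_mean_shift` is hypothetical.

```python
import numpy as np

def partial_mean_shift(x, y0, xs, ys, h, n_iter=200):
    """Climb the joint KDE in the y direction only, with x held fixed."""
    wx = np.exp(-0.5 * ((x - xs) / h) ** 2)      # kernel weights in x (constant)
    y = float(y0)
    for _ in range(n_iter):
        w = wx * np.exp(-0.5 * ((y - ys) / h) ** 2)  # joint kernel weights
        y = float(w @ ys / w.sum())                  # weighted mean of responses
    return y

# Toy data with two branches, Y near +1 or -1, where the conditional mean
# (about 0) would miss both.
rng = np.random.default_rng(1)
xs = rng.uniform(0, 1, 500)
ys = np.where(rng.random(500) < 0.5, 1.0, -1.0) + rng.normal(0, 0.1, 500)
upper = partial_mean_shift(0.5, 0.8, xs, ys, h=0.2)
lower = partial_mean_shift(0.5, -0.8, xs, ys, h=0.2)
```

Starting the iteration from several values of `y0` at each `x` and collecting the distinct limits gives the estimated set of local modes at that `x`.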
-
The functional mean-shift algorithm for mode hunting and clustering in infinite dimensions
Authors:
Mattia Ciollaro,
Christopher Genovese,
Jing Lei,
Larry Wasserman
Abstract:
We introduce the functional mean-shift algorithm, an iterative algorithm for estimating the local modes of a surrogate density from functional data. We show that the algorithm can be used for cluster analysis of functional data. We propose a test based on the bootstrap for the significance of the estimated local modes of the surrogate density. We present two applications of our methodology. In the first application, we demonstrate how the functional mean-shift algorithm can be used to perform spike sorting, i.e. cluster neural activity curves. In the second application, we use the functional mean-shift algorithm to distinguish between original and fake signatures.
Submitted 6 August, 2014;
originally announced August 2014.
-
Estimating the distribution of Galaxy Morphologies on a continuous space
Authors:
Giuseppe Vinci,
Peter Freeman,
Jeffrey Newman,
Larry Wasserman,
Christopher Genovese
Abstract:
The incredible variety of galaxy shapes cannot be summarized by human defined discrete classes of shapes without causing a possibly large loss of information. Dictionary learning and sparse coding allow us to reduce the high dimensional space of shapes into a manageable low dimensional continuous vector space. Statistical inference can be done in the reduced space via probability distribution estimation and manifold estimation.
Submitted 29 June, 2014;
originally announced June 2014.
-
Asymptotic theory for density ridges
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Larry Wasserman
Abstract:
The large sample theory of estimators for density modes is well understood. In this paper we consider density ridges, which are a higher-dimensional extension of modes. Modes correspond to zero-dimensional, local high-density regions in point clouds. Density ridges correspond to $s$-dimensional, local high-density regions in point clouds. We establish three main results. First we show that under appropriate regularity conditions, the local variation of the estimated ridge can be approximated by an empirical process. Second, we show that the distribution of the estimated ridge converges to a Gaussian process. Third, we establish that the bootstrap leads to valid confidence sets for density ridges.
Submitted 13 October, 2015; v1 submitted 21 June, 2014;
originally announced June 2014.
-
Generalized Mode and Ridge Estimation
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Larry Wasserman
Abstract:
The generalized density is a product of a density function and a weight function. For example, the average local brightness of an astronomical image is the probability of finding a galaxy times the mean brightness of the galaxy. We propose a method for studying the geometric structure of generalized densities. In particular, we show how to find the modes and ridges of a generalized density function using a modification of the mean shift algorithm and its variant, subspace constrained mean shift. Our method can be used to perform clustering and to calculate a measure of connectivity between clusters. We establish consistency and rates of convergence for our estimator and apply the methods to data from two astronomical problems.
Submitted 6 June, 2014;
originally announced June 2014.
-
A Comprehensive Approach to Mode Clustering
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Larry Wasserman
Abstract:
Mode clustering is a nonparametric method for clustering that defines clusters using the basins of attraction of a density estimator's modes. We provide several enhancements to mode clustering: (i) a soft variant of cluster assignment, (ii) a measure of connectivity between clusters, (iii) a technique for choosing the bandwidth, (iv) a method for denoising small clusters, and (v) an approach to visualizing the clusters. Combining all these enhancements gives us a complete procedure for clustering in multivariate problems. We also compare mode clustering to other clustering methods in several examples.
Submitted 22 December, 2015; v1 submitted 6 June, 2014;
originally announced June 2014.
-
Functional Regression for Quasar Spectra
Authors:
Mattia Ciollaro,
Jessi Cisewski,
Peter Freeman,
Christopher Genovese,
Jing Lei,
Ross O'Connell,
Larry Wasserman
Abstract:
The Lyman-alpha forest is a portion of the observed light spectrum of distant galactic nuclei which allows us to probe remote regions of the Universe that are otherwise inaccessible. The observed Lyman-alpha forest of a quasar light spectrum can be modeled as a noisy realization of a smooth curve that is affected by a `damping effect' which occurs whenever the light emitted by the quasar travels through regions of the Universe with higher matter concentration. To decode the information conveyed by the Lyman-alpha forest about the matter distribution, we must be able to separate the smooth `continuum' from the noise and the contribution of the damping effect in the quasar light spectra. To predict the continuum in the Lyman-alpha forest, we use a nonparametric functional regression model in which both the response and the predictor variable (the smooth part of the damping-free portion of the spectrum) are function-valued random variables. We demonstrate that the proposed method accurately predicts the unobservable continuum in the Lyman-alpha forest both on simulated spectra and real spectra. Also, we introduce distribution-free prediction bands for the nonparametric functional regression model that have finite sample guarantees. These prediction bands, together with bootstrap-based confidence bands for the projection of the mean continuum on a fixed number of principal components, allow us to assess the degree of uncertainty in the model predictions.
Submitted 11 April, 2014;
originally announced April 2014.
-
Nonparametric 3D map of the IGM using the Lyman-alpha forest
Authors:
Jessi Cisewski,
Rupert A. C. Croft,
Peter E. Freeman,
Christopher R. Genovese,
Nishikanta Khandai,
Melih Ozbek,
Larry Wasserman
Abstract:
Visualizing the high-redshift Universe is difficult due to the dearth of available data; however, the Lyman-alpha forest provides a means to map the intergalactic medium at redshifts not accessible to large galaxy surveys. Large-scale structure surveys, such as the Baryon Oscillation Spectroscopic Survey (BOSS), have collected quasar (QSO) spectra that enable the reconstruction of HI density fluctuations. The data fall on a collection of lines defined by the lines-of-sight (LOS) of the QSO, and a major issue with producing a 3D reconstruction is determining how to model the regions between the LOS. We present a method that produces a 3D map of this relatively uncharted portion of the Universe by employing local polynomial smoothing, a nonparametric methodology. The performance of the method is analyzed on simulated data that mimics the varying number of LOS expected in real data, and then is applied to a sample region selected from BOSS. Evaluation of the reconstruction is assessed by considering various features of the predicted 3D maps including visual comparison of slices, PDFs, counts of local minima and maxima, and standardized correlation functions. This 3D reconstruction allows for an initial investigation of the topology of this portion of the Universe using persistent homology.
Submitted 8 January, 2014;
originally announced January 2014.
-
Nonparametric Inference For Density Modes
Authors:
Christopher Genovese,
Marco Perone-Pacifico,
Isabella Verdinelli,
Larry Wasserman
Abstract:
We derive nonparametric confidence intervals for the eigenvalues of the Hessian at modes of a density estimate. This provides information about the strength and shape of modes and can also be used as a significance test. We use a data-splitting approach in which potential modes are identified using the first half of the data and inference is done with the second half of the data. To get valid confidence sets for the eigenvalues, we use a bootstrap based on an elementary-symmetric-polynomial (ESP) transformation. This leads to valid bootstrap confidence sets regardless of any multiplicities in the eigenvalues. We also suggest a new method for bandwidth selection, namely, choosing the bandwidth to maximize the number of significant modes. We show by example that this method works well. Even when the true distribution is singular, and hence does not have a density, (in which case cross validation chooses a zero bandwidth), our method chooses a reasonable bandwidth.
Submitted 29 December, 2013;
originally announced December 2013.
-
Uncertainty Measures and Limiting Distributions for Filament Estimation
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Larry Wasserman
Abstract:
A filament is a high density, connected region in a point cloud. There are several methods for estimating filaments but these methods do not provide any measure of uncertainty. We give a definition for the uncertainty of estimated filaments and we study statistical properties of the estimated filaments. We show how to estimate the uncertainty measures and we construct confidence sets based on a bootstrapping technique. We apply our methods to astronomy data and earthquake data.
Submitted 7 December, 2013;
originally announced December 2013.
-
Nonparametric ridge estimation
Authors:
Christopher R. Genovese,
Marco Perone-Pacifico,
Isabella Verdinelli,
Larry Wasserman
Abstract:
We study the problem of estimating the ridges of a density function. Ridge estimation is an extension of mode finding and is useful for understanding the structure of a density. It can also be used to find hidden structure in point cloud data. We show that, under mild regularity conditions, the ridges of the kernel density estimator consistently estimate the ridges of the true density. When the data are noisy measurements of a manifold, we show that the ridges are close and topologically similar to the hidden manifold. To find the estimated ridges in practice, we adapt the modified mean-shift algorithm proposed by Ozertem and Erdogmus [J. Mach. Learn. Res. 12 (2011) 1249-1286]. Some numerical experiments verify that the algorithm is accurate.
Submitted 28 August, 2014; v1 submitted 20 December, 2012;
originally announced December 2012.
-
Efficient Estimators for Sequential and Resolution-Limited Inverse Problems
Authors:
Darren Homrighausen,
Christopher R. Genovese
Abstract:
A common problem in the sciences is that a signal of interest is observed only indirectly, through smooth functionals of the signal whose values are then obscured by noise. In such inverse problems, the functionals dampen or entirely eliminate some of the signal's interesting features. This makes it difficult or even impossible to fully reconstruct the signal, even without noise. In this paper, we develop methods for handling sequences of related inverse problems, with the problems varying either systematically or randomly over time. Such sequences often arise with automated data collection systems, like the data pipelines of large astronomical instruments such as the Large Synoptic Survey Telescope (LSST). The LSST will observe each patch of the sky many times over its lifetime under varying conditions. A possible additional complication in these problems is that the observational resolution is limited by the instrument, so that even with many repeated observations, only an approximation of the underlying signal can be reconstructed. We propose an efficient estimator for reconstructing a signal of interest given a sequence of related, resolution-limited inverse problems. We demonstrate our method's effectiveness in some representative examples and provide theoretical support for its adoption.
Submitted 2 July, 2012;
originally announced July 2012.
-
Regularization Techniques for PSF-Matching Kernels. I. Choice of Kernel Basis
Authors:
A. C. Becker,
D. Homrighausen,
A. J. Connolly,
C. R. Genovese,
R. Owen,
S. J. Bickerton,
R. H. Lupton
Abstract:
We review current methods for building PSF-matching kernels for the purposes of image subtraction or coaddition. Such methods use a linear decomposition of the kernel on a series of basis functions. The correct choice of these basis functions is fundamental to the efficiency and effectiveness of the matching - the chosen bases should represent the underlying signal using a reasonably small number of shapes, and/or have a minimum number of user-adjustable tuning parameters. We examine methods whose bases comprise multiple Gauss-Hermite polynomials, as well as a form free basis composed of delta-functions. Kernels derived from delta-functions are unsurprisingly shown to be more expressive; they are able to take more general shapes and perform better in situations where sum-of-Gaussian methods are known to fail. However, due to its many degrees of freedom (the maximum number allowed by the kernel size) this basis tends to overfit the problem, and yields noisy kernels having large variance. We introduce a new technique to regularize these delta-function kernel solutions, which bridges the gap between the generality of delta-function kernels, and the compactness of sum-of-Gaussian kernels. Through this regularization we are able to create general kernel solutions that represent the intrinsic shape of the PSF-matching kernel with only one degree of freedom, the strength of the regularization lambda. The role of lambda is effectively to exchange variance in the resulting difference image with variance in the kernel itself. We examine considerations in choosing the value of lambda, including statistical risk estimators and the ability of the solution to predict solutions for adjacent areas. Both of these suggest moderate strengths of lambda between 0.1 and 1.0, although this optimization is likely dataset dependent.
Submitted 13 February, 2012;
originally announced February 2012.
-
Manifold estimation and singular deconvolution under Hausdorff loss
Authors:
Christopher R. Genovese,
Marco Perone-Pacifico,
Isabella Verdinelli,
Larry Wasserman
Abstract:
We find lower and upper bounds for the risk of estimating a manifold in Hausdorff distance under several models. We also show that there are close connections between manifold estimation and the problem of deconvolving a singular measure.
Submitted 5 June, 2012; v1 submitted 21 September, 2011;
originally announced September 2011.
-
Image Coaddition with Temporally Varying Kernels
Authors:
Darren Homrighausen,
Christopher Genovese,
Andy Connolly,
Andy Becker,
Russell Owen
Abstract:
Large, multi-frequency imaging surveys, such as the Large Synoptic Survey Telescope (LSST), need to do near-real time analysis of very large datasets. This raises a host of statistical and computational problems where standard methods do not work. In this paper, we study a proposed method for combining stacks of images into a single summary image, sometimes referred to as a template. This task is commonly referred to as image coaddition. In part, we focus on a method proposed in previous work, which outlines a procedure for combining stacks of images in an online fashion in the Fourier domain. We evaluate this method by comparing it to two straightforward methods through the use of various criteria and simulations. Note that the goal is not to propose these comparison methods for use in their own right, but to ensure that additional complexity also provides substantially improved performance.
Submitted 17 November, 2010;
originally announced November 2010.
-
Discussion of: Brownian distance covariance
Authors:
Christopher R. Genovese
Abstract:
Discussion on "Brownian distance covariance" by Gábor J. Székely and Maria L. Rizzo [arXiv:1010.0297]
Submitted 5 October, 2010;
originally announced October 2010.
-
Minimax Manifold Estimation
Authors:
Christopher Genovese,
Marco Perone-Pacifico,
Isabella Verdinelli,
Larry Wasserman
Abstract:
We find the minimax rate of convergence in Hausdorff distance for estimating a manifold M of dimension d embedded in R^D given a noisy sample from the manifold. We assume that the manifold satisfies a smoothness condition and that the noise distribution has compact support. We show that the optimal rate of convergence is n^{-2/(2+d)}. Thus, the minimax rate depends only on the dimension of the manifold, not on the dimension of the space in which M is embedded.
Submitted 28 September, 2011; v1 submitted 4 July, 2010;
originally announced July 2010.
-
The Geometry of Nonparametric Filament Estimation
Authors:
Christopher R. Genovese,
Marco Perone-Pacifico,
Isabella Verdinelli,
Larry Wasserman
Abstract:
We consider the problem of estimating filamentary structure from planar point process data. We make some connections with computational geometry and we develop nonparametric methods for estimating the filaments. We show that, under weak conditions, the filaments have a simple geometric representation as the medial axis of the data distribution's support. Our methods convert an estimator of the support's boundary into an estimator of the filaments. We also find the rates of convergence of our estimators.
Submitted 12 December, 2010; v1 submitted 25 March, 2010;
originally announced March 2010.
-
Revisiting Marginal Regression
Authors:
Christopher Genovese,
Jiashun Jin,
Larry Wasserman
Abstract:
The lasso has become an important practical tool for high dimensional regression as well as the object of intense theoretical investigation. But despite the availability of efficient algorithms, the lasso remains computationally demanding in regression problems where the number of variables vastly exceeds the number of data points. A much older method, marginal regression, largely displaced by the lasso, offers a promising alternative in this case. Computation for marginal regression is practical even when the dimension is very high. In this paper, we study the relative performance of the lasso and marginal regression for regression problems in three different regimes: (a) exact reconstruction in the noise-free and noisy cases when design and coefficients are fixed, (b) exact reconstruction in the noise-free case when the design is fixed but the coefficients are random, and (c) reconstruction in the noisy case where performance is measured by the number of coefficients whose sign is incorrect.
In the first regime, we compare the conditions for exact reconstruction of the two procedures, find examples where each procedure succeeds while the other fails, and characterize the advantages and disadvantages of each. In the second regime, we derive conditions under which marginal regression will provide exact reconstruction with high probability. And in the third regime, we derive rates of convergence for the procedures and offer a new partitioning of the ``phase diagram,'' that shows when exact or Hamming reconstruction is effective.
Submitted 20 November, 2009;
originally announced November 2009.
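
The abstract above contrasts the lasso with marginal regression, which is cheap because each variable is handled on its own. The following is a minimal sketch of that idea, not the authors' procedure: it regresses the response on each column separately and keeps the variables whose coefficient exceeds a cutoff in absolute value. The function name `marginal_regression`, the `threshold` parameter, and the toy data are assumptions made for illustration.

```python
def marginal_regression(X, y, threshold):
    """Fit one single-variable least-squares slope per column of X.

    X is a list of rows, y a list of responses. Returns the per-column
    slopes and the indices whose slope exceeds `threshold` in absolute
    value. Illustrative sketch only; `threshold` is a hypothetical
    tuning parameter, not a value taken from the paper.
    """
    n_cols = len(X[0])
    coefs = []
    for j in range(n_cols):
        col = [row[j] for row in X]
        num = sum(c * yi for c, yi in zip(col, y))  # <x_j, y>
        den = sum(c * c for c in col)               # <x_j, x_j>
        coefs.append(num / den)
    selected = [j for j, b in enumerate(coefs) if abs(b) > threshold]
    return coefs, selected


# Toy design with mutually orthogonal columns, so the marginal slopes
# coincide with the joint least-squares coefficients.
X = [[1, 0, 1], [0, 1, 1], [1, 0, -1], [0, 1, -1]]
y = [-1, -3, 5, 3]  # equals 2 * column 0 - 3 * column 2
coefs, selected = marginal_regression(X, y, threshold=0.5)
```

Each slope costs one pass over a single column, so the total work grows linearly in the number of variables. With correlated columns the marginal slopes can differ from the joint fit, which is the setting the abstract's comparison addresses.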
-
Straight to the Source: Detecting Aggregate Objects in Astronomical Images with Proper Error Control
Authors:
David A. Friedenberg,
Christopher R. Genovese
Abstract:
The next generation of telescopes will acquire terabytes of image data on a nightly basis. Collectively, these large images will contain billions of interesting objects, which astronomers call sources. The astronomers' task is to construct a catalog detailing the coordinates and other properties of the sources. The source catalog is the primary data product for most telescopes and is an important input for testing new astrophysical theories, but to construct the catalog one must first detect the sources. Existing algorithms for catalog creation are effective at detecting sources, but do not have rigorous statistical error control. At the same time, there are several multiple testing procedures that provide rigorous error control, but they are not designed to detect sources that are aggregated over several pixels. In this paper, we propose a technique that does both, by providing rigorous statistical error control on the aggregate objects themselves rather than the pixels. We demonstrate the effectiveness of this approach on data from the Chandra X-ray Observatory Satellite. Our technique effectively controls the rate of false sources, yet still detects almost all of the sources detected by procedures that do not have such rigorous error control and have the advantage of additional data in the form of follow up observations, which will not be available for upcoming large telescopes. In fact, we even detect a new source that was missed by previous studies. The statistical methods developed in this paper can be extended to problems beyond Astronomy, as we will illustrate with an example from Neuroimaging.
Submitted 28 October, 2009;
originally announced October 2009.
-
Revealing components of the galaxy population through nonparametric techniques
Authors:
Steven P. Bamford,
Alex L. Rojas,
Robert C. Nichol,
Christopher J. Miller,
Larry Wasserman,
Christopher R. Genovese,
Peter E. Freeman
Abstract:
The distributions of galaxy properties vary with environment, and are often multimodal, suggesting that the galaxy population may be a combination of multiple components. The behaviour of these components versus environment holds details about the processes of galaxy development. To release this information we apply a novel, nonparametric statistical technique, identifying four components present in the distribution of galaxy H$α$ emission-line equivalent-widths. We interpret these components as passive, star-forming, and two varieties of active galactic nuclei. Independent of this interpretation, the properties of each component are remarkably constant as a function of environment. Only their relative proportions display substantial variation. The galaxy population thus appears to comprise distinct components which are individually independent of environment, with galaxies rapidly transitioning between components as they move into denser environments.
Submitted 16 September, 2008;
originally announced September 2008.
-
On the path density of a gradient field
Authors:
Christopher R. Genovese,
Marco Perone-Pacifico,
Isabella Verdinelli,
Larry Wasserman
Abstract:
We consider the problem of reliably finding filaments in point clouds. Realistic data sets often have numerous filaments of various sizes and shapes. Statistical techniques exist for finding one (or a few) filaments but these methods do not handle noisy data sets with many filaments. Other methods can be found in the astronomy literature but they do not have rigorous statistical guarantees. We propose the following method. Starting at each data point we construct the steepest ascent path along a kernel density estimator. We locate filaments by finding regions where these paths are highly concentrated. Formally, we define the density of these paths and we construct a consistent estimator of this path density.
Submitted 11 September, 2009; v1 submitted 27 May, 2008;
originally announced May 2008.
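
The abstract above describes following the steepest ascent path of a kernel density estimator from each data point. As a rough illustration, and not the authors' algorithm, the sketch below uses the standard mean-shift iteration with a Gaussian kernel as a discrete stand-in for that ascent path. The names `mean_shift_step` and `ascent_path`, the bandwidth `h`, and the toy data are all assumptions.

```python
import math


def mean_shift_step(x, data, h):
    """Move x to the Gaussian-kernel weighted mean of the data points."""
    weights = [math.exp(-math.dist(x, p) ** 2 / (2 * h * h)) for p in data]
    total = sum(weights)
    return tuple(
        sum(w * p[k] for w, p in zip(weights, data)) / total
        for k in range(len(x))
    )


def ascent_path(start, data, h, tol=1e-8, max_iter=500):
    """Iterate mean-shift steps from `start` until the move is below `tol`.

    Returns the list of visited points. Mean shift stands in here for
    the continuous steepest ascent path mentioned in the abstract.
    """
    path = [tuple(start)]
    for _ in range(max_iter):
        nxt = mean_shift_step(path[-1], data, h)
        if math.dist(nxt, path[-1]) < tol:
            break
        path.append(nxt)
    return path


# Two well-separated toy clusters; each path should end near its own cluster.
data = [(0.0, 0.0), (0.1, 0.0), (-0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (4.9, 5.0)]
end_a = ascent_path((0.1, 0.0), data, h=0.5)[-1]
end_b = ascent_path((5.1, 5.0), data, h=0.5)[-1]
```

Collecting the paths from every data point and counting how often they pass through each region would give a crude picture of where paths concentrate. The consistent path-density estimator itself is defined in the paper and is not reproduced here.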
-
Inference for the dark energy equation of state using Type IA supernova data
Authors:
Christopher Genovese,
Peter Freeman,
Larry Wasserman,
Robert Nichol,
Christopher Miller
Abstract:
The surprising discovery of an accelerating universe led cosmologists to posit the existence of "dark energy"--a mysterious energy field that permeates the universe. Understanding dark energy has become the central problem of modern cosmology. After describing the scientific background in depth, we formulate the task as a nonlinear inverse problem that expresses the comoving distance function in terms of the dark-energy equation of state. We present two classes of methods for making sharp statistical inferences about the equation of state from observations of Type Ia Supernovae (SNe). First, we derive a technique for testing hypotheses about the equation of state that requires no assumptions about its form and can distinguish among competing theories. Second, we present a framework for computing parametric and nonparametric estimators of the equation of state, with an associated assessment of uncertainty. Using our approach, we evaluate the strength of statistical evidence for various competing models of dark energy. Consistent with current studies, we find that with the available Type Ia SNe data, it is not possible to distinguish statistically among popular dark-energy models, and that, in particular, there is no support in the data for rejecting a cosmological constant. With much more supernova data likely to be available in coming years (e.g., from the DOE/NASA Joint Dark Energy Mission), we address the more interesting question of whether future data sets will have sufficient resolution to distinguish among competing theories.
Submitted 18 May, 2009; v1 submitted 27 May, 2008;
originally announced May 2008.
-
Mapping the Cosmological Confidence Ball Surface
Authors:
Brent Bryan,
Jeff Schneider,
Christopher J. Miller,
Robert C. Nichol,
Christopher Genovese,
Larry Wasserman
Abstract:
We present a new technique to compute simultaneously valid confidence intervals for a set of model parameters. We apply our method to the Wilkinson Microwave Anisotropy Probe's (WMAP) Cosmic Microwave Background (CMB) data, exploring a seven dimensional space (tau, Omega_DE, Omega_M, omega_DM, omega_B, f_nu, n_s). We find two distinct regions-of-interest: the standard Concordance Model, and a region with large values of omega_DM, omega_B and H_0. This second peak in parameter space can be rejected by applying a constraint (or a prior) on the allowable values of the Hubble constant. Our new technique uses a non-parametric fit to the data, along with a frequentist approach and a smart search algorithm to map out a statistical confidence surface. The result is a confidence ``ball'': a set of parameter values that contains the true value with probability at least 1-alpha. Our algorithm performs a role similar to the often used Markov Chain Monte Carlo (MCMC), which samples from the posterior probability function in order to provide Bayesian credible intervals on the parameters. While the MCMC approach samples densely around a peak in the posterior, our new technique allows cosmologists to perform efficient analyses around any regions of interest: e.g., the peak itself, or, possibly more importantly, the 1-alpha confidence surface.
Submitted 19 April, 2007;
originally announced April 2007.
-
Adaptive Confidence Bands
Authors:
Christopher R. Genovese,
Larry Wasserman
Abstract:
We show that there do not exist adaptive confidence bands for curve estimation except under very restrictive assumptions. We propose instead to construct adaptive bands that cover a surrogate function f^\star which is close to, but simpler than, f. The surrogate captures the significant features in f. We establish lower bounds on the width for any confidence band for f^\star and construct a procedure that comes within a small constant factor of attaining the lower bound for finite samples.
Submitted 18 January, 2007;
originally announced January 2007.
-
Statistical Computations with AstroGrid and the Grid
Authors:
Robert C Nichol,
Garry Smith,
Christopher J Miller,
Chris Genovese,
Larry Wasserman,
Brent Bryan,
Alexander Gray,
Jeff Schneider,
Andrew W Moore
Abstract:
We outline our first steps towards marrying two new and emerging technologies: the Virtual Observatory (e.g., AstroGrid) and the computational grid. We discuss the construction of VOTechBroker, which is a modular software tool designed to abstract the tasks of submission and management of a large number of computational jobs to a distributed computer system. The broker will also interact with the AstroGrid workflow and MySpace environments. We present our planned usage of the VOTechBroker in computing a huge number of n-point correlation functions from the SDSS, as well as fitting over a million CMBfast models to the WMAP data.
Submitted 15 November, 2005;
originally announced November 2005.
-
Massive Science with VO and Grids
Authors:
Robert Nichol,
Garry Smith,
Christopher Miller,
Peter Freeman,
Chris Genovese,
Larry Wasserman,
Brent Bryan,
Alexander Gray,
Jeff Schneider,
Andrew Moore
Abstract:
There is a growing need for massive computational resources for the analysis of new astronomical datasets. To tackle this problem, we present here our first steps towards marrying two new and emerging technologies: the Virtual Observatory (e.g., AstroGrid) and the computational grid (e.g., TeraGrid, COSMOS, etc.). We discuss the construction of VOTechBroker, which is a modular software tool designed to abstract the tasks of submission and management of a large number of computational jobs to a distributed computer system. The broker will also interact with the AstroGrid workflow and MySpace environments. We discuss our planned uses of the VOTechBroker in computing a huge number of n-point correlation functions from the SDSS data and massive model-fitting of millions of CMBfast models to WMAP data. We also discuss other applications including the determination of the XMM Cluster Survey selection function and the construction of new WMAP maps.
Submitted 31 October, 2005;
originally announced October 2005.
-
Examining the Effect of the Map-Making Algorithm on Observed Power Asymmetry in WMAP Data
Authors:
P. E. Freeman,
C. R. Genovese,
C. J. Miller,
R. C. Nichol,
L. Wasserman
Abstract:
We analyze first-year data of WMAP to determine the significance of asymmetry in summed power between arbitrarily defined opposite hemispheres, using maps that we create ourselves with software developed independently of the WMAP team. We find that over the multipole range l=[2,64], the significance of asymmetry is ~ 10^-4, a value insensitive to both frequency and power spectrum. We determine the smallest multipole ranges exhibiting significant asymmetry, and find twelve, including l=[2,3] and [6,7], for which the significance -> 0. In these ranges there is an improbable association between the direction of maximum significance and the ecliptic plane (p ~ 0.01). Also, contours of least significance follow great circles inclined relative to the ecliptic at the largest scales. The great circle for l=[2,3] passes over previously reported preferred axes and is insensitive to frequency, while the great circle for l=[6,7] is aligned with the ecliptic poles. We examine how changing map-making parameters affects asymmetry, and find that at large scales, it is rendered insignificant if the magnitude of the WMAP dipole vector is increased by approximately 1-3 sigma (or 2-6 km/s). While confirmation of this result would require data recalibration, such a systematic change would be consistent with observations of frequency-independent asymmetry. We conclude that the use of an incorrect dipole vector, in combination with a systematic or foreground process associated with the ecliptic, may help to explain the observed asymmetry.
Submitted 13 October, 2005;
originally announced October 2005.
-
Confidence sets for nonparametric wavelet regression
Authors:
Christopher R. Genovese,
Larry Wasserman
Abstract:
We construct nonparametric confidence sets for regression functions using wavelets that are uniform over Besov balls. We consider both thresholding and modulation estimators for the wavelet coefficients. The confidence set is obtained by showing that a pivot process, constructed from the loss function, converges uniformly to a mean zero Gaussian process. Inverting this pivot yields a confidence set for the wavelet coefficients, and from this we obtain confidence sets on functionals of the regression curve.
Submitted 30 May, 2005;
originally announced May 2005.
-
Nonparametric Inference for the Cosmic Microwave Background
Authors:
Christopher R. Genovese,
Christopher J. Miller,
Robert C. Nichol,
Mihir Arjunwadkar,
Larry Wasserman
Abstract:
The Cosmic Microwave Background (CMB), which permeates the entire Universe, is the radiation left over from just 380,000 years after the Big Bang. On very large scales, the CMB radiation field is smooth and isotropic, but the existence of structure in the Universe - stars, galaxies, clusters of galaxies - suggests that the field should fluctuate on smaller scales. Recent observations, from the Cosmic Background Explorer to the Wilkinson Microwave Anisotropy Probe, have strikingly confirmed this prediction. CMB fluctuations provide clues to the Universe's structure and composition shortly after the Big Bang that are critical for testing cosmological models. For example, CMB data can be used to determine what portion of the Universe is composed of ordinary matter versus the mysterious dark matter and dark energy. To this end, cosmologists usually summarize the fluctuations by the power spectrum, which gives the variance as a function of angular frequency. The spectrum's shape, and in particular the location and height of its peaks, relates directly to the parameters in the cosmological models. Thus, a critical statistical question is how accurately these peaks can be estimated. We use recently developed techniques to construct a nonparametric confidence set for the unknown CMB spectrum. Our estimated spectrum, based on minimal assumptions, closely matches the model-based estimates used by cosmologists, but we can make a wide range of additional inferences. We apply these techniques to test various models and to extract confidence intervals on cosmological parameters of interest. Our analysis shows that, even without parametric assumptions, the first peak is resolved accurately with current data but that the second and third peaks are not.
Submitted 6 October, 2004;
originally announced October 2004.
-
A stochastic process approach to false discovery control
Authors:
Christopher Genovese,
Larry Wasserman
Abstract:
This paper extends the theory of false discovery rates (FDR) pioneered by Benjamini and Hochberg [J. Roy. Statist. Soc. Ser. B 57 (1995) 289-300].
We develop a framework in which the False Discovery Proportion (FDP)--the number of false rejections divided by the number of rejections--is treated as a stochastic process. After obtaining the limiting distribution of the process, we demonstrate the validity of a class of procedures for controlling the False Discovery Rate (the expected FDP). We construct a confidence envelope for the whole FDP process. From these envelopes we derive confidence thresholds, for controlling the quantiles of the distribution of the FDP as well as controlling the number of false discoveries. We also investigate methods for estimating the p-value distribution.
Submitted 25 June, 2004;
originally announced June 2004.
-
Multi-Tree Methods for Statistics on Very Large Datasets in Astronomy
Authors:
Alexander G. Gray,
Andrew W. Moore,
Robert C. Nichol,
Andrew J. Connolly,
Christopher Genovese,
Larry Wasserman
Abstract:
Many fundamental statistical methods have become critical tools for scientific data analysis yet do not scale tractably to modern large datasets. This paper will describe very recent algorithms based on computational geometry which have dramatically reduced the computational complexity of 1) kernel density estimation (which also extends to nonparametric regression, classification, and clustering), and 2) the n-point correlation function for arbitrary n. These new multi-tree methods typically yield orders of magnitude in speedup over the previous state of the art for similar accuracy, making millions of data points tractable on desktop workstations for the first time.
Submitted 8 January, 2004;
originally announced January 2004.
-
Non-Parametric Inference in Astrophysics
Authors:
Larry Wasserman,
Christopher J. Miller,
Robert C. Nichol,
Chris Genovese,
Woncheol Jang,
Andrew J. Connolly,
Andrew W. Moore,
Jeff Schneider,
the PICA group
Abstract:
We discuss non-parametric density estimation and regression for astrophysics problems. In particular, we show how to compute non-parametric confidence intervals for the location and size of peaks of a function. We illustrate these ideas with recent data on the Cosmic Microwave Background. We also briefly discuss non-parametric Bayesian inference.
Submitted 3 December, 2001;
originally announced December 2001.
-
A Non-parametric Analysis of the CMB Power Spectrum
Authors:
Christopher J. Miller,
Robert C. Nichol,
Christopher Genovese,
Larry Wasserman
Abstract:
We examine Cosmic Microwave Background (CMB) temperature power spectra from the BOOMERANG, MAXIMA, and DASI experiments. We non-parametrically estimate the true power spectrum with no model assumptions. This is a significant departure from previous research which used either cosmological models or some other parameterized form (e.g. parabolic fits). Our non-parametric estimate is practically indistinguishable from the best fit cosmological model, thus lending independent support to the underlying physics that governs these models. We also generate a confidence set for the non-parametric fit and extract confidence intervals for the numbers, locations, and heights of peaks and the successive peak-to-peak height ratios. At the 95%, 68%, and 40% confidence levels, we find functions that fit the data with one, two, and three peaks respectively (0 <= l <= 1100). Therefore, the current data prefer two peaks at the 1 sigma level. However, we also rule out a constant temperature function at the >8 sigma level. If we assume that there are three peaks in the data, we find their locations to be within l_1 = (118,300), l_2 = (377,650), and l_3 = (597,900). We find the ratio of the first peak-height to the second (Delta T_1)/(Delta T_2)^2= (1.06, 4.27) and the second to the third (Delta T_2)/(Delta T_3)^2= (0.41, 2.5). All measurements are for 95% confidence. If the standard errors on the temperature measurements were reduced to a third of what they are currently, as we expect to be achieved by the MAP and Planck CMB experiments, we could eliminate two-peak models at the 95% confidence limit. The non-parametric methodology discussed in this paper has many astrophysical applications.
Submitted 3 December, 2001;
originally announced December 2001.
-
A new source detection algorithm using FDR
Authors:
A. M. Hopkins,
C. J. Miller,
A. J. Connolly,
C. Genovese,
R. C. Nichol,
L. Wasserman
Abstract:
The False Discovery Rate (FDR) method has recently been described by Miller et al. (2001), along with several examples of astrophysical applications. FDR is a new statistical procedure due to Benjamini and Hochberg (1995) for controlling the fraction of false positives when performing multiple hypothesis testing. The importance of this method to source detection algorithms is immediately clear. To explore the possibilities offered, we have developed a new task for performing source detection in radio-telescope images, Sfind 2.0, which implements FDR. We compare Sfind 2.0 with two other source detection and measurement tasks, Imsad and SExtractor, and comment on several issues arising from the nature of the correlation between nearby pixels and the necessary assumption of the null hypothesis. We strongly suggest that implementing FDR as a threshold-defining method in other existing source-detection tasks is easy and worthwhile. We show that the constraint on the fraction of false detections as specified by FDR holds true even for highly correlated and realistic images. For the detection of true sources, which are complex combinations of source-pixels, this constraint appears to be somewhat less strict. It is still reliable enough, however, for a priori estimates of the fraction of false source detections to be robust and realistic.
Submitted 26 October, 2001;
originally announced October 2001.