-
Incorporating Measurement Error in Astronomical Object Classification
Authors:
Sarah Shy,
Hyungsuk Tak,
Eric D. Feigelson,
John D. Timlin,
G. Jogesh Babu
Abstract:
Most general-purpose classification methods, such as support-vector machine (SVM) and random forest (RF), fail to account for an unusual characteristic of astronomical data: known measurement error uncertainties. In astronomical data, this information is often given in the data but discarded because popular machine learning classifiers cannot incorporate it. We propose a simulation-based approach that incorporates heteroscedastic measurement error into existing classification methods to better quantify uncertainty in classification. The proposed method first simulates perturbed realizations of the data from a Bayesian posterior predictive distribution of a Gaussian measurement error model. Then, a chosen classifier is fit to each simulation. The variation across the simulations naturally reflects the uncertainty propagated from the measurement errors in both labeled and unlabeled data sets. We demonstrate the use of this approach via two numerical studies. The first is a thorough simulation study applying the proposed procedure to SVM and RF, which are well-known hard and soft classifiers, respectively. The second study is a realistic classification problem of identifying high-$z$ $(2.9 \leq z \leq 5.1)$ quasar candidates from photometric data. The data are from merged catalogs of the Sloan Digital Sky Survey, the $Spitzer$ IRAC Equatorial Survey, and the $Spitzer$-HETDEX Exploratory Large-Area Survey. The proposed approach reveals that out of 11,847 high-$z$ quasar candidates identified by a random forest without incorporating measurement error, 3,146 are potential misclassifications with measurement error. Additionally, out of $1.85$ million objects not identified as high-$z$ quasars without measurement error, 936 can be considered new candidates with measurement error.
Submitted 2 May, 2022; v1 submitted 13 December, 2021;
originally announced December 2021.
-
A Statistician Teaches Deep Learning
Authors:
G. Jogesh Babu,
David Banks,
Hyunsoon Cho,
David Han,
Hailin Sang,
Shouyi Wang
Abstract:
Deep learning (DL) has gained much attention and become increasingly popular in modern data science. Computer scientists led the way in developing deep learning techniques, so the ideas and perspectives can seem alien to statisticians. Nonetheless, it is important that statisticians become involved -- many of our students need this expertise for their careers. In this paper, developed as part of a program on DL held at the Statistical and Applied Mathematical Sciences Institute, we address this culture gap and provide tips on how to teach deep learning to statistics graduate students. After some background, we list ways in which DL and statistical perspectives differ, provide a recommended syllabus that evolved from teaching two iterations of a DL graduate course, offer examples of suggested homework assignments, give an annotated list of teaching resources, and discuss DL in the context of two research areas.
Submitted 3 February, 2021; v1 submitted 28 January, 2021;
originally announced February 2021.
-
21st Century Statistical and Computational Challenges in Astrophysics
Authors:
Eric D. Feigelson,
Rafael S. de Souza,
Emille E. O. Ishida,
Gutti Jogesh Babu
Abstract:
Modern astronomy has been rapidly increasing our ability to see deeper into the universe, acquiring enormous samples of cosmic populations. Gaining astrophysical insights from these datasets requires a wide range of sophisticated statistical and machine learning methods. Long-standing problems in cosmology include characterization of galaxy clustering and estimation of galaxy distances from photometric colors. Bayesian inference, central to linking astronomical data to nonlinear astrophysical models, addresses problems in solar physics, properties of star clusters, and exoplanet systems. Likelihood-free methods are growing in importance. Detection of faint signals in complicated noise is needed to find periodic behaviors in stars and detect explosive gravitational wave events. Open issues concern treatment of heteroscedastic measurement errors and understanding probability distributions characterizing astrophysical systems. The field of astrostatistics needs increased collaboration with statisticians in the design and analysis stages of research projects, and to jointly develop new statistical methodologies. Together, they will draw more astrophysical insights into astronomical populations and the cosmos itself.
Submitted 26 May, 2020;
originally announced May 2020.
-
Algorithms and Statistical Models for Scientific Discovery in the Petabyte Era
Authors:
Brian Nord,
Andrew J. Connolly,
Jamie Kinney,
Jeremy Kubica,
Gautaum Narayan,
Joshua E. G. Peek,
Chad Schafer,
Erik J. Tollerud,
Camille Avestruz,
G. Jogesh Babu,
Simon Birrer,
Douglas Burke,
João Caldeira,
Douglas A. Caldwell,
Joleen K. Carlberg,
Yen-Chi Chen,
Chuanfei Dong,
Eric D. Feigelson,
V. Zach Golkhou,
Vinay Kashyap,
T. S. Li,
Thomas Loredo,
Luisa Lucie-Smith,
Kaisey S. Mandel,
J. R. Martínez-Galarza
, et al. (13 additional authors not shown)
Abstract:
The field of astronomy has arrived at a turning point in terms of size and complexity of both datasets and scientific collaboration. Commensurately, algorithms and statistical models have begun to adapt --- e.g., via the onset of artificial intelligence --- which itself presents new challenges and opportunities for growth. This white paper aims to offer guidance and ideas for how we can evolve our technical and collaborative frameworks to promote efficient algorithmic development and take advantage of opportunities for scientific discovery in the petabyte era. We discuss challenges for discovery in large and complex data sets; challenges and requirements for the next stage of development of statistical methodologies and algorithmic tool sets; how we might change our paradigms of collaboration and education; and the ethical implications of scientists' contributions to widely applicable algorithms and computational modeling. We start with six distinct recommendations that are supported by the commentary following them. This white paper is related to a larger corpus of effort that has taken place within and around the Petabytes to Science Workshops (https://petabytestoscience.github.io/).
Submitted 4 November, 2019;
originally announced November 2019.
-
AutoRegressive Planet Search: Application to the Kepler Mission
Authors:
Gabriel A. Caceres,
Eric D. Feigelson,
G. Jogesh Babu,
Natalia Bahamonde,
Alejandra Christen,
Karine Bertin,
Cristian Meza,
Michel Curé
Abstract:
The 4-year light curves of 156,717 stars observed with NASA's Kepler mission are analyzed using the AutoRegressive Planet Search (ARPS) methodology described by Caceres et al. (2019). The three stages of processing are: maximum likelihood ARIMA modeling of the light curves to reduce stellar brightness variations; constructing the Transit Comb Filter periodogram to identify transit-like periodic dips in the ARIMA residuals; Random Forest classification trained on Kepler Team confirmed planets using several dozen features from the analysis. Orbital periods between 0.2 and 100 days are examined. The result is a recovery of 76% of confirmed planets, 97% when period and transit depth constraints are added. The classifier is then applied to the full Kepler dataset; 1,004 previously noticed and 97 new stars have light curve criteria consistent with the confirmed planets, after subjective vetting removes clear False Alarms and False Positive cases. The 97 Kepler ARPS Candidate Transits mostly have periods $P<10$ days; many are UltraShort Period hot planets with radii $<1$% of the host star. Extensive tabular and graphical output from the ARPS time series analysis is provided to assist in other research relating to the Kepler sample.
Submitted 23 May, 2019;
originally announced May 2019.
-
Autoregressive Times Series Methods for Time Domain Astronomy
Authors:
Eric D. Feigelson,
G. Jogesh Babu,
Gabriel A. Caceres
Abstract:
Celestial objects exhibit a wide range of variability in brightness at different wavebands. Surprisingly, the most common method for characterizing time series in statistics -- parametric autoregressive modeling -- is rarely used to interpret astronomical light curves. We review standard ARMA, ARIMA and ARFIMA (autoregressive moving average fractionally integrated) models that treat short-memory autocorrelation, long-memory $1/f^\alpha$ `red noise', and nonstationary trends. Though designed for evenly spaced time series, moderately irregular cadences can be treated as evenly spaced time series with missing data. Fitting algorithms are efficient and software implementations are widely available. We apply ARIMA models to light curves of four variable stars, discussing their effectiveness for different temporal characteristics. A variety of extensions to ARIMA are outlined, with emphasis on recently developed continuous-time models like CARMA and CARFIMA designed for irregularly spaced time series. Strengths and weaknesses of ARIMA-type modeling for astronomical data analysis and astrophysical insights are reviewed.
Submitted 23 January, 2019;
originally announced January 2019.
-
AutoRegressive Planet Search: Methodology
Authors:
Gabriel A. Caceres,
Eric D. Feigelson,
G. Jogesh Babu,
Natalia Bahamonde,
Alejandra Christen,
Karine Bertin,
Cristian Meza,
Michel Curé
Abstract:
The detection of periodic signals from transiting exoplanets is often impeded by extraneous aperiodic photometric variability, either intrinsic to the star or arising from the measurement process. Frequently, these variations are autocorrelated wherein later flux values are correlated with previous ones. In this work, we present the methodology of the Autoregressive Planet Search (ARPS) project which uses Autoregressive Integrated Moving Average (ARIMA) and related statistical models that treat a wide variety of stochastic processes, as well as nonstationarity, to improve detection of new planetary transits. Provided a time series is evenly spaced or can be placed on an evenly spaced grid with missing values, these low-dimensional parametric models can prove very effective. We introduce a planet-search algorithm to detect periodic transits in the residuals after the application of ARIMA models. Our matched-filter algorithm, the Transit Comb Filter (TCF), is closely related to the traditional Box-fitting Least Squares and provides an analogous periodogram. Finally, if a previously identified or simulated sample of planets is available, selected scalar features from different stages of the analysis -- the original light curves, ARIMA fits, TCF periodograms, and folded light curves -- can be collectively used with a multivariate classifier to identify promising candidates while efficiently rejecting false alarms. We use Random Forests for this task, in conjunction with Receiver Operating Characteristic (ROC) curves, to define discovery criteria for new, high fidelity planetary candidates. The ARPS methodology can be applied to both evenly spaced satellite light curves and densely cadenced ground-based photometric surveys.
Submitted 14 May, 2019; v1 submitted 15 January, 2019;
originally announced January 2019.
-
Some Optimizations on Detecting Gravitational Wave Using Convolutional Neural Network
Authors:
Xiangru Li,
Woliang Yu,
Xilong Fan,
G. Jogesh Babu
Abstract:
This work investigates the problem of detecting gravitational wave (GW) events based on simulated damped sinusoid signals contaminated with white Gaussian noise. It is treated as a classification problem with one class for the interesting events. The proposed scheme consists of the following two successive steps: decomposing the data using a wavelet packet, representing the GW signal and noise using the derived decomposition coefficients; and determining the existence of any GW event using a convolutional neural network (CNN) with a logistic regression output layer. This work is characterized by its comprehensive investigation of CNN structure, detection window width, data resolution, wavelet packet decomposition and detection window overlap scheme. Extensive simulation experiments show excellent performance for reliable detection of signals with a range of GW model parameters and signal-to-noise ratios. While we use a simple waveform model in this study, we expect the method to be particularly valuable when the potential GW shapes are too complex to be characterized with a template bank.
Submitted 29 May, 2020; v1 submitted 1 December, 2017;
originally announced December 2017.
-
VOStat: A Statistical Web Service for Astronomers
Authors:
Arnab Chakraborty,
Eric D. Feigelson,
G. Jogesh Babu
Abstract:
VOStat is a Web service providing interactive statistical analysis of astronomical tabular datasets. It is integrated into the suite of analysis and visualization tools associated with the international Virtual Observatory (VO) through the SAMP communication system. A user supplies VOStat with a dataset extracted from the VO, or otherwise acquired, and chooses among $\sim 60$ statistical functions. These include data transformations, plots and summaries, density estimation, one- and two-sample hypothesis tests, global and local regressions, multivariate analysis and clustering, spatial analysis, directional statistics, survival analysis (for censored data like upper limits), and time series analysis. The statistical operations are performed using the public domain R statistical software environment, including a small fraction of its $>4000$ CRAN add-on packages. The purpose of VOStat is to facilitate a wider range of statistical analyses than are commonly used in astronomy, and to promote use of more advanced methodology in R and CRAN.
Submitted 2 February, 2013;
originally announced February 2013.
-
The Astrophysical Multimessenger Observatory Network (AMON)
Authors:
M. W. E. Smith,
D. B. Fox,
D. F. Cowen,
P. Mészáros,
G. Tešić,
J. Fixelle,
I. Bartos,
P. Sommers,
Abhay Ashtekar,
G. Jogesh Babu,
S. D. Barthelmy,
S. Coutu,
T. DeYoung,
A. D. Falcone,
L. S. Finn,
Shan Gao,
B. Hashemi,
A. Homeier,
S. Márka,
B. J. Owen,
I. Taboada
Abstract:
We summarize the science opportunity, design elements, current and projected partner observatories, and anticipated science returns of the Astrophysical Multimessenger Observatory Network (AMON). AMON will link multiple current and future high-energy, multimessenger, and follow-up observatories together into a single network, enabling near real-time coincidence searches for multimessenger astrophysical transients and their electromagnetic counterparts. Candidate and high-confidence multimessenger transient events will be identified, characterized, and distributed as AMON alerts within the network and to interested external observers, leading to follow-up observations across the electromagnetic spectrum. In this way, AMON aims to evoke the discovery of multimessenger transients from within observatory subthreshold data streams and facilitate the exploitation of these transients for purposes of astronomy and fundamental physics. As a central hub of global multimessenger science, AMON will also enable cross-collaboration analyses of archival datasets in search of rare or exotic astrophysical phenomena.
Submitted 23 November, 2012;
originally announced November 2012.
-
Statistical Methods for Astronomy
Authors:
Eric D. Feigelson,
G. Jogesh Babu
Abstract:
This review outlines concepts of mathematical statistics, elements of probability theory, hypothesis tests and point estimation for use in the analysis of modern astronomical data. Least squares, maximum likelihood, and Bayesian approaches to statistical inference are treated. Resampling methods, particularly the bootstrap, provide valuable procedures when distribution functions of statistics are not known. Several approaches to model selection and goodness of fit are considered. Applied statistics relevant to astronomical research are briefly discussed: nonparametric methods for use when little is known about the behavior of the astronomical populations or processes; data smoothing with kernel density estimation and nonparametric regression; unsupervised clustering and supervised classification procedures for multivariate problems; survival analysis for astronomical datasets with nondetections; time- and frequency-domain time series analysis for light curves; and spatial statistics to interpret the spatial distributions of points in low dimensions. Two types of resources are presented: about 40 recommended texts and monographs in various fields of statistics, and the public domain R software system for statistical analysis. Together with its $\sim 3500$ (and growing) add-on CRAN packages, R implements a vast range of statistical procedures in a coherent high-level language with advanced graphics.
Submitted 9 May, 2012;
originally announced May 2012.
-
Limit theorems for functions of marginal quantiles
Authors:
G. Jogesh Babu,
Zhidong Bai,
Kwok Pui Choi,
Vasudevan Mangalam
Abstract:
Multivariate distributions are explored using the joint distributions of marginal sample quantiles. Limit theory for the mean of a function of order statistics is presented. The results include a multivariate central limit theorem and a strong law of large numbers. A result similar to Bahadur's representation of quantiles is established for the mean of a function of the marginal quantiles. In particular, it is shown that \[\sqrt{n}\Biggl(\frac{1}{n}\sum_{i=1}^n\phi\bigl(X_{n:i}^{(1)},\ldots,X_{n:i}^{(d)}\bigr)-\bar{\gamma}\Biggr)=\frac{1}{\sqrt{n}}\sum_{i=1}^nZ_{n,i}+\mathrm{o}_P(1)\] as $n\rightarrow\infty$, where $\bar{\gamma}$ is a constant and $Z_{n,i}$ are i.i.d. random variables for each $n$. This leads to the central limit theorem. Weak convergence to a Gaussian process using equicontinuity of functions is indicated. The results are established under very general conditions. These conditions are shown to be satisfied in many commonly occurring situations.
Submitted 22 April, 2011;
originally announced April 2011.
-
A statistical model for the relation between exoplanets and their host stars
Authors:
E. Martinez-Gomez,
G. J. Babu
Abstract:
A general model is proposed to explain the relation between the extrasolar planets (or exoplanets) detected until June 2008 and the main characteristics of their host stars through statistical techniques. The main goal is to establish a mathematical relation among the set of variables which better describe the physical characteristics of the host star and the planet itself. The host star is characterized by its distance, age, effective temperature, mass, metallicity, radius and magnitude. The exoplanet is described through its physical parameters (radius and mass) and its orbital parameters (distance, period, eccentricity, inclination and major semiaxis). As a first approach we consider that only the mass of the exoplanet is being determined by the physical properties of its host star. The proposed model is then validated through statistical analysis. Finally we discuss the categorical behavior of the dependent variable through binary models.
Submitted 27 August, 2009;
originally announced August 2009.
-
Object detection in multi-epoch data
Authors:
G. Jogesh Babu,
Ashish Mahabal,
S. G. Djorgovski,
R. Williams
Abstract:
In astronomy multiple images are frequently obtained at the same position of the sky for follow-up co-addition as it helps one go deeper and look for fainter objects. With large scale panchromatic synoptic surveys becoming more common, image co-addition has become even more necessary as new observations start to get compared with co-added fiducial sky in real time. The standard co-addition techniques have included straight averages, variance weighted averages, medians etc. A more sophisticated nonlinear response chi-square method is also used when it is known that the data are background noise limited and the point spread function is homogenized in all channels. A more robust object detection technique capable of detecting faint sources, even those not seen at all epochs which will normally be smoothed out in traditional methods, is described. The analysis at each pixel level is based on a formula similar to Mahalanobis distance. The method does not depend on the point spread function.
Submitted 22 December, 2006;
originally announced December 2006.
-
Statistical Challenges in Modern Astronomy
Authors:
E. D. Feigelson,
G. J. Babu
Abstract:
Despite centuries of close association, statistics and astronomy are surprisingly distant today. Most observational astronomical research relies on an inadequate toolbox of methodological tools. Yet the needs are substantial: astronomy encounters sophisticated problems involving sampling theory, survival analysis, multivariate classification and analysis, time series analysis, wavelet analysis, spatial point processes, nonlinear regression, bootstrap resampling and model selection. We review the recent resurgence of astrostatistical research, and outline new challenges raised by the emerging Virtual Observatory. Our essay ends with a list of research challenges and infrastructure for astrostatistics in the coming decade.
Submitted 20 January, 2004;
originally announced January 2004.
-
Three types of gamma-ray bursts
Authors:
Soma Mukherjee,
Eric D. Feigelson,
Gutti Jogesh Babu,
Fionn Murtagh,
Chris Fraley,
Adrian Raftery
Abstract:
A multivariate analysis of gamma-ray burst (GRB) bulk properties is presented to discriminate between distinct classes of GRBs. Several variables representing burst duration, fluence and spectral hardness are considered. Two multivariate clustering procedures are used on a sample of 797 bursts from the Third BATSE Catalog: a nonparametric average linkage hierarchical agglomerative clustering procedure validated with Wilks' $\Lambda^*$ and other MANOVA tests; and a parametric maximum likelihood model-based clustering procedure assuming multinormal populations calculated with the EM Algorithm and validated with the Bayesian Information Criterion.
The two methods yield very similar results. The BATSE GRB population consists of three classes with the following Duration/Fluence/Spectrum bulk properties: Class I with long/bright/intermediate bursts, Class II with short/faint/hard bursts, and Class III with intermediate/intermediate/soft bursts. One outlier with poor data is also present. Classes I and II correspond to those reported by Kouveliotou et al. (1993), but Class III is clearly defined here for the first time.
Submitted 7 February, 1998;
originally announced February 1998.