-
Estimating Causal Effects with Hidden Confounding using Instrumental Variables and Environments
Authors:
James P. Long,
Hongxu Zhu,
Kim-Anh Do,
Min Jin Ha
Abstract:
Recent works have proposed regression models which are invariant across data collection environments. These estimators often have a causal interpretation under conditions on the environments and type of invariance imposed. One recent example, the Causal Dantzig (CD), is consistent under hidden confounding and represents an alternative to classical instrumental variable estimators such as Two Stage Least Squares (TSLS). In this work we derive the CD as a generalized method of moments (GMM) estimator. The GMM representation leads to several practical results, including 1) creation of the Generalized Causal Dantzig (GCD) estimator which can be applied to problems with continuous environments where the CD cannot be fit 2) a Hybrid (GCD-TSLS combination) estimator which has properties superior to GCD or TSLS alone 3) straightforward asymptotic results for all methods using GMM theory. We compare the CD, GCD, TSLS, and Hybrid estimators in simulations and an application to a Flow Cytometry data set. The newly proposed GCD and Hybrid estimators have superior performance to existing methods in many settings.
Submitted 9 November, 2023; v1 submitted 29 July, 2022;
originally announced July 2022.
-
Causal Models, Prediction, and Extrapolation in Cell Line Perturbation Experiments
Authors:
James P. Long,
Yumeng Yang,
Kim-Anh Do
Abstract:
In cell line perturbation experiments, a collection of cells is perturbed with external agents (e.g. drugs) and responses such as protein expression measured. Due to cost constraints, only a small fraction of all possible perturbations can be tested in vitro. This has led to the development of computational (in silico) models which can predict cellular responses to perturbations. Perturbations with clinically interesting predicted responses can be prioritized for in vitro testing. In this work, we compare causal and non-causal regression models for perturbation response prediction in a Melanoma cancer cell line. The current best performing method on this data set is Cellbox, which models how proteins causally affect each other using a system of ordinary differential equations (ODEs). We derive a closed form solution to the Cellbox system of ODEs in the linear case. These analytic results facilitate comparison of Cellbox to regression approaches. We show that causal models such as Cellbox, while requiring more assumptions, enable extrapolation in ways that non-causal regression models cannot. For example, causal models can predict responses for never before tested drugs. We illustrate these strengths and weaknesses in simulations. In an application to the Melanoma cell line data, we find that regression models outperform the Cellbox causal model.
Submitted 20 July, 2022;
originally announced July 2022.
-
Sample Selection Bias in Evaluation of Prediction Performance of Causal Models
Authors:
James P. Long,
Min Jin Ha
Abstract:
Causal models are notoriously difficult to validate because they make untestable assumptions regarding confounding. New scientific experiments offer the possibility of evaluating causal models using prediction performance. Prediction performance measures are typically robust to violations in causal assumptions. However, prediction performance does depend on the selection of training and test sets. Biased training sets can lead to optimistic assessments of model performance. In this work, we revisit the prediction performance of several recently proposed causal models tested on a genetic perturbation data set of Kemmeren. We find that sample selection bias is likely a key driver of model performance. We propose using a less-biased evaluation set for assessing prediction performance and compare models on this new set. In this setting, the causal models have similar or worse performance compared to standard association-based estimators such as Lasso. Finally, we compare the performance of causal estimators in simulation studies that reproduce the Kemmeren structure of genetic knockout experiments but without any sample selection bias. These results provide an improved understanding of the performance of several causal models and offer guidance on how future studies should use Kemmeren.
Submitted 26 October, 2021; v1 submitted 3 June, 2021;
originally announced June 2021.
-
A Framework for Mediation Analysis with Multiple Exposures, Multivariate Mediators, and Non-Linear Response Models
Authors:
James P. Long,
Ehsan Irajizad,
James D. Doecke,
Kim-Anh Do,
Min Jin Ha
Abstract:
Mediation analysis seeks to identify and quantify the paths by which an exposure affects an outcome. Intermediate variables which are affected by the exposure and which affect the outcome are known as mediators. There exists extensive work on mediation analysis in the context of models with a single mediator and continuous and binary outcomes. However, these methods are often not suitable for multi-omic data that include highly interconnected variables measuring biological mechanisms and various types of outcome variables such as censored survival responses. In this article, we develop a general framework for causal mediation analysis with multiple exposures, multivariate mediators, and continuous, binary, and survival responses. We estimate mediation effects on several scales including the mean difference, odds ratio, and restricted mean scale as appropriate for various outcome models. Our estimation method avoids imposing constraints on model parameters such as the rare disease assumption while accommodating continuous exposures. We evaluate the framework and compare it to other methods in extensive simulation studies by assessing bias, type I error and power at a range of sample sizes, disease prevalences, and number of false mediators. Using Kidney Renal Clear Cell Carcinoma data from The Cancer Genome Atlas, we identify proteins which mediate the effect of metabolic gene expression on survival. Software for implementing this unified framework is made available in an R package (https://github.com/longjp/mediateR).
Submitted 11 November, 2020;
originally announced November 2020.
-
Identification of RR Lyrae stars in multiband, sparsely-sampled data from the Dark Energy Survey using template fitting and Random Forest classification
Authors:
K. M. Stringer,
J. P. Long,
L. M. Macri,
J. L. Marshall,
A. Drlica-Wagner,
C. E. Martínez-Vázquez,
A. K. Vivas,
K. Bechtol,
E. Morganson,
M. Carrasco Kind,
A. B. Pace,
A. R. Walker,
C. Nielsen,
T. S. Li,
E. Rykoff,
D. Burke,
A. Carnero Rosell,
E. Neilsen,
P. Ferguson,
S. A. Cantu,
J. L. Myron,
L. Strigari,
A. Farahi,
F. Paz-Chinchón,
D. Tucker
, et al. (53 additional authors not shown)
Abstract:
Many studies have shown that RR Lyrae variable stars (RRL) are powerful stellar tracers of Galactic halo structure and satellite galaxies. The Dark Energy Survey (DES), with its deep and wide coverage (g ~ 23.5 mag in a single exposure; over 5000 deg$^{2}$), provides a rich opportunity to search for substructures out to the edge of the Milky Way halo. However, the sparse and unevenly sampled multiband light curves from the DES wide-field survey (median 4 observations in each of grizY over the first three years) pose a challenge for traditional techniques used to detect RRL. We present an empirically motivated and computationally efficient template fitting method to identify these variable stars using three years of DES data. When tested on DES light curves of previously classified objects in SDSS stripe 82, our algorithm recovers 89% of RRL periods to within 1% of their true value with 85% purity and 76% completeness. Using this method, we identify 5783 RRL candidates, ~31% of which are previously undiscovered. This method will be useful for identifying RRL in other sparse multiband data sets.
Submitted 1 May, 2019;
originally announced May 2019.
-
A Flexible Procedure for Mixture Proportion Estimation in Positive-Unlabeled Learning
Authors:
Zhenfeng Lin,
James P. Long
Abstract:
Positive-unlabeled (PU) learning considers two samples, a positive set P with observations from only one class and an unlabeled set U with observations from two classes. The goal is to classify observations in U. Class mixture proportion estimation (MPE) in U is a key step in PU learning. Blanchard et al. [2010] showed that MPE in PU learning is a generalization of the problem of estimating the proportion of true null hypotheses in multiple testing problems. Motivated by this idea, we propose reducing the problem to one dimension via construction of a probabilistic classifier trained on the P and U data sets followed by application of a one-dimensional mixture proportion method from the multiple testing literature to the observation class probabilities. The flexibility of this framework lies in the freedom to choose the classifier and the one-dimensional MPE method. We prove consistency of two mixture proportion estimators using bounds from empirical process theory, develop tuning parameter free implementations, and demonstrate that they have competitive performance on simulated waveform data and a protein signaling problem.
Submitted 9 January, 2020; v1 submitted 29 January, 2018;
originally announced January 2018.
-
Statistical methods in astronomy
Authors:
James P. Long,
Rafael S. de Souza
Abstract:
We present a review of data types and statistical methods often encountered in astronomy. The aim is to provide an introduction to statistical applications in astronomy for statisticians and computer scientists. We highlight the complex, often hierarchical, nature of many astronomy inference problems and advocate for cross-disciplinary collaborations to address these challenges.
Submitted 19 October, 2017; v1 submitted 16 July, 2017;
originally announced July 2017.
-
Active Tuning of Surface Phonon Polariton Resonances via Carrier Photoinjection
Authors:
Adam D. Dunkelberger,
Chase T. Ellis,
Daniel C. Ratchford,
Alexander J. Giles,
Mijin Kim,
Chul Soo Kim,
Bryan T. Spann,
Igor Vurgaftman,
Joseph G. Tischler,
James P. Long,
Orest J. Glembocki,
Jeffrey C. Owrutsky,
Joshua D. Caldwell
Abstract:
Surface-phonon polaritons (SPhPs) are attractive alternatives to far-infrared plasmonics for sub-diffractional confinement of light. Localized SPhP resonances in semiconductor nanoresonators are very narrow, but that linewidth and the limited extent of the Reststrahlen band inherently limit spectral coverage. To address this limitation, we report active tuning of SPhP resonances in InP and 4H-SiC by photoinjecting free carriers into the nanoresonators, taking advantage of the coupling between the carrier plasma and optical phonons to blue-shift SPhP resonances. We demonstrate state-of-the-art tuning figures of merit upon continuous-wave (CW) excitation (in InP) or pulsed excitation (in 4H-SiC). Lifetime effects cause the tuning to saturate in InP, and carrier-redistribution leads to rapid (<50 ps) recovery of the tuning in 4H-SiC. This work opens the path toward actively tuned nanophotonic devices, such as modulators and beacons, in the infrared and identifies important implications of coupling between electronic and photonic excitations.
Submitted 16 May, 2017;
originally announced May 2017.
-
Photoinduced tunability of the Reststrahlen band in 4H-SiC
Authors:
Bryan T. Spann,
Ryan Compton,
Daniel Ratchford,
James P. Long,
Adam D. Dunkelberger,
Paul B. Klein,
Alexander J. Giles,
Joshua D. Caldwell,
Jeffrey C. Owrutsky
Abstract:
Materials with a negative dielectric permittivity (e.g. metals) display high reflectance and can be shaped into nanoscale optical-resonators exhibiting extreme mode confinement, a central theme of nanophotonics. However, the ability to actively tune these effects remains elusive. By photoexciting free carriers in 4H-SiC, we induce dramatic changes in reflectance near the "Reststrahlen band" where the permittivity is negative due to charge oscillations of the polar optical phonons in the mid-infrared. We infer carrier-induced changes in the permittivity required for useful tunability (~ 40 cm$^{-1}$) in nanoscale resonators, providing a direct avenue towards the realization of actively tunable nanophotonic devices in the mid-infrared to terahertz spectral range.
Submitted 30 November, 2015;
originally announced November 2015.
-
A Note on Parameter Estimation for Misspecified Regression Models with Heteroskedastic Errors
Authors:
James P. Long
Abstract:
Misspecified models often provide useful information about the true data generating distribution. For example, if $y$ is a non-linear function of $x$ the least squares estimator $\hat{β}$ is an estimate of $β$, the slope of the best linear approximation to the non-linear function. Motivated by problems in astronomy, we study how to incorporate observation measurement error variances into fitting parameters of misspecified models. Our asymptotic theory focuses on the particular case of linear regression where often weighted least squares procedures are used to account for heteroskedasticity. We find that when the response is a non-linear function of the independent variable, the standard procedure of weighting by the inverse of the observation variances can be counter-productive. In particular, ordinary least squares may have lower asymptotic variance. We construct an adaptive estimator which has lower asymptotic variance than either OLS or standard WLS. We demonstrate our theory in a small simulation and apply these ideas to the problem of estimating the period of a periodic function using a sinusoidal model.
Submitted 15 May, 2017; v1 submitted 18 September, 2015;
originally announced September 2015.
-
A Multiband Generalization of the Analysis of Variance Period Estimation Algorithm and the Effect of Inter-band Observing Cadence on Period Recovery Rate
Authors:
Nicholas Mondrik,
James P. Long,
Jennifer L. Marshall
Abstract:
We present a new method of extending the single band Analysis of Variance period estimation algorithm to multiple bands. We use SDSS Stripe 82 RR Lyrae to show that in the case of low number of observations per band and non-simultaneous observations, improvements in period recovery rates of up to $\approx$60% are observed. We also investigate the effect of inter-band observing cadence on period recovery rates. We find that using non-simultaneous observation times between bands is ideal for the multiband method, and using simultaneous multiband data is only marginally better than using single band data. These results will be particularly useful in planning observing cadences for wide-field astronomical imaging surveys such as LSST. They also have the potential to improve the extraction of transient data from surveys with few ($\lesssim 30$) observations per band across several bands, such as the Dark Energy Survey.
Submitted 19 August, 2015;
originally announced August 2015.
-
A Study of Functional Depths
Authors:
James P. Long,
Jianhua Z. Huang
Abstract:
Functional depth is used for ranking functional observations from most outlying to most typical. The ranks produced by functional depth have been proposed as the basis for functional classifiers, rank tests, and data visualization procedures. Many of the proposed functional depths are invariant to domain permutation, an unusual property for a functional data analysis procedure. Essentially these depths treat functional data as if it were multivariate data. In this work, we compare the performance of several existing functional depths to a simple adaptation of an existing multivariate depth notion, $L^\infty$ depth ($L^{\infty}D$). On simulated and real data, we show $L^{\infty}D$ has performance comparable or superior to several existing notions of functional depth. In addition, we review how depth functions are evaluated and propose some improvements. In particular, we show that empirical depth function asymptotics can be misleading and instead propose a new method, the rank-rank plot, for evaluating empirical depth rank stability.
Submitted 1 November, 2016; v1 submitted 3 June, 2015;
originally announced June 2015.
-
Estimating a Common Period for a Set of Irregularly Sampled Functions with Applications to Periodic Variable Star Data
Authors:
James P. Long,
Eric C. Chi,
Richard G. Baraniuk
Abstract:
We consider the estimation of a common period for a set of functions sampled at irregular intervals. The problem arises in astronomy, where the functions represent a star's brightness observed over time through different photometric filters. While current methods can estimate periods accurately provided that the brightness is well-sampled in at least one filter, there are no existing methods that can provide accurate estimates when no brightness function is well-sampled. In this paper we introduce two new methods for period estimation when brightnesses are poorly-sampled in all filters. The first, multiband generalized Lomb-Scargle (MGLS), extends the frequently used Lomb-Scargle method in a way that naïvely combines information across filters. The second, penalized generalized Lomb-Scargle (PGLS), builds on the first by more intelligently borrowing strength across filters. Specifically, we incorporate constraints on the phases and amplitudes across the different functions using a non-convex penalized likelihood function. We develop a fast algorithm to optimize the penalized likelihood by combining block coordinate descent with the majorization-minimization (MM) principle. We illustrate our methods on synthetic and real astronomy data. Both advance the state-of-the-art in period estimation; however, PGLS significantly outperforms MGLS when all functions are extremely poorly-sampled.
Submitted 19 December, 2014;
originally announced December 2014.
-
Kernel Density Estimation with Berkson Error
Authors:
James P. Long,
Noureddine El Karoui,
John A. Rice
Abstract:
Given a sample $\{X_i\}_{i=1}^n$ from $f_X$, we construct kernel density estimators for $f_Y$, the convolution of $f_X$ with a known error density $f_ε$. This problem is known as density estimation with Berkson error and has applications in epidemiology and astronomy. Little is understood about bandwidth selection for Berkson density estimation. We compare three approaches to selecting the bandwidth both asymptotically, using large sample approximations to the MISE, and at finite samples, using simulations. Our results highlight the relationship between the structure of the error $f_ε$ and the optimal bandwidth. In particular, the results demonstrate the importance of smoothing when the error term $f_ε$ is concentrated near 0. We propose a data-driven bandwidth estimator and test its performance on NO$_2$ exposure data.
Submitted 29 July, 2014; v1 submitted 14 January, 2014;
originally announced January 2014.
-
Electronic Hybridization of Large-Area Stacked Graphene Films
Authors:
Jeremy T. Robinson,
Scott W. Schmucker,
C. Bogdan Diaconescu,
James P. Long,
James C. Culbertson,
Taisuke Ohta,
Adam L. Friedman,
Thomas E. Beechem
Abstract:
Direct, tunable coupling between individually assembled graphene layers is a next step towards designer two-dimensional (2D) crystal systems, with relevance for fundamental studies and technological applications. Here we describe the fabrication and characterization of large-area (> cm^2), coupled bilayer graphene on SiO2/Si substrates. Stacking two graphene films leads to direct electronic interactions between layers, where the resulting film properties are determined by the local twist angle. Polycrystalline bilayer films have a "stained-glass window" appearance explained by the emergence of a narrow absorption band in the visible spectrum that depends on twist angle. Direct measurement of layer orientation via electron diffraction, together with Raman and optical spectroscopy, confirms the persistence of clean interfaces over large areas. Finally, we demonstrate that interlayer coupling can be reversibly turned off through chemical modification, enabling optical-based chemical detection schemes. Together, these results suggest that individual 2D crystals can be individually assembled to form electronically coupled systems suitable for large-scale applications.
Submitted 2 January, 2013;
originally announced January 2013.
-
Optimizing Automated Classification of Periodic Variable Stars in New Synoptic Surveys
Authors:
James P. Long,
Noureddine El Karoui,
John A. Rice,
Joseph W. Richards,
Joshua S. Bloom
Abstract:
Efficient and automated classification of periodic variable stars is becoming increasingly important as the scale of astronomical surveys grows. Several recent papers have used methods from machine learning and statistics to construct classifiers on databases of labeled, multi-epoch sources with the intention of using these classifiers to automatically infer the classes of unlabeled sources from new surveys. However, the same source observed with two different synoptic surveys will generally yield different derived metrics (features) from the light curve. Since such features are used in classifiers, this survey-dependent mismatch in feature space will typically lead to degraded classifier performance. In this paper we show how and why feature distributions change using OGLE and Hipparcos light curves. To overcome survey systematics, we apply a method, noisification, which attempts to empirically match distributions of features between the labeled sources used to construct the classifier and the unlabeled sources we wish to classify. Results from simulated and real-world light curves show that noisification can significantly improve classifier performance. In a three-class problem using light curves from Hipparcos and OGLE, noisification reduces the classifier error rate from 27.0% to 7.0%. We recommend that noisification be used for upcoming surveys such as Gaia and LSST and describe some of the promises and challenges of applying noisification to these surveys.
Submitted 23 February, 2012; v1 submitted 23 January, 2012;
originally announced January 2012.
-
Active Learning to Overcome Sample Selection Bias: Application to Photometric Variable Star Classification
Authors:
Joseph W. Richards,
Dan L. Starr,
Henrik Brink,
Adam A. Miller,
Joshua S. Bloom,
Nathaniel R. Butler,
J. Berian James,
James P. Long,
John Rice
Abstract:
Despite the great promise of machine-learning algorithms to classify and predict astrophysical parameters for the vast numbers of astrophysical sources and transients observed in large-scale surveys, the peculiarities of the training data often manifest as strongly biased predictions on the data of interest. Typically, training sets are derived from historical surveys of brighter, more nearby objects than those from more extensive, deeper surveys (testing data). This sample selection bias can cause catastrophic errors in predictions on the testing data because a) standard assumptions for machine-learned model selection procedures break down and b) dense regions of testing space might be completely devoid of training data. We explore possible remedies to sample selection bias, including importance weighting (IW), co-training (CT), and active learning (AL). We argue that AL, where the data whose inclusion in the training set would most improve predictions on the testing set are queried for manual follow-up, is an effective approach and is appropriate for many astronomical applications. For a variable star classification problem on a well-studied set of stars from Hipparcos and OGLE, AL is the optimal method in terms of error rate on the testing data, beating the off-the-shelf classifier by 3.4% and the other proposed methods by at least 3.0%. To aid with manual labeling of variable stars, we developed a web interface which allows for easy light curve visualization and querying of external databases. Finally, we apply active learning to classify variable stars in the ASAS survey, finding dramatic improvement in our agreement with the ACVS catalog, from 65.5% to 79.5%, and a significant increase in the classifier's average confidence for the testing set, from 14.6% to 42.9%, after a few AL iterations.
Submitted 17 June, 2011; v1 submitted 14 June, 2011;
originally announced June 2011.