-
14 New Light Curves and an Updated Ephemeris for the Hot Jupiter HAT-P-54 b
Authors:
Heather B. Hewitt,
Bradley Hutson,
Michael Brockman,
Elizabeth Catogni,
Rosemary Ferreira,
Gary Fussell,
Atea Johnson,
Chris Kight,
Ryan A. Kilinski,
Khatu Nguyen,
Ty Perry,
Elizabeth Quinlan,
Eva Randazzo,
Kellan Reagan,
Kinley Subers,
Federico R. Noguer,
Molly N. Simon,
Robert T. Zellem
Abstract:
Here we present an analysis of 14 transit light curves of the hot Jupiter HAT-P-54 b. Thirteen of our datasets were obtained with the 6-inch MicroObservatory telescope, Cecilia, and one was measured with the 61-inch Kuiper Telescope. We used the EXOplanet Transit Interpretation Code (EXOTIC) to reduce 49 datasets in order to update the planet's ephemeris to a mid-transit time of 2460216.95257 +/- 0.00022 BJD_TDB and an updated orbital period of 3.79985363 +/- 0.00000037 days. These results improve the mid-transit uncertainty by 70.27% from the most recent ephemeris update. The updated mid-transit time can help to ensure the efficient use of expensive, large ground- and space-based telescope missions in the future. This result demonstrates that amateur astronomers and citizen scientists can provide meaningful, cost-efficient, crowd-sourcing observations using ground-based telescopes to further refine current mid-transit times and orbital periods.
Submitted 26 June, 2024;
originally announced June 2024.
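As a rough illustration of why a tighter ephemeris matters for scheduling, the sketch below propagates the mid-transit time and period quoted in the abstract above to a future transit epoch. It assumes a linear ephemeris and treats the two uncertainties as uncorrelated, which the abstract does not state; the function name and the choice of 100 orbits are hypothetical and not taken from the paper or from EXOTIC.

```python
import math

# Values quoted in the HAT-P-54 b abstract above.
T0 = 2460216.95257           # mid-transit time (BJD_TDB)
SIGMA_T0 = 0.00022           # uncertainty on T0 (days)
PERIOD = 3.79985363          # orbital period (days)
SIGMA_PERIOD = 0.00000037    # uncertainty on the period (days)


def propagate_ephemeris(t0, sigma_t0, period, sigma_period, epoch):
    """Predict the mid-transit time `epoch` orbits after t0.

    Assumption (not from the source): a linear ephemeris
    T_n = T0 + n * P with uncorrelated errors, so that
    sigma_n = sqrt(sigma_T0^2 + (n * sigma_P)^2).
    """
    t_mid = t0 + epoch * period
    sigma = math.sqrt(sigma_t0 ** 2 + (epoch * sigma_period) ** 2)
    return t_mid, sigma


if __name__ == "__main__":
    # Hypothetical example: the transit 100 orbits after the reference epoch.
    t_mid, sigma = propagate_ephemeris(T0, SIGMA_T0, PERIOD, SIGMA_PERIOD, 100)
    print(f"Predicted mid-transit: {t_mid:.5f} BJD_TDB")
    print(f"1-sigma uncertainty:   {sigma * 24 * 60:.2f} minutes")
```

Because the period term grows linearly with the number of elapsed orbits, even a small period uncertainty eventually dominates the timing error, which is why periodic ephemeris refreshes of this kind are useful.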
-
Control in the Coefficients of an Obstacle Problem
Authors:
Nicolai Simon,
Winnifried Wollner
Abstract:
In this work, we consider optimality conditions of an optimal control problem governed by an obstacle problem. Here, we focus on introducing a matrix-valued control variable as the coefficients of the obstacle problem. As is well known, obstacle problems can be formulated as a complementarity system, and consequently the associated solution operator is not Gateaux differentiable. As a consequence, we utilize a regularization approach to obtain optimality conditions as the limit of optimality conditions of a family of regularized problems.
Due to the coupling of the controlled coefficient with the gradients of the solution to the obstacle problem, weak convergence arguments cannot be applied and we need to argue by $H$-convergence. We show that, based on initial $H$-convergence, a bootstrapping argument can be utilized to prove strong $L^p$-convergence of the control and thus enable the passage to the limit in the optimality conditions.
Submitted 30 May, 2024;
originally announced May 2024.
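For readers unfamiliar with the complementarity formulation mentioned in the abstract above, a standard textbook form is sketched here. The notation is an assumption for illustration only: a lower obstacle $\psi$, a right-hand side $f$, a matrix-valued coefficient $A$ playing the role of the control, and a multiplier $\lambda$; the paper itself may use a different sign convention or an upper obstacle.

```latex
% Assumed setting: lower obstacle \psi, state u, coefficient (control) A.
\begin{aligned}
  -\operatorname{div}(A \nabla u) - \lambda &= f && \text{in } \Omega, \\
  u - \psi &\ge 0, \qquad \lambda \ge 0, \\
  \langle \lambda,\, u - \psi \rangle &= 0 .
\end{aligned}
```

The last line is the source of the nonsmoothness: at each point either the constraint is active ($u = \psi$) or the multiplier vanishes, so the map from $A$ to $u$ changes character across the boundary of the contact set.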
-
Enhancing Exoplanet Ephemerides by Leveraging Professional and Citizen Science Data: A Test Case with WASP-77A b
Authors:
Federico R. Noguer,
Suber Corley,
Kyle A. Pearson,
Robert T. Zellem,
Molly N. Simon,
Jennifer A. Burt,
Isabela Huckabee,
Prune C. August,
Megan Weiner Mansfield,
Paul A. Dalba,
Peter C. B. Smith,
Timothy Banks,
Ira Bell,
Dominique Daniel,
Lindsay Dawson,
Jesús De Mula,
Marc Deldem,
Dimitrios Deligeorgopoulos,
Romina P. Di Sisto,
Roger Dymock,
Phil Evans,
Giulio Follero,
Martin J. F. Fowler,
Eduardo Fernández-Lajús,
Alex Hamrick
, et al. (20 additional authors not shown)
Abstract:
We present an updated ephemeris and physical parameters for the exoplanet WASP-77 A b. In this effort, we combine 64 ground- and space-based transit observations, 6 space-based eclipse observations, and 32 radial velocity observations to produce the most precise orbital solution to date for this target, aiding in the planning of James Webb Space Telescope (JWST) and Ariel observations and atmospheric studies. We report a new orbital period of 1.360029395 +- 5.7e-8 days, a new mid-transit time of 2459957.337860 +- 4.3e-5 BJDTDB (Barycentric Julian Date in the Barycentric Dynamical Time scale; arXiv:1005.4415) and a new mid-eclipse time of 2459956.658192 +- 6.7e-5 BJDTDB. Furthermore, the methods presented in this study reduce the uncertainties in the planet mass to 1.6654 +- 4.5e-3 Mjup and orbital period to 1.360029395 +- 5.7e-8 days by factors of 15.1 and 10.9, respectively. Through a joint fit analysis comparison of transit data taken by space-based and citizen science-led initiatives, our study demonstrates the power of including data collected by citizen scientists compared to a fit of the space-based data alone. Additionally, by including a vast array of citizen science data from ExoClock, Exoplanet Transit Database (ETD), and Exoplanet Watch, we can increase our observational baseline and thus acquire better constraints on the forward propagation of our ephemeris than what is achievable with TESS data alone.
Submitted 4 June, 2024; v1 submitted 29 May, 2024;
originally announced May 2024.
-
Bringing Lecture-Tutorials Online: An Analysis of A New Strategy to Teach Planet Formation in the Undergraduate Classroom
Authors:
Haylee N. Archer,
Molly N. Simon,
Chris Mead,
Edward E. Prather,
Mia Brunkhorst,
Diana Hunsley
Abstract:
Previous studies conclusively show that pencil-and-paper lecture-tutorials (LTs) are incredibly effective at increasing student engagement and learning gains on a variety of topics when compared to traditional lecture. LTs in astronomy are post-lecture activities developed with the intention of helping students engage with conceptual and reasoning difficulties around a specific topic with the end goal of them developing a more expert-like understanding of astrophysical concepts. To date, all astronomy LTs have been developed for undergraduate courses taught in-person. Increases in online course enrollments and the COVID-19 pandemic further highlighted the need for additional interactive, research-based, curricular materials designed for online classrooms. To this end, we developed and assessed the efficacy of an innovative, interactive LT designed to teach planet formation in asynchronous, online, introductory astronomy courses for undergraduates. We utilized the Planet Formation Concept Inventory to compare learning outcomes between courses that implemented the new online, interactive LT, and those that used either a lecture-only approach or utilized a standard pencil-and-paper LT on the same topic. Overall, learning gains from the standard pencil-and-paper LT were statistically indistinguishable from the in-person implementation of the online LT and both of these conditions outperformed the lecture-only condition. However, when implemented asynchronously, learning gains from the online LT were lower and not significantly above the lecture-only condition. While the online LT can be improved in the future, the current discipline ideas still outperform traditional lecture, and can be used as a tool to teach planet formation effectively.
Submitted 29 May, 2024;
originally announced May 2024.
-
A Pilot Study from the First Course-Based Undergraduate Research Experience for Online Degree-Seeking Astronomy Students
Authors:
Justin Hom,
Jennifer Patience,
Karen Knierman,
Molly N. Simon,
Ara Austin
Abstract:
Research-based active learning approaches are critical for the teaching and learning of undergraduate STEM majors. Course-based undergraduate research experiences (CUREs) are becoming more commonplace in traditional, in-person academic environments, but have only just started to be utilized in online education. Online education has been shown to create accessible pathways to knowledge for individuals from nontraditional student backgrounds, and increasing the diversity of STEM fields has been identified as a priority for future generations of scientists and engineers. We developed and instructed a rigorous, six-week curriculum on the topic of observational astronomy, dedicated to educating second year online astronomy students in practices and techniques for astronomical research. Throughout the course, the students learned about telescopes, the atmosphere, filter systems, adaptive optics systems, astronomical catalogs, and image viewing and processing tools. We developed a survey informed by previous research validated assessments aimed to evaluate course feedback, course impact, student self-efficacy, student science identity and community values, and student sense of belonging. The survey was administered at the conclusion of the course to all eleven students yielding eight total responses. Although preliminary, the results of our analysis indicate that student confidence in utilizing the tools and skills taught in the course was significant. Students also felt a great sense of belonging to the astronomy community and increased confidence in conducting astronomical research in the future.
Submitted 23 May, 2024;
originally announced May 2024.
-
MonoNav: MAV Navigation via Monocular Depth Estimation and Reconstruction
Authors:
Nathaniel Simon,
Anirudha Majumdar
Abstract:
A major challenge in deploying the smallest of Micro Aerial Vehicle (MAV) platforms (< 100 g) is their inability to carry sensors that provide high-resolution metric depth information (e.g., LiDAR or stereo cameras). Current systems rely on end-to-end learning or heuristic approaches that directly map images to control inputs, and struggle to fly fast in unknown environments. In this work, we ask the following question: using only a monocular camera, optical odometry, and offboard computation, can we create metrically accurate maps to leverage the powerful path planning and navigation approaches employed by larger state-of-the-art robotic systems to achieve robust autonomy in unknown environments? We present MonoNav: a fast 3D reconstruction and navigation stack for MAVs that leverages recent advances in depth prediction neural networks to enable metrically accurate 3D scene reconstruction from a stream of monocular images and poses. MonoNav uses off-the-shelf pre-trained monocular depth estimation and fusion techniques to construct a map, then searches over motion primitives to plan a collision-free trajectory to the goal. In extensive hardware experiments, we demonstrate how MonoNav enables the Crazyflie (a 37 g MAV) to navigate fast (0.5 m/s) in cluttered indoor environments. We evaluate MonoNav against a state-of-the-art end-to-end approach, and find that the collision rate in navigation is significantly reduced (by a factor of 4). This increased safety comes at the cost of conservatism in terms of a 22% reduction in goal completion.
Submitted 23 November, 2023;
originally announced November 2023.
-
Nonparametric variable importance for time-to-event outcomes with application to prediction of HIV infection
Authors:
Charles J. Wolock,
Peter B. Gilbert,
Noah Simon,
Marco Carone
Abstract:
In survival analysis, complex machine learning algorithms have been increasingly used for predictive modeling. Given a collection of features available for inclusion in a predictive model, it may be of interest to quantify the relative importance of a subset of features for the prediction task at hand. In particular, in HIV vaccine trials, participant baseline characteristics are used to predict the probability of infection over the intended follow-up period, and investigators may wish to understand how much certain types of predictors, such as behavioral factors, contribute toward overall predictiveness. Time-to-event outcomes such as time to infection are often subject to right censoring, and existing methods for assessing variable importance are typically not intended to be used in this setting. We describe a broad class of algorithm-agnostic variable importance measures for prediction in the context of survival data. We propose a nonparametric efficient estimation procedure that incorporates flexible learning of nuisance parameters, yields asymptotically valid inference, and enjoys double-robustness. We assess the performance of our proposed procedure via numerical simulations and analyze data from the HVTN 702 study to inform enrollment strategies for future HIV vaccine trials.
Submitted 11 December, 2023; v1 submitted 21 November, 2023;
originally announced November 2023.
-
Improved convergence rates of nonparametric penalized regression under misspecified total variation
Authors:
Marlena S. Bannick,
Noah Simon
Abstract:
Penalties that induce smoothness are common in nonparametric regression. In many settings, the amount of smoothness in the data generating function will not be known. Simon and Shojaie (2021) derived convergence rates for nonparametric estimators under misspecified smoothness. We show that their theoretical convergence rates can be improved by working with convenient approximating functions. Properties of convolutions and higher-order kernels allow these approximation functions to match the true functions more closely than those used in Simon and Shojaie (2021). As a result, we obtain tighter convergence rates.
Submitted 2 August, 2023;
originally announced August 2023.
-
Coefficient Control of Variational Inequalities
Authors:
Andreas Hehl,
Denis Khimin,
Ira Neitzel,
Nicolai Simon,
Thomas Wick,
Winnifried Wollner
Abstract:
Within this chapter, we discuss control in the coefficients of an obstacle problem. Utilizing tools from H-convergence, we show existence of optimal solutions. First order necessary optimality conditions are obtained after deriving directional differentiability of the coefficient to solution mapping for the obstacle problem. Further, considering a regularized obstacle problem as a constraint yields a limiting optimality system after proving strong convergence of the regularized control and state variables. Numerical examples underline convergence with respect to the regularization. Finally, some numerical experiments highlight the possible extension of the results to coefficient control in phase-field fracture.
Submitted 3 July, 2023;
originally announced July 2023.
-
13 New Light Curves and Updated Mid-Transit Time and Period for Hot Jupiter WASP-104 b with EXOTIC
Authors:
Heather B. Hewitt,
Federico Noguer,
Suber Corley,
James Ball,
Claudia Chastain,
Richard Cochran-White,
Kendall Collins,
Kris Ganzel,
Kimberly Merriam Gray,
Mike Logan,
Steve Marquez-Perez,
Chyna Merchant,
Matthew Pedone,
Gina Plumey,
Matthew Rice,
Zachary Ruybal,
Molly N. Simon,
Isabela Huckabee,
Robert T. Zellem,
Kyle A. Pearson
Abstract:
Using the EXOplanet Transit Interpretation Code (EXOTIC), we reduced 52 sets of images of WASP-104 b, a Hot Jupiter-class exoplanet orbiting WASP-104, in order to obtain an updated mid-transit time (ephemeris) and orbital period for the planet. We performed this reduction on images taken with a 6-inch telescope of the Center for Astrophysics | Harvard & Smithsonian MicroObservatory. Of the reduced light curves, 13 were of sufficient accuracy to be used in updating the ephemerides for WASP-104 b, meeting or exceeding the three-sigma standard for determining a significant detection. Our final mid-transit value was 2457805.170208 +/- 0.000036 BJD_TDB and the final period value was 1.75540644 +/- 0.00000016 days. The true significance of our results is in their derivation from image sets gathered over time by a small, ground-based telescope as part of the Exoplanet Watch citizen science initiative, and their competitive results to an ephemeris generated from data gathered by the TESS telescope. We use these results to further show how such techniques can be employed by amateur astronomers and citizen scientists to maximize the efficacy of larger telescopes by reducing the use of expensive observation time. The work done in the paper was accomplished as part of the first fully online Course-Based Undergraduate Research Experience (CURE) for astronomy majors in the only online Bachelor of Science program in Astronomical and Planetary Sciences.
Submitted 29 June, 2023;
originally announced June 2023.
-
Online Learning for Obstacle Avoidance
Authors:
David Snyder,
Meghan Booker,
Nathaniel Simon,
Wenhan Xia,
Daniel Suo,
Elad Hazan,
Anirudha Majumdar
Abstract:
We approach the fundamental problem of obstacle avoidance for robotic systems via the lens of online learning. In contrast to prior work that either assumes worst-case realizations of uncertainty in the environment or a stationary stochastic model of uncertainty, we propose a method that is efficient to implement and provably grants instance-optimality with respect to perturbations of trajectories generated from an open-loop planner (in the sense of minimizing worst-case regret). The resulting policy adapts online to realizations of uncertainty and provably compares well with the best obstacle avoidance policy in hindsight from a rich class of policies. The method is validated in simulation on a dynamical system environment and compared to baseline open-loop planning and robust Hamilton-Jacobi reachability techniques. Further, it is implemented on a hardware example where a quadruped robot traverses a dense obstacle field and encounters input disturbances due to time delays, model uncertainty, and dynamics nonlinearities.
Submitted 5 November, 2023; v1 submitted 14 June, 2023;
originally announced June 2023.
-
A framework for leveraging machine learning tools to estimate personalized survival curves
Authors:
Charles J. Wolock,
Peter B. Gilbert,
Noah Simon,
Marco Carone
Abstract:
The conditional survival function of a time-to-event outcome subject to censoring and truncation is a common target of estimation in survival analysis. This parameter may be of scientific interest and also often appears as a nuisance in nonparametric and semiparametric problems. In addition to classical parametric and semiparametric methods (e.g., based on the Cox proportional hazards model), flexible machine learning approaches have been developed to estimate the conditional survival function. However, many of these methods are either implicitly or explicitly targeted toward risk stratification rather than overall survival function estimation. Others apply only to discrete-time settings or require inverse probability of censoring weights, which can be as difficult to estimate as the outcome survival function itself. Here, we employ a decomposition of the conditional survival function in terms of observable regression models in which censoring and truncation play no role. This allows application of an array of flexible regression and classification methods rather than only approaches that explicitly handle the complexities inherent to survival data. We outline estimation procedures based on this decomposition, empirically assess their performance, and demonstrate their use on data from an HIV vaccine trial.
Submitted 31 October, 2023; v1 submitted 6 November, 2022;
originally announced November 2022.
-
FlowDrone: Wind Estimation and Gust Rejection on UAVs Using Fast-Response Hot-Wire Flow Sensors
Authors:
Nathaniel Simon,
Allen Z. Ren,
Alexander Piqué,
David Snyder,
Daphne Barretto,
Marcus Hultmark,
Anirudha Majumdar
Abstract:
Unmanned aerial vehicles (UAVs) are finding use in applications that place increasing emphasis on robustness to external disturbances including extreme wind. However, traditional multirotor UAV platforms do not directly sense wind; conventional flow sensors are too slow, insensitive, or bulky for widespread integration on UAVs. Instead, drones typically observe the effects of wind indirectly through accumulated errors in position or trajectory tracking. In this work, we integrate a novel flow sensor based on micro-electro-mechanical systems (MEMS) hot-wire technology developed in our prior work onto a multirotor UAV for wind estimation. These sensors are omnidirectional, lightweight, fast, and accurate. In order to achieve superior tracking performance in windy conditions, we train a 'wind-aware' residual-based controller via reinforcement learning using simulated wind gusts and their aerodynamic effects on the drone. In extensive hardware experiments, we demonstrate the wind-aware controller outperforming two strong 'wind-unaware' baseline controllers in challenging windy conditions. See: https://youtu.be/KWqkH9Z-338.
Submitted 24 October, 2022; v1 submitted 11 October, 2022;
originally announced October 2022.
-
Fast-response hot-wire flow sensors for wind and gust estimation on UAVs
Authors:
Nathaniel Simon,
Alexander Piqué,
David Snyder,
Kyle Ikuma,
Anirudha Majumdar,
Marcus Hultmark
Abstract:
Due to limitations in available sensor technology, unmanned aerial vehicles (UAVs) lack an active sensing capability to measure turbulence, gusts, or other unsteady aerodynamic phenomena. Conventional in situ anemometry techniques fail to deliver in the harsh and dynamic multirotor environment due to form factor, resolution, or robustness requirements. To address this capability gap, a novel, fast-response sensor system to measure a wind vector in two dimensions is introduced and evaluated. This system, known as 'MAST' (for MEMS Anemometry Sensing Tower), leverages advances in microelectromechanical (MEMS) hot-wire devices to produce a solid-state, lightweight, and robust flow sensor suitable for real-time wind estimation onboard a UAV. The MAST uses five pentagonally-arranged microscale hot-wires to determine the wind vector's direction and magnitude. The MAST's performance was evaluated in a wind tunnel at speeds up to 5 m/s and orientations of 0 to 360 degrees. A neural network sensor model was trained from the wind tunnel data to estimate the wind vector from sensor signals. The average error of the sensor is 0.14 m/s for speed and 1.6 degrees for direction. Furthermore, 95% of measurements are within 0.36 m/s error for speed and 5.0 degree error for direction. With a bandwidth of 570 Hz determined from square-wave testing, the MAST stands to greatly enhance UAV wind estimation capabilities and enable capturing relevant high-frequency phenomena in flow conditions.
Submitted 26 October, 2022; v1 submitted 14 September, 2022;
originally announced September 2022.
-
Accounting for Inconsistent Use of Covariate Adjustment in Group Sequential Trials
Authors:
Marlena S. Bannick,
Sonya L. Heltshe,
Noah Simon
Abstract:
Group sequential designs in clinical trials allow for interim efficacy and futility monitoring. Adjustment for baseline covariates can increase power and precision of estimated effects. However, inconsistently applying covariate adjustment throughout the stages of a group sequential trial can result in inflation of type I error, biased point estimates, and anti-conservative confidence intervals. We propose methods for performing correct interim monitoring, estimation, and inference in this setting that avoid these issues. We focus on two-arm trials with simple, balanced randomization and continuous outcomes. We study the performance of our boundary, estimation, and inference adjustments in simulation studies. We end with recommendations about the application of covariate adjustment in group sequential designs.
Submitted 9 August, 2023; v1 submitted 24 June, 2022;
originally announced June 2022.
-
Regression in Tensor Product Spaces by the Method of Sieves
Authors:
Tianyu Zhang,
Noah Simon
Abstract:
Estimation of a conditional mean (linking a set of features to an outcome of interest) is a fundamental statistical task. While there is an appeal to flexible nonparametric procedures, effective estimation in many classical nonparametric function spaces (e.g., multivariate Sobolev spaces) can be prohibitively difficult -- both statistically and computationally -- especially when the number of features is large. In this paper, we present (penalized) sieve estimators for regression in nonparametric tensor product spaces: These spaces are more amenable to multivariate regression, and allow us to, in part, avoid the curse of dimensionality. Our estimators can be easily applied to multivariate nonparametric problems and have appealing statistical and computational properties. Moreover, they can effectively leverage additional structures such as feature sparsity. In this manuscript, we give theoretical guarantees, indicating that the predictive performance of our estimators scales favorably in dimension. In addition, we also present numerical examples to compare the finite-sample performance of the proposed estimators with several popular machine learning methods.
Submitted 7 June, 2022;
originally announced June 2022.
-
Comparison of Evaluation Metrics for Landmark Detection in CMR Images
Authors:
Sven Koehler,
Lalith Sharan,
Julian Kuhm,
Arman Ghanaat,
Jelizaveta Gordejeva,
Nike K. Simon,
Niko M. Grell,
Florian André,
Sandy Engelhardt
Abstract:
Cardiac Magnetic Resonance (CMR) images are widely used for cardiac diagnosis and ventricular assessment. Extracting specific landmarks like the right ventricular insertion points is of importance for spatial alignment and 3D modeling. The automatic detection of such landmarks has been tackled by multiple groups using Deep Learning, but relatively little attention has been paid to the failure cases of evaluation metrics in this field. In this work, we extended the public ACDC dataset with additional labels of the right ventricular insertion points and compared different variants of a heatmap-based landmark detection pipeline. In this comparison, we demonstrate very likely pitfalls of apparently simple detection and localisation metrics, which highlights the importance of a clear detection strategy and the definition of an upper limit for localisation-based metrics. Our preliminary results indicate that a combination of different metrics is necessary, as they yield different winners for method comparison. Additionally, they highlight the need for a comprehensive metric description and evaluation standardisation, especially for the error cases where no metrics could be computed or where no lower/upper boundary of a metric exists. Code and labels: https://github.com/Cardio-AI/rvip_landmark_detection
Submitted 28 January, 2022; v1 submitted 25 January, 2022;
originally announced January 2022.
-
Mesh-Based Solutions for Nonparametric Penalized Regression
Authors:
Brayan Ortiz,
Noah Simon
Abstract:
It is often of interest to estimate regression functions non-parametrically. Penalized regression (PR) is one statistically-effective, well-studied solution to this problem. Unfortunately, in many cases, finding exact solutions to PR problems is computationally intractable. In this manuscript, we propose a mesh-based approximate solution (MBS) for those scenarios. MBS transforms the complicated functional minimization of NPR, to a finite parameter, discrete convex minimization; and allows us to leverage the tools of modern convex optimization. We show applications of MBS in a number of explicit examples (including both uni- and multi-variate regression), and explore how the number of parameters must increase with our sample-size in order for MBS to maintain the rate-optimality of NPR. We also give an efficient algorithm to minimize the MBS objective while effectively leveraging the sparsity inherent in MBS.
Submitted 6 December, 2021;
originally announced December 2021.
-
The Future will be Different than Today: Model Evaluation Considerations when Developing Translational Clinical Biomarker
Authors:
Yichen Lu,
Jane Fridlyand,
Tiffany Tang,
Ting Qi,
Noah Simon,
Ning Leng
Abstract:
Finding translational biomarkers stands center stage of the future of personalized medicine in healthcare. We observed notable challenges in identifying robust biomarkers as some with great performance in one scenario often fail to perform well in new trials (e.g. different population, indications). With rapid development in the clinical trial world (e.g. assay, disease definition), new trials very likely differ from legacy ones in many perspectives and in development of biomarkers this heterogeneity should be considered. In response, we recommend considering building in the heterogeneity when evaluating biomarkers. In this paper, we present one evaluation strategy by using leave-one-study-out (LOSO) in place of conventional cross-validation (cv) methods to account for the potential heterogeneity across trials used for building and testing the biomarkers. To demonstrate the performance of K-fold vs LOSO cv in estimating the effect size of biomarkers, we leveraged data from clinical trials and simulation studies. In our assessment, LOSO cv provided a more objective estimate of the future performance. This conclusion remained true across different evaluation metrics and different statistical methods.
Submitted 13 July, 2021;
originally announced July 2021.
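
The leave-one-study-out idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' code: the function name `loso_splits` and the study labels are hypothetical, and the sketch only shows how folds are formed, one per study, so that each test fold comes from a trial the model never saw during fitting.

```python
from collections import defaultdict


def loso_splits(study_ids):
    """Yield (held_out_study, train_idx, test_idx), one fold per distinct study."""
    by_study = defaultdict(list)
    for i, study in enumerate(study_ids):
        by_study[study].append(i)
    for held_out in sorted(by_study):
        test_idx = by_study[held_out]
        # Train on every observation that belongs to a different study.
        train_idx = sorted(
            i for study, idx in by_study.items() if study != held_out for i in idx
        )
        yield held_out, train_idx, test_idx


# Hypothetical labels: six patients drawn from three trials.
study_ids = ["A", "A", "B", "C", "C", "C"]
folds = list(loso_splits(study_ids))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

In contrast, a conventional K-fold split would mix patients from the same trial across training and test folds, which is the heterogeneity issue the abstract raises.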
-
On the Optimality of Nuclear-norm-based Matrix Completion for Problems with Smooth Non-linear Structure
Authors:
Yunhua Xiang,
Tianyu Zhang,
Xu Wang,
Ali Shojaie,
Noah Simon
Abstract:
Originally developed for imputing missing entries in low rank, or approximately low rank matrices, matrix completion has proven widely effective in many problems where there is no reason to assume low-dimensional linear structure in the underlying matrix, as would be imposed by rank constraints. In this manuscript, we build some theoretical intuition for this behavior. We consider matrices which are not necessarily low-rank, but lie in a low-dimensional non-linear manifold. We show that nuclear-norm penalization is still effective for recovering these matrices when observations are missing completely at random. In particular, we give upper bounds on the rate of convergence as a function of the number of rows, columns, and observed entries in the matrix, as well as the smoothness and dimension of the non-linear embedding. We additionally give a minimax lower bound: This lower bound agrees with our upper bound (up to a logarithmic factor), which shows that nuclear-norm penalization is (up to log terms) minimax rate optimal for these problems.
Submitted 5 May, 2021;
originally announced May 2021.
-
A Sieve Stochastic Gradient Descent Estimator for Online Nonparametric Regression in Sobolev ellipsoids
Authors:
Tianyu Zhang,
Noah Simon
Abstract:
The goal of regression is to recover an unknown underlying function that best links a set of predictors to an outcome from noisy observations. In nonparametric regression, one assumes that the regression function belongs to a pre-specified infinite-dimensional function space (the hypothesis space). In the online setting, when the observations come in a stream, it is computationally-preferable to iteratively update an estimate rather than refitting an entire model repeatedly. Inspired by nonparametric sieve estimation and stochastic approximation methods, we propose a sieve stochastic gradient descent estimator (Sieve-SGD) when the hypothesis space is a Sobolev ellipsoid. We show that Sieve-SGD has rate-optimal mean squared error (MSE) under a set of simple and direct conditions. The proposed estimator can be constructed with a low computational (time and space) expense: We also formally show that Sieve-SGD requires almost minimal memory usage among all statistically rate-optimal estimators.
Submitted 6 January, 2022; v1 submitted 1 April, 2021;
originally announced April 2021.
-
An Online Projection Estimator for Nonparametric Regression in Reproducing Kernel Hilbert Spaces
Authors:
Tianyu Zhang,
Noah Simon
Abstract:
The goal of nonparametric regression is to recover an underlying regression function from noisy observations, under the assumption that the regression function belongs to a pre-specified infinite dimensional function space. In the online setting, when the observations come in a stream, it is generally computationally infeasible to refit the whole model repeatedly. There are as of yet no methods that are both computationally efficient and statistically rate-optimal. In this paper, we propose an estimator for online nonparametric regression. Notably, our estimator is an empirical risk minimizer (ERM) in a deterministic linear space, which is quite different from existing methods using random features and functional stochastic gradient. Our theoretical analysis shows that this estimator obtains rate-optimal generalization error when the regression function is known to live in a reproducing kernel Hilbert space. We also show, theoretically and empirically, that the computational expense of our estimator is much lower than other rate-optimal estimators proposed for this online setting.
Submitted 1 April, 2021;
originally announced April 2021.
-
Planet Hunters TESS II: Findings from the first two years of TESS
Authors:
Nora L. Eisner,
Oscar Barragán,
Chris Lintott,
Suzanne Aigrain,
Belinda Nicholson,
Tabetha S. Boyajian,
Steve B. Howell,
Cole Johnston,
Ben Lakeland,
Grant Miller,
Adam McMaster,
Hannu Parviainen,
Emily J. Safron,
Megan E. Schwamb,
Laura Trouille,
Sophia Vaughan,
Norbert Zicher,
Campbell Allen,
Sarah Allen,
Mark Bouslog,
Cliff Johnson,
Molly N. Simon,
Zach Wolfenbarger,
Elisabeth M. L. Baeten,
David M. Bundy
, et al. (1 additional author not shown)
Abstract:
We present the results from the first two years of the Planet Hunters TESS citizen science project, which identifies planet candidates in the TESS data by engaging members of the general public. Over 22,000 citizen scientists from around the world visually inspected the first 26 Sectors of TESS data in order to help identify transit-like signals. We use a clustering algorithm to combine these classifications into a ranked list of events for each sector, the top 500 of which are then visually vetted by the science team. We assess the detection efficiency of this methodology by comparing our results to the list of TESS Objects of Interest (TOIs) and show that we recover 85 % of the TOIs with radii greater than 4 Earth radii and 51 % of those with radii between 3 and 4 Earth radii. Additionally, we present our 90 most promising planet candidates that had not previously been identified by other teams, 73 of which exhibit only a single transit event in the TESS light curve, and outline our efforts to follow these candidates up using ground-based observatories. Finally, we present noteworthy stellar systems that were identified through the Planet Hunters TESS project.
Submitted 27 November, 2020;
originally announced November 2020.
-
When to Impute? Imputation before and during cross-validation
Authors:
Byron C. Jaeger,
Nicholas J. Tierney,
Noah R. Simon
Abstract:
Cross-validation (CV) is a technique used to estimate generalization error for prediction models. For pipeline modeling algorithms (i.e. modeling procedures with multiple steps), it has been recommended the entire sequence of steps be carried out during each replicate of CV to mimic the application of the entire pipeline to an external testing set. While theoretically sound, following this recommendation can lead to high computational costs when a pipeline modeling algorithm includes computationally expensive operations, e.g. imputation of missing values. There is a general belief that unsupervised variable selection (i.e. ignoring the outcome) can be applied before conducting CV without incurring bias, but there is less consensus for unsupervised imputation of missing values. We empirically assessed whether conducting unsupervised imputation prior to CV would result in biased estimates of generalization error or result in poorly selected tuning parameters and thus degrade the external performance of downstream models. Results show that despite optimistic bias, the reduced variance of imputation before CV compared to imputation during each replicate of CV leads to a lower overall root mean squared error for estimation of the true external R-squared and the performance of models tuned using CV with imputation before versus during each replication is minimally different. In conclusion, unsupervised imputation before CV appears valid in certain settings and may be a helpful strategy that enables analysts to use more flexible imputation techniques without incurring high computational costs.
Submitted 1 October, 2020;
originally announced October 2020.
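
The two strategies compared in the abstract above can be contrasted with a small sketch. It is an illustration only, not the authors' experimental setup: it assumes simple mean imputation, a toy column with missing values marked as `None`, and hypothetical helper names (`mean_impute_fit`, `impute`, `kfold_indices`).

```python
def mean_impute_fit(column):
    """Return the mean of the observed (non-None) values."""
    observed = [v for v in column if v is not None]
    return sum(observed) / len(observed)


def impute(column, fill):
    """Replace every missing value with the given fill value."""
    return [fill if v is None else v for v in column]


def kfold_indices(n, k):
    """Assign index i to test fold i % k (interleaved folds, no shuffling)."""
    return [list(range(i, n, k)) for i in range(k)]


# Toy predictor with two missing entries.
x = [1.0, None, 3.0, 5.0, None, 9.0]

# Strategy 1: impute once on the full data, before cross-validation.
fill_before = mean_impute_fit(x)
x_before = impute(x, fill_before)

# Strategy 2: re-estimate the fill value inside each CV replicate,
# using only that replicate's training rows.
fills_during = []
for test_idx in kfold_indices(len(x), 3):
    train = [x[i] for i in range(len(x)) if i not in test_idx]
    fills_during.append(mean_impute_fit(train))

print(fill_before, fills_during)
```

Strategy 1 runs the imputation once, while strategy 2 repeats it in every replicate. That repetition is the computational cost the abstract refers to when the imputation step is expensive.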
-
Ensembled sparse-input hierarchical networks for high-dimensional datasets
Authors:
Jean Feng,
Noah Simon
Abstract:
Neural networks have seen limited use in prediction for high-dimensional data with small sample sizes, because they tend to overfit and require tuning many more hyperparameters than existing off-the-shelf machine learning methods. With small modifications to the network architecture and training procedure, we show that dense neural networks can be a practical data analysis tool in these settings. The proposed method, Ensemble by Averaging Sparse-Input Hierarchical networks (EASIER-net), appropriately prunes the network structure by tuning only two L1-penalty parameters, one that controls the input sparsity and another that controls the number of hidden layers and nodes. The method selects variables from the true support if the irrelevant covariates are only weakly correlated with the response; otherwise, it exhibits a grouping effect, where strongly correlated covariates are selected at similar rates. On a collection of real-world datasets with different sizes, EASIER-net selected network architectures in a data-adaptive manner and achieved higher prediction accuracy than off-the-shelf methods on average.
Submitted 10 May, 2020;
originally announced May 2020.
-
Spatial Matrix Completion for Spatially-Misaligned and High-Dimensional Air Pollution Data
Authors:
Phuong T. Vu,
Adam A. Szpiro,
Noah Simon
Abstract:
In health-pollution cohort studies, accurate predictions of pollutant concentrations at new locations are needed, since the locations of fixed monitoring sites and study participants are often spatially misaligned. For multi-pollution data, principal component analysis (PCA) is often incorporated to obtain low-rank (LR) structure of the data prior to spatial prediction. Recently developed predictive PCA modifies the traditional algorithm to improve the overall predictive performance by leveraging both LR and spatial structures within the data. However, predictive PCA requires complete data or an initial imputation step. Nonparametric imputation techniques without accounting for spatial information may distort the underlying structure of the data, and thus further reduce the predictive performance. We propose a convex optimization problem inspired by the LR matrix completion framework and develop a proximal algorithm to solve it. Missing data are imputed and handled concurrently within the algorithm, which eliminates the necessity of a separate imputation step. We show that our algorithm has low computational burden and leads to reliable predictive performance as the severity of missing data increases.
Submitted 21 January, 2022; v1 submitted 11 April, 2020;
originally announced April 2020.
-
A general framework for inference on algorithm-agnostic variable importance
Authors:
Brian D. Williamson,
Peter B. Gilbert,
Noah R. Simon,
Marco Carone
Abstract:
In many applications, it is of interest to assess the relative contribution of features (or subsets of features) toward the goal of predicting a response -- in other words, to gauge the variable importance of features. Most recent work on variable importance assessment has focused on describing the importance of features within the confines of a given prediction algorithm. However, such assessment does not necessarily characterize the prediction potential of features, and may provide a misleading reflection of the intrinsic value of these features. To address this limitation, we propose a general framework for nonparametric inference on interpretable algorithm-agnostic variable importance. We define variable importance as a population-level contrast between the oracle predictiveness of all available features versus all features except those under consideration. We propose a nonparametric efficient estimation procedure that allows the construction of valid confidence intervals, even when machine learning techniques are used. We also outline a valid strategy for testing the null importance hypothesis. Through simulations, we show that our proposal has good operating characteristics, and we illustrate its use with data from a study of an antibody against HIV-1 infection.
Submitted 13 September, 2021; v1 submitted 7 April, 2020;
originally announced April 2020.
-
A flexible Bayesian framework to estimate age- and cause-specific child mortality over time from sample registration data
Authors:
Austin E Schumacher,
Tyler H McCormick,
Jon Wakefield,
Yue Chu,
Jamie Perin,
Francisco Villavicencio,
Noah Simon,
Li Liu
Abstract:
In order to implement disease-specific interventions in young age groups, policy makers in low- and middle-income countries require timely and accurate estimates of age- and cause-specific child mortality. High quality data is not available in settings where these interventions are most needed, but there is a push to create sample registration systems that collect detailed mortality information. Current methods that estimate mortality from this data employ multistage frameworks without rigorous statistical justification that separately estimate all-cause and cause-specific mortality and are not sufficiently adaptable to capture important features of the data. We propose a flexible Bayesian modeling framework to estimate age- and cause-specific child mortality from sample registration data. We provide a theoretical justification for the framework, explore its properties via simulation, and use it to estimate mortality trends using data from the Maternal and Child Health Surveillance System in China.
Submitted 18 May, 2021; v1 submitted 29 February, 2020;
originally announced March 2020.
-
BigSurvSGD: Big Survival Data Analysis via Stochastic Gradient Descent
Authors:
Aliasghar Tarkhan,
Noah Simon
Abstract:
In many biomedical applications, outcome is measured as a "time-to-event" (e.g. disease progression or death). To assess the connection between features of a patient and this outcome, it is common to assume a proportional hazards model, and fit a proportional hazards regression (or Cox regression). To fit this model, a log-concave objective function known as the "partial likelihood" is maximized. For moderate-sized datasets, an efficient Newton-Raphson algorithm that leverages the structure of the objective can be employed. However, in large datasets this approach has two issues: 1) The computational tricks that leverage structure can also lead to computational instability; 2) The objective does not naturally decouple: Thus, if the dataset does not fit in memory, the model can be very computationally expensive to fit. This additionally means that the objective is not directly amenable to stochastic gradient-based optimization methods. To overcome these issues, we propose a simple, new framing of proportional hazards regression: This results in an objective function that is amenable to stochastic gradient descent. We show that this simple modification allows us to efficiently fit survival models with very large datasets. This also facilitates training complex, e.g. neural-network-based, models with survival data.
Submitted 9 August, 2020; v1 submitted 28 February, 2020;
originally announced March 2020.
-
Approval policies for modifications to Machine Learning-Based Software as a Medical Device: A study of bio-creep
Authors:
Jean Feng,
Scott Emerson,
Noah Simon
Abstract:
Successful deployment of machine learning algorithms in healthcare requires careful assessments of their performance and safety. To date, the FDA approves locked algorithms prior to marketing and requires future updates to undergo separate premarket reviews. However, this negates a key feature of machine learning--the ability to learn from a growing dataset and improve over time. This paper frames the design of an approval policy, which we refer to as an automatic algorithmic change protocol (aACP), as an online hypothesis testing problem. As this process has obvious analogy with noninferiority testing of new drugs, we investigate how repeated testing and adoption of modifications might lead to gradual deterioration in prediction accuracy, also known as "biocreep" in the drug development literature. We consider simple policies that one might consider but do not necessarily offer any error-rate guarantees, as well as policies that do provide error-rate control. For the latter, we define two online error-rates appropriate for this context: Bad Approval Count (BAC) and Bad Approval and Benchmark Ratios (BABR). We control these rates in the simple setting of a constant population and data source using policies aACP-BAC and aACP-BABR, which combine alpha-investing, group-sequential, and gate-keeping methods. In simulation studies, bio-creep regularly occurred when using policies with no error-rate guarantees, whereas aACP-BAC and -BABR controlled the rate of bio-creep without substantially impacting our ability to approve beneficial modifications.
Submitted 28 December, 2019;
originally announced December 2019.
-
Selective prediction-set models with coverage guarantees
Authors:
Jean Feng,
Arjun Sondhi,
Jessica Perry,
Noah Simon
Abstract:
Though black-box predictors are state-of-the-art for many complex tasks, they often fail to properly quantify predictive uncertainty and may provide inappropriate predictions for unfamiliar data. Instead, we can learn more reliable models by letting them either output a prediction set or abstain when the uncertainty is high. We propose training these selective prediction-set models using an uncertainty-aware loss minimization framework, which unifies ideas from decision theory and robust maximum likelihood. Moreover, since black-box methods are not guaranteed to output well-calibrated prediction sets, we show how to calculate point estimates and confidence intervals for the true coverage of any selective prediction-set model, as well as a uniform mixture of K set models obtained from K-fold sample-splitting. When applied to predicting in-hospital mortality and length-of-stay for ICU patients, our model outperforms existing approaches on both in-sample and out-of-sample age groups, and our recalibration method provides accurate inference for prediction set coverage.
Submitted 10 December, 2021; v1 submitted 13 June, 2019;
originally announced June 2019.
-
Using Propensity Scores to Develop and Evaluate Treatment Rules with Observational Data
Authors:
Jeremy Roth,
Noah Simon
Abstract:
In this paper, we outline a principled approach to estimate an individualized treatment rule that is appropriate for data from observational studies where, in addition to treatment assignment not being independent of individual characteristics, some characteristics may affect treatment assignment in the current study but not be available in future clinical settings where the estimated rule would be applied. The estimation framework is quite flexible and accommodates any prediction method that uses observation weights, where the observation weights themselves are a ratio of two flexibly estimated propensity scores. We also discuss how to obtain a trustworthy estimate of the rule's population benefit based on simple propensity-score-based estimators of average treatment effect. We implement our approach in the R package DevTreatRules and share the code needed to reproduce our results on GitHub.
Submitted 3 June, 2019; v1 submitted 29 May, 2019;
originally announced May 2019.
-
Estimation of cell lineage trees by maximum-likelihood phylogenetics
Authors:
Jean Feng,
William S DeWitt III,
Aaron McKenna,
Noah Simon,
Amy Willis,
Frederick A Matsen IV
Abstract:
CRISPR technology has enabled large-scale cell lineage tracing for complex multicellular organisms by mutating synthetic genomic barcodes during organismal development. However, these sophisticated biological tools currently use ad-hoc and outmoded computational methods to reconstruct the cell lineage tree from the mutated barcodes. Because these methods are agnostic to the biological mechanism, they are unable to take full advantage of the data's structure. We propose a statistical model for the mutation process and develop a procedure to estimate the tree topology, branch lengths, and mutation parameters by iteratively applying penalized maximum likelihood estimation. In contrast to existing techniques, our method estimates time along each branch, rather than number of mutation events, thus providing a detailed account of tissue-type differentiation. Via simulations, we demonstrate that our method is substantially more accurate than existing approaches. Our reconstructed trees also better recapitulate known aspects of zebrafish development and reproduce similar results across fish replicates.
Submitted 29 March, 2019;
originally announced April 2019.
-
An analysis of the cost of hyper-parameter selection via split-sample validation, with applications to penalized regression
Authors:
Jean Feng,
Noah Simon
Abstract:
In the regression setting, given a set of hyper-parameters, a model-estimation procedure constructs a model from training data. The optimal hyper-parameters that minimize generalization error of the model are usually unknown. In practice they are often estimated using split-sample validation. Up to now, there is an open question regarding how the generalization error of the selected model grows with the number of hyper-parameters to be estimated. To answer this question, we establish finite-sample oracle inequalities for selection based on a single training/test split and based on cross-validation. We show that if the model-estimation procedures are smoothly parameterized by the hyper-parameters, the error incurred from tuning hyper-parameters shrinks at nearly a parametric rate. Hence for semi- and non-parametric model-estimation procedures with a fixed number of hyper-parameters, this additional error is negligible. For parametric model-estimation procedures, adding a hyper-parameter is roughly equivalent to adding a parameter to the model itself. In addition, we specialize these ideas for penalized regression problems with multiple penalty parameters. We establish that the fitted models are Lipschitz in the penalty parameters and thus our oracle inequalities apply. This result encourages development of regularization methods with many penalty parameters.
Submitted 28 March, 2019;
originally announced March 2019.
-
Generalized Sparse Additive Models
Authors:
Asad Haris,
Noah Simon,
Ali Shojaie
Abstract:
We present a unified framework for estimation and analysis of generalized additive models in high dimensions. The framework defines a large class of penalized regression estimators, encompassing many existing methods. An efficient computational algorithm for this class is presented that easily scales to thousands of observations and features. We prove minimax optimal convergence bounds for this class under a weak compatibility condition. In addition, we characterize the rate of convergence when this compatibility condition is not met. Finally, we also show that the optimal penalty parameters for structure and sparsity penalties in our framework are linked, allowing cross-validation to be conducted over only a single tuning parameter. We complement our theoretical results with empirical studies comparing some existing methods within this framework.
Submitted 11 March, 2019;
originally announced March 2019.
-
Wavelet regression and additive models for irregularly spaced data
Authors:
Asad Haris,
Noah Simon,
Ali Shojaie
Abstract:
We present a novel approach for nonparametric regression using wavelet basis functions. Our proposal, $\texttt{waveMesh}$, can be applied to non-equispaced data with sample size not necessarily a power of 2. We develop an efficient proximal gradient descent algorithm for computing the estimator and establish adaptive minimax convergence rates. The main appeal of our approach is that it naturally extends to additive and sparse additive models for a potentially large number of covariates. We prove minimax optimal convergence rates under a weak compatibility condition for sparse additive models. The compatibility condition holds when we have a small number of covariates. Additionally, we establish convergence rates for when the condition is not met. We complement our theoretical results with empirical studies comparing $\texttt{waveMesh}$ to existing methods.
Submitted 11 March, 2019;
originally announced March 2019.
-
Sparse-Input Neural Networks for High-dimensional Nonparametric Regression and Classification
Authors:
Jean Feng,
Noah Simon
Abstract:
Neural networks are usually not the tool of choice for nonparametric high-dimensional problems where the number of input features is much larger than the number of observations. Though neural networks can approximate complex multivariate functions, they generally require a large number of training observations to obtain reasonable fits, unless one can learn the appropriate network structure. In this manuscript, we show that neural networks can be applied successfully to high-dimensional settings if the true function falls in a low dimensional subspace, and proper regularization is used. We propose fitting a neural network with a sparse group lasso penalty on the first-layer input weights. This results in a neural net that only uses a small subset of the original features. In addition, we characterize the statistical convergence of the penalized empirical risk minimizer to the optimal neural network: we show that the excess risk of this penalized estimator only grows with the logarithm of the number of input features; and we show that the weights of irrelevant features converge to zero. Via simulation studies and data analyses, we show that these sparse-input neural networks outperform existing nonparametric high-dimensional estimation methods when the data has complex higher-order interactions.
Submitted 21 June, 2019; v1 submitted 20 November, 2017;
originally announced November 2017.
-
Survival analysis of DNA mutation motifs with penalized proportional hazards
Authors:
Jean Feng,
David A. Shaw,
Vladimir N. Minin,
Noah Simon,
Frederick A. Matsen IV
Abstract:
Antibodies, an essential part of our immune system, develop through an intricate process to bind a wide array of pathogens. This process involves randomly mutating DNA sequences encoding these antibodies to find variants with improved binding, though mutations are not distributed uniformly across sequence sites. Immunologists observe this nonuniformity to be consistent with "mutation motifs", which are short DNA subsequences that affect how likely a given site is to experience a mutation. Quantifying the effect of motifs on mutation rates is challenging: a large number of possible motifs makes this statistical problem high dimensional, while the unobserved history of the mutation process leads to a nontrivial missing data problem. We introduce an $\ell_1$-penalized proportional hazards model to infer mutation motifs and their effects. In order to estimate model parameters, our method uses a Monte Carlo EM algorithm to marginalize over the unknown ordering of mutations. We show that our method performs better on simulated data compared to current methods and leads to more parsimonious models. The application of proportional hazards to mutation processes is, to our knowledge, novel and formalizes the current methods in a statistical framework that can be easily extended to analyze the effect of other biological features on mutation rates.
Submitted 21 September, 2018; v1 submitted 10 November, 2017;
originally announced November 2017.
-
Gradient-based Regularization Parameter Selection for Problems with Non-smooth Penalty Functions
Authors:
Jean Feng,
Noah Simon
Abstract:
In high-dimensional and/or non-parametric regression problems, regularization (or penalization) is used to control model complexity and induce desired structure. Each penalty has a weight parameter that indicates how strongly the structure corresponding to that penalty should be enforced. Typically the parameters are chosen to minimize the error on a separate validation set using a simple grid search or a gradient-free optimization method. It is more efficient to tune parameters if the gradient can be determined, but this is often difficult for problems with non-smooth penalty functions. Here we show that for many penalized regression problems, the validation loss is actually smooth almost-everywhere with respect to the penalty parameters. We can therefore apply a modified gradient descent algorithm to tune parameters. Through simulation studies on example regression problems, we find that increasing the number of penalty parameters and tuning them using our method can decrease the generalization error.
Submitted 28 March, 2017;
originally announced March 2017.
-
SCALPEL: Extracting Neurons from Calcium Imaging Data
Authors:
Ashley Petersen,
Noah Simon,
Daniela Witten
Abstract:
In the past few years, new technologies in the field of neuroscience have made it possible to simultaneously image activity in large populations of neurons at cellular resolution in behaving animals. In mid-2016, a huge repository of this so-called "calcium imaging" data was made publicly-available. The availability of this large-scale data resource opens the door to a host of scientific questions, for which new statistical methods must be developed.
In this paper, we consider the first step in the analysis of calcium imaging data: namely, identifying the neurons in a calcium imaging video. We propose a dictionary learning approach for this task. First, we perform image segmentation to develop a dictionary containing a huge number of candidate neurons. Next, we refine the dictionary using clustering. Finally, we apply the dictionary in order to select neurons and estimate their corresponding activity over time, using a sparse group lasso optimization problem. We apply our proposal to three calcium imaging data sets.
Our proposed approach is implemented in the R package scalpel, which is available on CRAN.
Submitted 20 March, 2017;
originally announced March 2017.
-
Rank conditional coverage and confidence intervals in high dimensional problems
Authors:
Jean Morrison,
Noah Simon
Abstract:
Confidence interval procedures used in low dimensional settings are often inappropriate for high dimensional applications. When a large number of parameters are estimated, marginal confidence intervals associated with the most significant estimates have very low coverage rates: They are too small and centered at biased estimates. The problem of forming confidence intervals in high dimensional settings has previously been studied through the lens of selection adjustment. In this framework, the goal is to control the proportion of non-covering intervals formed for selected parameters.
In this paper we approach the problem by considering the relationship between rank and coverage probability. Marginal confidence intervals have very low coverage rates for significant parameters and high rates for parameters with more boring estimates. Many selection adjusted intervals display the same pattern. This connection motivates us to propose a new coverage criterion for confidence intervals in multiple testing/covering problems: the rank conditional coverage (RCC). This is the expected coverage rate of an interval given the significance ranking for the associated estimator. We propose interval construction via bootstrapping, which produces small intervals that have a rank conditional coverage close to the nominal level. These methods are implemented in the R package rcc.
Submitted 22 February, 2017;
originally announced February 2017.
-
Nonparametric Regression with Adaptive Truncation via a Convex Hierarchical Penalty
Authors:
Asad Haris,
Ali Shojaie,
Noah Simon
Abstract:
We consider the problem of non-parametric regression with a potentially large number of covariates. We propose a convex, penalized estimation framework that is particularly well-suited for high-dimensional sparse additive models. The proposed approach combines appealing features of finite basis representation and smoothing penalties for non-parametric estimation. In particular, in the case of additive models, a finite basis representation provides a parsimonious representation for fitted functions but is not adaptive when component functions possess different levels of complexity. On the other hand, a smoothing spline type penalty on the component functions is adaptive but does not offer a parsimonious representation of the estimated function. The proposed approach simultaneously achieves parsimony and adaptivity in a computationally efficient framework. We demonstrate these properties through empirical studies on both real and simulated datasets. We show that our estimator converges at the minimax rate for functions within a hierarchical class. We further establish minimax rates for a large class of sparse additive models. The proposed method is implemented using an efficient algorithm that scales similarly to the Lasso with the number of covariates and sample size.
Submitted 18 June, 2019; v1 submitted 29 November, 2016;
originally announced November 2016.
-
Simultaneous detection and estimation of trait associations with genomic phenotypes
Authors:
Jean Morrison,
Noah Simon,
Daniela Witten
Abstract:
Genomic phenotypes, such as DNA methylation and chromatin accessibility, can be used to characterize the transcriptional and regulatory activity of DNA within a cell. Recent technological advances have made it possible to measure such phenotypes very densely. This density often results in spatial structure, in the sense that measurements at nearby sites are very similar.
In this paper, we consider the task of comparing genomic phenotypes across experimental conditions, cell types, or disease subgroups. We propose a new method, Joint Adaptive Differential Estimation (JADE), which leverages the spatial structure inherent to genomic phenotypes. JADE simultaneously estimates smooth underlying group average genomic phenotype profiles, and detects regions in which the average profile differs between groups. We evaluate JADE's performance in several biologically plausible simulation settings. We also consider an application to the detection of regions with differential methylation between mature skeletal muscle cells, myotubes and myoblasts.
Submitted 14 November, 2016;
originally announced November 2016.
-
Graphical Models for Zero-Inflated Single Cell Gene Expression
Authors:
Andrew McDavid,
Raphael Gottardo,
Noah Simon,
Mathias Drton
Abstract:
Bulk gene expression experiments relied on aggregations of thousands of cells to measure the average expression in an organism. Advances in microfluidic and droplet sequencing now permit expression profiling in single cells. This study of cell-to-cell variation reveals that individual cells lack detectable expression of transcripts that appear abundant on a population level, giving rise to zero-inflated expression patterns. To infer gene co-regulatory networks from such data, we propose a multivariate Hurdle model. It is composed of a mixture of singular Gaussian distributions. We employ neighborhood selection with the pseudo-likelihood and a group lasso penalty to select and fit undirected graphical models that capture conditional independences between genes. The proposed method is more sensitive than existing approaches in simulations, even under departures from our Hurdle model. The method is applied to data for T follicular helper cells, and a high-dimensional profile of mouse dendritic cells. It infers network structure not revealed by other methods or in bulk data sets. An R implementation is available at https://github.com/amcdavid/HurdleNormal .
Submitted 14 March, 2018; v1 submitted 18 October, 2016;
originally announced October 2016.
-
Graphical Models for Discrete and Continuous Data
Authors:
Rui Zhuang,
Noah Simon,
Johannes Lederer
Abstract:
We introduce a general framework for undirected graphical models. It generalizes Gaussian graphical models to a wide range of continuous, discrete, and combinations of different types of data. The models in the framework, called exponential trace models, are amenable to estimation based on maximum likelihood. We introduce a sampling-based approximation algorithm for computing the maximum likelihood estimator, and we apply this pipeline to learn simultaneous neural activities from spike data.
Submitted 15 June, 2019; v1 submitted 18 September, 2016;
originally announced September 2016.
-
Tracing Slow Winds from T Tauri Stars via Low Velocity Forbidden Line Emission
Authors:
M. N. Simon,
I. Pascucci,
S. Edwards,
W. Feng,
U. Gorti,
D. Hollenbach,
E. Rigliaco,
J. T. Keane
Abstract:
Using Keck/HIRES spectra (Δv ~ 7 km/s), we analyze forbidden lines of [O I] 6300 Å, [O I] 5577 Å and [S II] 6731 Å from 33 T Tauri stars covering a range of disk evolutionary stages. After removing a high velocity component (HVC) associated with microjets, we study the properties of the low velocity component (LVC). The LVC can be attributed to slow disk winds that could be magnetically (MHD) or thermally (photoevaporative) driven. Both of these winds play an important role in the evolution and dispersal of protoplanetary material.
LVC emission is seen in all 30 stars with detected [O I] but only in 2 out of 8 with detected [S II], so our analysis is largely based on the properties of the [O I] LVC. The LVC itself is resolved into broad (BC) and narrow (NC) kinematic components. Both components are found over a wide range of accretion rates and their luminosity is correlated with the accretion luminosity, but the NC is proportionately stronger than the BC in transition disks.
The FWHM of both the BC and NC correlates with disk inclination, consistent with Keplerian broadening from radii of 0.05 to 0.5 AU and 0.5 to 5 AU, respectively. The velocity centroids of the BC suggest formation in an MHD disk wind, with the largest blueshifts found in sources with closer to face-on orientations. The velocity centroids of the NC, however, show no dependence on disk inclination. The origin of this component is less clear and the evidence for photoevaporation is not conclusive.
Submitted 24 August, 2016;
originally announced August 2016.
-
Narrow Na and K Absorption Lines Toward T Tauri Stars - Tracing the Atomic Envelope of Molecular Clouds
Authors:
I. Pascucci,
S. Edwards,
M. Heyer,
E. Rigliaco,
L. Hillenbrand,
U. Gorti,
D. Hollenbach,
M. N. Simon
Abstract:
We present a detailed analysis of narrow NaI and KI absorption resonance lines toward nearly 40 T Tauri stars in Taurus with the goal of clarifying their origin. The NaI 5889.95 angstrom line is detected toward all but one source, while the weaker KI 7698.96 angstrom line is detected in about two thirds of the sample. The similarity in their peak centroids and the significant positive correlation between their equivalent widths demonstrate that these transitions trace the same atomic gas. The absorption lines are present towards both disk and diskless young stellar objects, which excludes cold gas within the circumstellar disk as the absorbing material. A comparison of NaI and CO detections and peak centroids demonstrates that the atomic and molecular gas are not co-located: the atomic gas is more extended than the molecular gas. The width of the atomic lines corroborates this finding and points to atomic gas about an order of magnitude warmer than the molecular gas. The distribution of NaI radial velocities shows a clear spatial gradient along the length of the Taurus molecular cloud filaments. This suggests that absorption is associated with the Taurus molecular cloud. Assuming the gradient is due to cloud rotation, the rotation of the atomic gas is consistent with differential galactic rotation while the rotation of the molecular gas, although with the same rotation axis, is retrograde. Our analysis shows that narrow NaI and KI absorption resonance lines are useful tracers of the atomic envelope of molecular clouds. In line with recent findings from giant molecular clouds, our results demonstrate that the velocity fields of the atomic and molecular gas are misaligned. The angular momentum of a molecular cloud is not simply inherited from the rotating Galactic disk from which it formed but may be redistributed by cloud-cloud interactions.
Submitted 7 October, 2015;
originally announced October 2015.
-
Convex Modeling of Interactions with Strong Heredity
Authors:
Asad Haris,
Daniela Witten,
Noah Simon
Abstract:
We consider the task of fitting a regression model involving interactions among a potentially large set of covariates, in which we wish to enforce strong heredity. We propose FAMILY, a very general framework for this task. Our proposal is a generalization of several existing methods, such as VANISH [Radchenko and James, 2010], hierNet [Bien et al., 2013], the all-pairs lasso, and the lasso using only main effects. It can be formulated as the solution to a convex optimization problem, which we solve using an efficient alternating directions method of multipliers (ADMM) algorithm. This algorithm has guaranteed convergence to the global optimum, can be easily specialized to any convex penalty function of interest, and allows for a straightforward extension to the setting of generalized linear models. We derive an unbiased estimator of the degrees of freedom of FAMILY, and explore its performance in a simulation study and on an HIV sequence data set.
Submitted 3 October, 2015; v1 submitted 13 October, 2014;
originally announced October 2014.
-
Fused Lasso Additive Model
Authors:
Ashley Petersen,
Daniela Witten,
Noah Simon
Abstract:
We consider the problem of predicting an outcome variable using $p$ covariates that are measured on $n$ independent observations, in the setting in which flexible and interpretable fits are desirable. We propose the fused lasso additive model (FLAM), in which each additive function is estimated to be piecewise constant with a small number of adaptively-chosen knots. FLAM is the solution to a convex optimization problem, for which a simple algorithm with guaranteed convergence to the global optimum is provided. FLAM is shown to be consistent in high dimensions, and an unbiased estimator of its degrees of freedom is proposed. We evaluate the performance of FLAM in a simulation study and on two data sets.
Submitted 18 September, 2014;
originally announced September 2014.
-
Selection Bias Correction and Effect Size Estimation under Dependence
Authors:
Kean Ming Tan,
Noah Simon,
Daniela Witten
Abstract:
We consider large-scale studies in which it is of interest to test a very large number of hypotheses, and then to estimate the effect sizes corresponding to the rejected hypotheses. For instance, this setting arises in the analysis of gene expression or DNA sequencing data. However, naive estimates of the effect sizes suffer from selection bias, i.e., some of the largest naive estimates are large due to chance alone. Many authors have proposed methods to reduce the effects of selection bias under the assumption that the naive estimates of the effect sizes are independent. Unfortunately, when the effect size estimates are dependent, these existing techniques can have very poor performance, and in practice there will often be dependence. We propose an estimator that adjusts for selection bias under a recently-proposed frequentist framework, without the independence assumption. We study some properties of the proposed estimator, and illustrate that it outperforms past proposals in a simulation study and on two gene expression data sets.
Submitted 28 March, 2015; v1 submitted 16 May, 2014;
originally announced May 2014.