Showing 1–50 of 66 results for author: Hastie, T

  1. arXiv:2406.08322  [pdf, other]

    q-bio.QM cs.LG stat.ME

    MMIL: A novel algorithm for disease associated cell type discovery

    Authors: Erin Craig, Timothy Keyes, Jolanda Sarno, Maxim Zaslavsky, Garry Nolan, Kara Davis, Trevor Hastie, Robert Tibshirani

    Abstract: Single-cell datasets often lack individual cell labels, making it challenging to identify cells associated with disease. To address this, we introduce Mixture Modeling for Multiple Instance Learning (MMIL), an expectation maximization method that enables the training and calibration of cell-level classifiers using patient-level labels. Our approach can be used to train e.g. lasso logistic regressi…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Erin Craig and Timothy Keyes contributed equally to this work

  2. arXiv:2405.08631  [pdf, other]

    stat.CO cs.LG cs.MS cs.SE

    A Fast and Scalable Pathwise-Solver for Group Lasso and Elastic Net Penalized Regression via Block-Coordinate Descent

    Authors: James Yang, Trevor Hastie

    Abstract: We develop fast and scalable algorithms based on block-coordinate descent to solve the group lasso and the group elastic net for generalized linear models along a regularization path. Special attention is given when the loss is the usual least squares loss (Gaussian loss). We show that each block-coordinate update can be solved efficiently using Newton's method and further improved using an adapti…

    Submitted 14 May, 2024; originally announced May 2024.

  3. arXiv:2404.17626  [pdf, other]

    cs.LG q-bio.QM stat.AP stat.CO

    Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank

    Authors: Thomas Le Menestrel, Erin Craig, Robert Tibshirani, Trevor Hastie, Manuel Rivas

    Abstract: Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals, underscoring a critical gap in genetic research. Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data. We evaluate the performance of Group-LASSO INTERaction-NET (glinternet) and p…

    Submitted 7 May, 2024; v1 submitted 26 April, 2024; originally announced April 2024.

  4. arXiv:2404.15017  [pdf, other]

    stat.ME

    The mosaic permutation test: an exact and nonparametric goodness-of-fit test for factor models

    Authors: Asher Spector, Rina Foygel Barber, Trevor Hastie, Ronald N. Kahn, Emmanuel Candès

    Abstract: Financial firms often rely on factor models to explain correlations among asset returns. These models are important for managing risk, for example by modeling the probability that many assets will simultaneously lose value. Yet after major events, e.g., COVID-19, analysts may reassess whether existing models continue to fit well: specifically, after accounting for the factor exposures, are the res…

    Submitted 23 April, 2024; originally announced April 2024.

    Comments: 38 pages, 13 figures

    MSC Class: 62H25 (Primary) 62G10; 62G09 (Secondary)

  5. arXiv:2310.19214  [pdf, other]

    stat.ML cs.LG cs.MS math.OC

    Factor Fitting, Rank Allocation, and Partitioning in Multilevel Low Rank Matrices

    Authors: Tetiana Parshakova, Trevor Hastie, Eric Darve, Stephen Boyd

    Abstract: We consider multilevel low rank (MLR) matrices, defined as a row and column permutation of a sum of matrices, each one a block diagonal refinement of the previous one, with all blocks low rank given in factored form. MLR matrices extend low rank matrices but share many of their properties, such as the total storage required and complexity of matrix-vector multiplication. We address three problems…

    Submitted 29 October, 2023; originally announced October 2023.

  6. arXiv:2307.12892  [pdf, other]

    stat.ME cs.DS cs.LG

    A Statistical View of Column Subset Selection

    Authors: Anav Sood, Trevor Hastie

    Abstract: We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as Column Subset Selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of Principal Variables. This paper shows that these two approaches are equiv…

    Submitted 24 July, 2023; originally announced July 2023.

  7. arXiv:2307.12378  [pdf, other]

    stat.ME

    Scalable solution to crossed random effects model with random slopes

    Authors: Disha Ghandwani, Swarnadip Ghosh, Trevor Hastie, Art B. Owen

    Abstract: The crossed random effects model is widely used, finding applications in various fields such as longitudinal studies, e-commerce, and recommender systems, among others. However, these models encounter scalability challenges, as the computational time for standard algorithms grows superlinearly with the number N of observations in the data set, commonly $\Omega(N^{3/2})$ or worse. Recent work has develo…

    Submitted 26 September, 2023; v1 submitted 23 July, 2023; originally announced July 2023.

  8. arXiv:2210.08721  [pdf, other]

    stat.ML cs.LG

    RbX: Region-based explanations of prediction models

    Authors: Ismael Lemhadri, Harrison H. Li, Trevor Hastie

    Abstract: We introduce region-based explanations (RbX), a novel, model-agnostic method to generate local explanations of scalar outputs from a black-box prediction model using only query access. RbX is based on a greedy algorithm for building a convex polytope that approximates a region of feature space where model predictions are close to the prediction at some target point. This region is fully specified…

    Submitted 16 October, 2022; originally announced October 2022.

    Comments: 13 pages, 4 figures

  9. arXiv:2202.09723  [pdf, other]

    stat.ME

    Smooth multi-period forecasting with application to prediction of COVID-19 cases

    Authors: Elena Tuzhilina, Trevor J. Hastie, Daniel J. McDonald, J. Kenneth Tay, Robert Tibshirani

    Abstract: Forecasting methodologies have always attracted a lot of attention and have become an especially hot topic since the beginning of the COVID-19 pandemic. In this paper we consider the problem of multi-period forecasting that aims to predict several horizons at once. We propose a novel approach that forces the prediction to be "smooth" across horizons and apply it to two tasks: point estimation via…

    Submitted 19 February, 2022; originally announced February 2022.

  10. arXiv:2201.11210  [pdf, other]

    stat.ME

    Confidence Intervals for the Generalisation Error of Random Forests

    Authors: Samyak Rajanala, Stephen Bates, Trevor Hastie, Robert Tibshirani

    Abstract: Out-of-bag error is commonly used as an estimate of generalisation error in ensemble-based learning models such as random forests. We present confidence intervals for this quantity using the delta-method-after-bootstrap and the jackknife-after-bootstrap techniques. These methods do not require growing any additional trees. We show that these new confidence intervals have improved coverage properti…

    Submitted 26 January, 2022; originally announced January 2022.

    Comments: 25 pages, 8 tables, 8 figures

  11. arXiv:2109.11057  [pdf, other]

    stat.ML cs.LG stat.ME

    Weighted Low Rank Matrix Approximation and Acceleration

    Authors: Elena Tuzhilina, Trevor Hastie

    Abstract: Low-rank matrix approximation is one of the central concepts in machine learning, with applications in dimension reduction, de-noising, multivariate statistical methodology, and many more. A recent extension to LRMA is called low-rank matrix completion (LRMC). It solves the LRMA problem when some observations are missing and is especially useful for recommender systems. In this paper, we consider…

    Submitted 22 September, 2021; originally announced September 2021.

  12. arXiv:2107.12713  [pdf, other]

    stat.ME

    LinCDE: Conditional Density Estimation via Lindsey's Method

    Authors: Zijun Gao, Trevor Hastie

    Abstract: Conditional density estimation is a fundamental problem in statistics, with scientific and practical applications in biology, economics, finance and environmental studies, to name a few. In this paper, we propose a conditional density estimator based on gradient boosting and Lindsey's method (LinCDE). LinCDE admits flexible modeling of the density family and can capture distributional characterist…

    Submitted 31 December, 2021; v1 submitted 27 July, 2021; originally announced July 2021.

    Comments: 50 pages, 20 figures

  13. arXiv:2105.13747  [pdf, other]

    stat.ME math.ST stat.CO

    Scalable logistic regression with crossed random effects

    Authors: Swarnadip Ghosh, Trevor Hastie, Art B. Owen

    Abstract: The cost of both generalized least squares (GLS) and Gibbs sampling in a crossed random effects model can easily grow faster than $N^{3/2}$ for $N$ observations. Ghosh et al. (2020) develop a backfitting algorithm that reduces the cost to $O(N)$. Here we extend that method to a generalized linear mixed model for logistic regression. We use backfitting within an iteratively reweighted penalized lea…

    Submitted 29 December, 2021; v1 submitted 28 May, 2021; originally announced May 2021.

    Comments: 32 pages, 5 figures

  14. arXiv:2104.00673  [pdf, other]

    stat.ME math.ST stat.CO stat.ML

    Cross-validation: what does it estimate and how well does it do it?

    Authors: Stephen Bates, Trevor Hastie, Robert Tibshirani

    Abstract: Cross-validation is a widely-used technique to estimate prediction error, but its behavior is complex and not fully understood. Ideally, one would like to think that cross-validation estimates the prediction error for the model at hand, fit to the training data. We prove that this is not the case for the linear model fit by ordinary least squares; rather it estimates the average prediction error o…

    Submitted 18 July, 2022; v1 submitted 1 April, 2021; originally announced April 2021.

  15. arXiv:2103.04277  [pdf, other]

    stat.ME

    Estimating Heterogeneous Treatment Effects for General Responses

    Authors: Zijun Gao, Trevor Hastie

    Abstract: Heterogeneous treatment effect models allow us to compare treatments at subgroup and individual levels, and are of increasing popularity in applications like personalized medicine, advertising, and education. In this talk, we first survey different causal estimands used in practice, which focus on estimating the difference in conditional means. We then propose DINA, the difference in natural param…

    Submitted 27 January, 2022; v1 submitted 7 March, 2021; originally announced March 2021.

  16. arXiv:2103.03475  [pdf, other]

    stat.CO stat.ME

    Elastic Net Regularization Paths for All Generalized Linear Models

    Authors: J. Kenneth Tay, Balasubramanian Narasimhan, Trevor Hastie

    Abstract: The lasso and elastic net are popular regularized regression models for supervised learning. Friedman, Hastie, and Tibshirani (2010) introduced a computationally efficient algorithm for computing the elastic net regularization path for ordinary least squares regression, logistic regression and multinomial logistic regression, while Simon, Friedman, Hastie, and Tibshirani (2011) extended this work…

    Submitted 5 March, 2021; originally announced March 2021.

  17. arXiv:2011.01650  [pdf, other]

    stat.ME

    Canonical Correlation Analysis in high dimensions with structured regularization

    Authors: Elena Tuzhilina, Leonardo Tozzi, Trevor Hastie

    Abstract: Canonical correlation analysis (CCA) is a technique for measuring the association between two multivariate data matrices. A regularized modification of canonical correlation analysis (RCCA) which imposes an $\ell_2$ penalty on the CCA coefficients is widely used in applications with high-dimensional data. One limitation of such regularization is that it ignores any data structure, treating all the…

    Submitted 29 July, 2021; v1 submitted 3 November, 2020; originally announced November 2020.

  18. arXiv:2010.02469  [pdf, other]

    cs.LG stat.CO stat.ML

    Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays

    Authors: Łukasz Kidziński, Francis K. C. Hui, David I. Warton, Trevor Hastie

    Abstract: Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Generalized Linear Latent Variable models (GLLVMs) ge…

    Submitted 27 January, 2022; v1 submitted 6 October, 2020; originally announced October 2020.

  19. arXiv:2009.12969  [pdf, other]

    cs.LG cs.IR cs.SI

    Simultaneous Relevance and Diversity: A New Recommendation Inference Approach

    Authors: Yifang Liu, Zhentao Xu, Qiyuan An, Yang Yi, Yanzhi Wang, Trevor Hastie

    Abstract: Relevance and diversity are both important to the success of recommender systems, as they help users to discover from a large pool of items a compact set of candidates that are not only interesting but exploratory as well. The challenge is that relevance and diversity usually act as two competing objectives in conventional recommender systems, which necessitates the classic trade-off between exploi…

    Submitted 27 September, 2020; originally announced September 2020.

    Comments: 9 pages

  20. arXiv:2007.10612  [pdf, other]

    stat.ME math.ST stat.CO

    Backfitting for large scale crossed random effects regressions

    Authors: Swarnadip Ghosh, Trevor Hastie, Art B. Owen

    Abstract: Regression models with crossed random effect errors can be very expensive to compute. The cost of both generalized least squares and Gibbs sampling can easily grow as $N^{3/2}$ (or worse) for $N$ observations. Papaspiliopoulos et al. (2020) present a collapsed Gibbs sampler that costs $O(N)$, but under an extremely stringent sampling model. We propose a backfitting algorithm to compute a generaliz…

    Submitted 18 March, 2021; v1 submitted 21 July, 2020; originally announced July 2020.

  21. arXiv:2006.01395  [pdf, other]

    stat.ME cs.LG stat.ML

    Feature-weighted elastic net: using "features of features" for better prediction

    Authors: J. Kenneth Tay, Nima Aghaeepour, Trevor Hastie, Robert Tibshirani

    Abstract: In some supervised learning settings, the practitioner might have additional information on the features used for prediction. We propose a new method which leverages this additional information for better prediction. The method, which we call the feature-weighted elastic net ("fwelnet"), uses these "features of features" to adapt the relative penalties on the feature coefficients in the elastic ne…

    Submitted 2 June, 2020; originally announced June 2020.

  22. arXiv:2006.00371  [pdf, other]

    stat.ME cs.LG stat.ML

    Ridge Regularization: an Essential Concept in Data Science

    Authors: Trevor Hastie

    Abstract: Ridge or more formally $\ell_2$ regularization shows up in many areas of statistics and machine learning. It is one of those essential devices that any good data scientist needs to master for their craft. In this brief ridge fest I have collected together some of the magic and beauty of ridge that my colleagues and I have encountered over the past 40 years in applied statistics.

    Submitted 30 May, 2020; originally announced June 2020.

    Comments: 17 pages, 5 figures. This paper was invited by Technometrics to appear in a special section to celebrate the 50th anniversary of the 1970 original ridge paper by Hoerl and Kennard

  23. arXiv:2003.03881  [pdf, other]

    stat.ME stat.AP

    Assessment of Heterogeneous Treatment Effect Estimation Accuracy via Matching

    Authors: Zijun Gao, Trevor Hastie, Robert Tibshirani

    Abstract: We study the assessment of the accuracy of heterogeneous treatment effect (HTE) estimation, where the HTE is not directly observable so standard computation of prediction errors is not applicable. To tackle the difficulty, we propose an assessment approach by constructing pseudo-observations of the HTE based on matching. Our contributions are three-fold: first, we introduce a novel matching distan…

    Submitted 8 March, 2020; originally announced March 2020.

  24. The importance of transparency and reproducibility in artificial intelligence research

    Authors: Benjamin Haibe-Kains, George Alexandru Adam, Ahmed Hosny, Farnoosh Khodakarami, MAQC Society Board, Levi Waldron, Bo Wang, Chris McIntosh, Anshul Kundaje, Casey S. Greene, Michael M. Hoffman, Jeffrey T. Leek, Wolfgang Huber, Alvis Brazma, Joelle Pineau, Robert Tibshirani, Trevor Hastie, John P. A. Ioannidis, John Quackenbush, Hugo J. W. L. Aerts

    Abstract: In their study, McKinney et al. showed the high potential of artificial intelligence for breast cancer screening. However, the lack of detailed methods and computer code undermines its scientific value. We identify obstacles hindering transparent and reproducible AI research as faced by McKinney et al and provide solutions with implications for the broader field.

    Submitted 7 March, 2020; v1 submitted 28 February, 2020; originally announced March 2020.

    Journal ref: Nature 586 (2020) E14-E16

  25. arXiv:1907.05074  [pdf, other]

    astro-ph.HE

    Gamma-ray Bursts as distance indicators through a machine learning approach

    Authors: Maria Dainotti, Vahé Petrosian, Malgorzata Bogdan, Blazej Miasojedow, Shigehiro Nagataki, Trevor Hastie, Zooey Nuyngen, Sankalp Gilda, Xavier Hernandez, Dominika Krol

    Abstract: Gamma-ray bursts (GRBs) are spectacularly energetic events, with the potential to inform on the early universe and its evolution, once their redshifts are known. Unfortunately, determining redshifts is a painstaking procedure requiring detailed follow-up multi-wavelength observations often involving various astronomical facilities, which have to be rapidly pointed at these serendipitous events. He…

    Submitted 11 July, 2019; originally announced July 2019.

    Comments: 22 pages, 5 figures to be submitted

  26. arXiv:1903.08560  [pdf, other]

    math.ST cs.LG stat.ML

    Surprises in High-Dimensional Ridgeless Least Squares Interpolation

    Authors: Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J. Tibshirani

    Abstract: Interpolators -- estimators that achieve zero training error -- have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum $\ell_2$ norm ("ridgeless") interpolation in high-dimensional least squares regression. We consider two different models for the feature distribution: a linear model, w…

    Submitted 7 December, 2020; v1 submitted 19 March, 2019; originally announced March 2019.

    Comments: 68 pages; 16 figures. This revision contains non-asymptotic version of earlier results, and results for general coefficients

  27. arXiv:1809.08771  [pdf, other]

    stat.ML cs.LG stat.AP stat.ME stat.OT

    Modeling longitudinal data using matrix completion

    Authors: Łukasz Kidziński, Trevor Hastie

    Abstract: In clinical practice and biomedical research, measurements are often collected sparsely and irregularly in time while the data acquisition is expensive and inconvenient. Examples include measurements of spine bone mineral density, cancer growth through mammography or biopsy, a progression of defective vision, or assessment of gait in patients with neurological disorders. Since the data collection…

    Submitted 3 August, 2021; v1 submitted 24 September, 2018; originally announced September 2018.

  28. arXiv:1711.00083  [pdf, other]

    stat.ML

    Synth-Validation: Selecting the Best Causal Inference Method for a Given Dataset

    Authors: Alejandro Schuler, Ken Jung, Robert Tibshirani, Trevor Hastie, Nigam Shah

    Abstract: Many decisions in healthcare, business, and other policy domains are made without the support of rigorous evidence due to the cost and complexity of performing randomized experiments. Using observational data to answer causal questions is risky: subjects who receive different treatments also differ in other ways that affect outcomes. Many causal inference methods have been developed to mitigate th…

    Submitted 31 October, 2017; originally announced November 2017.

  29. arXiv:1707.08692  [pdf, other]

    stat.ME stat.CO

    Extended Comparisons of Best Subset Selection, Forward Stepwise Selection, and the Lasso

    Authors: Trevor Hastie, Robert Tibshirani, Ryan J. Tibshirani

    Abstract: In exciting new work, Bertsimas et al. (2016) showed that the classical best subset selection problem in regression modeling can be formulated as a mixed integer optimization (MIO) problem. Using recent advances in MIO algorithms, they demonstrated that best subset selection can now be solved at much larger problem sizes than what was thought possible in the statistics community. They presented em…

    Submitted 29 July, 2017; v1 submitted 26 July, 2017; originally announced July 2017.

    Comments: 18 pages main paper, 34 pages supplement

  30. arXiv:1707.00102  [pdf, other]

    stat.ML

    Some methods for heterogeneous treatment effect estimation in high-dimensions

    Authors: Scott Powers, Junyang Qian, Kenneth Jung, Alejandro Schuler, Nigam H. Shah, Trevor Hastie, Robert Tibshirani

    Abstract: When devising a course of treatment for a patient, doctors often have little quantitative evidence on which to base their decisions, beyond their medical education and published clinical trials. Stanford Health Care alone has millions of electronic medical records (EMRs) that are only just recently being leveraged to inform better treatment recommendations. These data present a unique challenge be…

    Submitted 1 July, 2017; originally announced July 2017.

  31. arXiv:1706.10272  [pdf, other]

    stat.ML

    Nuclear penalized multinomial regression with an application to predicting at bat outcomes in baseball

    Authors: Scott Powers, Trevor Hastie, Robert Tibshirani

    Abstract: We propose the nuclear norm penalty as an alternative to the ridge penalty for regularized multinomial regression. This convex relaxation of reduced-rank multinomial regression has the advantage of leveraging underlying structure among the response categories to make better predictions. We apply our method, nuclear penalized multinomial regression (NPMR), to Major League Baseball play-by-play data…

    Submitted 30 June, 2017; originally announced June 2017.

  32. arXiv:1703.02081  [pdf, other]

    stat.CO

    Estimation and prediction in sparse and unbalanced tables

    Authors: Qingyuan Zhao, Trevor Hastie, Daryl Pregibon

    Abstract: We consider the problem where we have a multi-way table of means, indexed by several factors, where each factor can have a large number of levels. The entry in each cell is the mean of some response, averaged over the observations falling into that cell. Some cells may be very sparsely populated, and in extreme cases, not populated at all. We might still like to estimate an expected response in su…

    Submitted 6 March, 2017; originally announced March 2017.

    Comments: 14 pages, 1 figure

  33. arXiv:1609.06764  [pdf, other]

    stat.ML

    Saturating Splines and Feature Selection

    Authors: Nicholas Boyd, Trevor Hastie, Stephen Boyd, Benjamin Recht, Michael Jordan

    Abstract: We extend the adaptive regression spline model by incorporating saturation, the natural requirement that a function extend as a constant outside a certain range. We fit saturating splines to data using a convex optimization problem over a space of measures, which we solve using an efficient algorithm based on the conditional gradient method. Unlike many existing approaches, our algorithm solves th…

    Submitted 4 December, 2017; v1 submitted 21 September, 2016; originally announced September 2016.

    Comments: Adding missing references and related work

  34. arXiv:1601.07994  [pdf, ps, other]

    stat.AP q-bio.QM

    Customized training with an application to mass spectrometric imaging of cancer tissue

    Authors: Scott Powers, Trevor Hastie, Robert Tibshirani

    Abstract: We introduce a simple, interpretable strategy for making predictions on test data when the features of the test data are available at the time of model fitting. Our proposal - customized training - clusters the data to find training points close to each test point and then fits an $\ell_1$-regularized model (lasso) separately in each training cluster. This approach combines the local adaptivity of…

    Submitted 29 January, 2016; originally announced January 2016.

    Comments: Published at http://dx.doi.org/10.1214/15-AOAS866 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Report number: IMS-AOAS-AOAS866

    Journal ref: Annals of Applied Statistics 2015, Vol. 9, No. 4, 1709-1725

  35. arXiv:1511.06606  [pdf, other]

    cs.LG

    Data Representation and Compression Using Linear-Programming Approximations

    Authors: Hristo S. Paskov, John C. Mitchell, Trevor J. Hastie

    Abstract: We propose `Dracula', a new framework for unsupervised feature selection from sequential data such as text. Dracula learns a dictionary of $n$-grams that efficiently compresses a given corpus and recursively compresses its own dictionary; in effect, Dracula is a `deep' extension of Compressive Feature Learning. It requires solving a binary linear program that may be relaxed to a linear program. Bo…

    Submitted 2 May, 2016; v1 submitted 20 November, 2015; originally announced November 2015.

  36. arXiv:1509.05962  [pdf]

    stat.ML cs.AI cs.CV cs.LG cs.NE

    Telugu OCR Framework using Deep Learning

    Authors: Rakesh Achanta, Trevor Hastie

    Abstract: In this paper, we address the task of Optical Character Recognition (OCR) for the Telugu script. We present an end-to-end framework that segments the text image, classifies the characters and extracts lines using a language model. The segmentation is based on mathematical morphology. The classification module, which is the most challenging task of the three, is a deep convolutional neural network.…

    Submitted 14 February, 2017; v1 submitted 19 September, 2015; originally announced September 2015.

  37. arXiv:1508.04178  [pdf, other]

    stat.ME math.ST

    Confounder Adjustment in Multiple Hypothesis Testing

    Authors: Jingshu Wang, Qingyuan Zhao, Trevor Hastie, Art B. Owen

    Abstract: We consider large-scale studies in which thousands of significance tests are performed simultaneously. In some of these studies, the multiple testing procedure can be severely biased by latent confounding factors such as batch effects and unmeasured covariates that correlate with both primary variable(s) of interest (e.g. treatment variable, phenotype) and the outcome. Over the past decade, many s…

    Submitted 19 June, 2016; v1 submitted 17 August, 2015; originally announced August 2015.

    Comments: The first two authors contributed equally to this paper

    MSC Class: 62H25; 62J15

  38. arXiv:1506.03850  [pdf, other]

    stat.ML

    Generalized Additive Model Selection

    Authors: Alexandra Chouldechova, Trevor Hastie

    Abstract: We introduce GAMSEL (Generalized Additive Model Selection), a penalized likelihood approach for fitting sparse generalized additive models in high dimension. Our method interpolates between null, linear and additive models by allowing the effect of each variable to be estimated as being either zero, linear, or a low-complexity curve, as determined by the data. We present a blockwise coordinate des…

    Submitted 16 June, 2015; v1 submitted 11 June, 2015; originally announced June 2015.

    Comments: 23 pages, 10 figures

  39. arXiv:1410.2596  [pdf, other]

    stat.ME stat.ML

    Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares

    Authors: Trevor Hastie, Rahul Mazumder, Jason Lee, Reza Zadeh

    Abstract: The matrix-completion problem has attracted a lot of attention, largely as a result of the celebrated Netflix competition. Two popular approaches for solving the problem are nuclear-norm-regularized matrix approximation (Candes and Tao, 2009, Mazumder, Hastie and Tibshirani, 2010), and maximum-margin matrix factorization (Srebro, Rennie and Jaakkola, 2005). These two procedures are in some cases s…

    Submitted 9 October, 2014; originally announced October 2014.

  40. arXiv:1407.4543  [pdf, other]

    stat.ML stat.CO

    Sparse Quadratic Discriminant Analysis and Community Bayes

    Authors: Ya Le, Trevor Hastie

    Abstract: We develop a class of rules spanning the range between quadratic discriminant analysis and naive Bayes, through a path of sparse graphical models. A group lasso penalty is used to introduce shrinkage and encourage a similar pattern of sparsity across precision matrices. It gives sparse estimates of interactions and produces interpretable models. Inspired by the connected-components structure of th…

    Submitted 19 October, 2016; v1 submitted 16 July, 2014; originally announced July 2014.

    Comments: Revised version (adding more experiments)

  41. arXiv:1403.7274  [pdf, other]

    stat.AP

    Bias Correction in Species Distribution Models: Pooling Survey and Collection Data for Multiple Species

    Authors: William Fithian, Jane Elith, Trevor Hastie, David A. Keith

    Abstract: Presence-only records may provide data on the distributions of rare species, but commonly suffer from large, unknown biases due to their typically haphazard collection schemes. Presence-absence or count data collected in systematic, planned surveys are more reliable but typically less abundant. We proposed a probabilistic model to allow for joint analysis of presence-only and survey data to expl…

    Submitted 7 August, 2014; v1 submitted 27 March, 2014; originally announced March 2014.

  42. arXiv:1312.7851  [pdf, other]

    stat.OT stat.ME

    Effective Degrees of Freedom: A Flawed Metaphor

    Authors: Lucas Janson, William Fithian, Trevor Hastie

    Abstract: To most applied statisticians, a fitting procedure's degrees of freedom is synonymous with its model complexity, or its capacity for overfitting to data. In particular, it is often used to parameterize the bias-variance tradeoff in model selection. We argue that, contrary to folk intuition, model complexity and degrees of freedom are not synonymous and may correspond very poorly. We exhibit and th…

    Submitted 13 July, 2014; v1 submitted 30 December, 2013; originally announced December 2013.

  43. arXiv:1311.6529  [pdf, ps, other]

    stat.CO stat.ML

    A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression

    Authors: Noah Simon, Jerome Friedman, Trevor Hastie

    Abstract: In this paper we propose a blockwise descent algorithm for group-penalized multiresponse regression. Using a quasi-Newton framework we extend this to group-penalized multinomial regression. We give a publicly available implementation for these in R, and compare the speed of this algorithm to a competing algorithm --- we show that our implementation is an order of magnitude faster than its competit…

    Submitted 25 November, 2013; originally announced November 2013.

  44. arXiv:1311.4555  [pdf, other]

    stat.ML stat.CO stat.ME

    Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife

    Authors: Stefan Wager, Trevor Hastie, Bradley Efron

    Abstract: We study the variability of predictions made by bagged learners and random forests, and show how to estimate standard errors for these methods. Our work builds on variance estimates for bagging proposed by Efron (1992, 2012) that are based on the jackknife and the infinitesimal jackknife (IJ). In practice, bagged predictors are computed using a finite number B of bootstrap replicates, and working…

    Submitted 28 March, 2014; v1 submitted 18 November, 2013; originally announced November 2013.

    Comments: To appear in Journal of Machine Learning Research (JMLR)

  45. arXiv:1308.2719  [pdf, other]

    stat.ME

    Learning interactions through hierarchical group-lasso regularization

    Authors: Michael Lim, Trevor Hastie

    Abstract: We introduce a method for learning pairwise interactions in a manner that satisfies strong hierarchy: whenever an interaction is estimated to be nonzero, both its associated main effects are also included in the model. We motivate our approach by modeling pairwise interactions for categorical variables with arbitrary numbers of levels, and then show how we can accommodate continuous variables and…

    Submitted 12 August, 2013; originally announced August 2013.

    Comments: 35 pages, about 9 figures

  46. arXiv:1306.3706  [pdf, ps, other]

    stat.CO stat.ML

    Local case-control sampling: Efficient subsampling in imbalanced data sets

    Authors: William Fithian, Trevor Hastie

    Abstract: For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot est…

    Submitted 23 September, 2014; v1 submitted 16 June, 2013; originally announced June 2013.

    Comments: Published at http://dx.doi.org/10.1214/14-AOS1220 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Report number: IMS-AOS-AOS1220

    Journal ref: Annals of Statistics 2014, Vol. 42, No. 5, 1693-1724

  47. arXiv:1302.2303  [pdf, other]

    stat.ME

    False Variable Selection Rates in Regression

    Authors: Max Grazier G'Sell, Trevor Hastie, Robert Tibshirani

    Abstract: There has been recent interest in extending the ideas of False Discovery Rates (FDR) to variable selection in regression settings. Traditionally the FDR in these settings has been defined in terms of the coefficients of the full regression model. Recent papers have struggled with controlling this quantity when the predictors are correlated. This paper shows that this full model definition of FDR s…

    Submitted 10 February, 2013; originally announced February 2013.

    Comments: 14 figures, 21 pages. Submitted to Annals of Applied Statistics

  48. Finite-sample equivalence in statistical models for presence-only data

    Authors: William Fithian, Trevor Hastie

    Abstract: Statistical modeling of presence-only data has attracted much recent attention in the ecological literature, leading to a proliferation of methods, including the inhomogeneous Poisson process (IPP) model, maximum entropy (Maxent) modeling of species distributions and logistic regression models. Several recent articles have shown the close relationships between these methods. We explain why the IPP…

    Submitted 8 January, 2014; v1 submitted 30 July, 2012; originally announced July 2012.

    Comments: Published at http://dx.doi.org/10.1214/13-AOAS667 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Report number: IMS-AOAS-AOAS667

    Journal ref: Annals of Applied Statistics 2013, Vol. 7, No. 4, 1917-1939

  49. arXiv:1205.5012  [pdf, other]

    stat.ML cs.CV cs.LG math.OC

    Learning Mixed Graphical Models

    Authors: Jason D. Lee, Trevor J. Hastie

    Abstract: We consider the problem of learning the structure of a pairwise graphical model over continuous and discrete variables. We present a new pairwise model for graphical models with both continuous and discrete variables that is amenable to structure learning. In previous work, authors have considered structure learning of Gaussian graphical models and structure learning of discrete models. Our…

    Submitted 3 July, 2013; v1 submitted 22 May, 2012; originally announced May 2012.

  50. arXiv:1111.5479  [pdf, ps, other]

    stat.ML cs.LG

    The Graphical Lasso: New Insights and Alternatives

    Authors: Rahul Mazumder, Trevor Hastie

    Abstract: The graphical lasso \citep{FHT2007a} is an algorithm for learning the structure in an undirected Gaussian graphical model, using $\ell_1$ regularization to control the number of zeros in the precision matrix $\boldsymbol{\Theta}=\boldsymbol{\Sigma}^{-1}$ \citep{BGA2008,yuan_lin_07}. The R package glasso \citep{FHT2007a} is popular, fast, and allows one to efficiently build a path of models for different values of the…

    Submitted 7 August, 2012; v1 submitted 23 November, 2011; originally announced November 2011.

    Comments: This is a revised version of our previous manuscript with the same name ArXiv id: http://arxiv.longhoe.net/abs/1111.5479