Showing 1–50 of 109 results for author: Yu, B

Searching in archive stat.
  1. arXiv:2406.19958  [pdf, other]

    stat.ML cs.LG math.ST

    The Computational Curse of Big Data for Bayesian Additive Regression Trees: A Hitting Time Analysis

    Authors: Yan Shuo Tan, Omer Ronen, Theo Saarinen, Bin Yu

    Abstract: Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression model that is commonly used in causal inference and beyond. Its strong predictive performance is supported by theoretical guarantees that its posterior distribution concentrates around the true regression function at optimal rates under various data generative settings and for appropriate prior choices. In th…

    Submitted 28 June, 2024; originally announced June 2024.

    MSC Class: 62G08; 65C40

  2. arXiv:2406.09657  [pdf, other]

    cs.LG stat.ML

    ScaLES: Scalable Latent Exploration Score for Pre-Trained Generative Networks

    Authors: Omer Ronen, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk, Bin Yu

    Abstract: We develop Scalable Latent Exploration Score (ScaLES) to mitigate over-exploration in Latent Space Optimization (LSO), a popular method for solving black-box discrete optimization problems. LSO utilizes continuous optimization within the latent space of a Variational Autoencoder (VAE) and is known to be susceptible to over-exploration, which manifests in unrealistic solutions that reduce its pract…

    Submitted 13 June, 2024; originally announced June 2024.

  3. arXiv:2406.08447  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    The Impact of Initialization on LoRA Finetuning Dynamics

    Authors: Soufiane Hayou, Nikhil Ghosh, Bin Yu

    Abstract: In this paper, we study the role of initialization in Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021). Essentially, to start from the pretrained model as initialization for finetuning, one can either initialize B to zero and A to random (default initialization in PEFT package), or vice-versa. In both cases, the product BA is equal to zero at initialization, which makes fine…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: TLDR: Different initializations lead to completely different finetuning dynamics. One initialization (set A random and B zero) is generally better than the natural opposite initialization. arXiv admin note: text overlap with arXiv:2402.12354

  4. arXiv:2406.01252  [pdf, other]

    cs.CL cs.AI stat.ML

    Towards Scalable Automated Alignment of LLMs: A Survey

    Authors: Boxi Cao, Keming Lu, Xinyu Lu, Jiawei Chen, Mengjie Ren, Hao Xiang, Peilin Liu, Yaojie Lu, Ben He, Xianpei Han, Le Sun, Hongyu Lin, Bowen Yu

    Abstract: Alignment is the most critical step in building large language models (LLMs) that meet human needs. With the rapid development of LLMs gradually surpassing human capabilities, traditional alignment methods based on human-annotation are increasingly unable to meet the scalability demands. Therefore, there is an urgent need to explore new sources of automated alignment signals and technical approach…

    Submitted 3 June, 2024; originally announced June 2024.

  5. arXiv:2404.00522  [pdf, other]

    cs.LG stat.ML

    Minimum-Norm Interpolation Under Covariate Shift

    Authors: Neil Mallinar, Austin Zane, Spencer Frei, Bin Yu

    Abstract: Transfer learning is a critical part of real-world machine learning deployments and has been extensively studied in experimental works with overparameterized neural networks. However, even in the simplest setting of linear regression a notable gap still exists in the theoretical understanding of transfer learning. In-distribution research on high-dimensional linear regression has led to the identi…

    Submitted 30 March, 2024; originally announced April 2024.

  6. arXiv:2403.08971  [pdf, other]

    stat.CO

    Designing a Data Science simulation with MERITS: A Primer

    Authors: Corrine F Elliott, James Duncan, Tiffany M Tang, Merle Behr, Karl Kumbier, Bin Yu

    Abstract: Simulations play a crucial role in the modern scientific process. Yet despite (or due to) their ubiquity, the Data Science community shares neither a comprehensive definition for a "high-quality" study nor a consolidated guide to designing one. Inspired by the Predictability-Computability-Stability (PCS) framework for 'veridical' Data Science, we propose six MERITS that a Data Science simulation s…

    Submitted 13 March, 2024; originally announced March 2024.

    Comments: 26 pages (main text); 1 figure; 2 tables; *Authors contributed equally to this manuscript; **Authors contributed equally to this manuscript

  7. arXiv:2402.15926  [pdf, other]

    cs.LG stat.ML

    Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

    Authors: Jingfeng Wu, Peter L. Bartlett, Matus Telgarsky, Bin Yu

    Abstract: We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $η$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(η)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1 / (ηt) )$ convergence rate after $t$ additional steps.…

    Submitted 9 June, 2024; v1 submitted 24 February, 2024; originally announced February 2024.

    Comments: COLT 2024 camera ready

  8. arXiv:2402.12354  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    LoRA+: Efficient Low Rank Adaptation of Large Models

    Authors: Soufiane Hayou, Nikhil Ghosh, Bin Yu

    Abstract: In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021) leads to suboptimal finetuning of models with large width (embedding dimension). This is due to the fact that adapter matrices A and B in LoRA are updated with the same learning rate. Using scaling arguments for large width networks, we demonstrate that using the same learning rate for A and B does…

    Submitted 19 February, 2024; originally announced February 2024.

    Comments: 27 pages

  9. arXiv:2310.02533  [pdf, other]

    cs.LG stat.ML

    Quantifying and mitigating the impact of label errors on model disparity metrics

    Authors: Julius Adebayo, Melissa Hall, Bowen Yu, Bobbie Chern

    Abstract: Errors in labels obtained via human annotation adversely affect a model's performance. Existing approaches propose ways to mitigate the effect of label error on a model's downstream accuracy, yet little is known about its impact on a model's disparity metrics. Here we study the effect of label error on a model's disparity metrics. We empirically characterize how varying levels of label error, in b…

    Submitted 3 October, 2023; originally announced October 2023.

    Comments: Conference paper at ICLR 2023

  10. arXiv:2309.10301  [pdf, other]

    stat.ML cs.LG

    Prominent Roles of Conditionally Invariant Components in Domain Adaptation: Theory and Algorithms

    Authors: Keru Wu, Yuansi Chen, Wooseok Ha, Bin Yu

    Abstract: Domain adaptation (DA) is a statistical learning problem that arises when the distribution of the source data used to train a model differs from that of the target data used to evaluate the model. While many DA algorithms have demonstrated considerable empirical success, blindly applying these algorithms can often lead to worse performance on new datasets. To address this, it is crucial to clarify…

    Submitted 19 September, 2023; originally announced September 2023.

  11. arXiv:2308.16878  [pdf, other]

    stat.AP physics.app-ph

    On the Role of Non-Localities in Fundamental Diagram Estimation

    Authors: Jing Liu, Fangfang Zheng, Boxi Yu, Saif Jabari

    Abstract: We consider the role of non-localities in speed-density data used to fit fundamental diagrams from vehicle trajectories. We demonstrate that the use of anticipated densities results in a clear classification of speed-density data into stationary and non-stationary points, namely, acceleration and deceleration regimes and their separating boundary. The separating boundary represents a locus of stat…

    Submitted 31 August, 2023; originally announced August 2023.

  12. arXiv:2308.03215  [pdf, other]

    stat.ML cs.LG

    The Effect of SGD Batch Size on Autoencoder Learning: Sparsity, Sharpness, and Feature Learning

    Authors: Nikhil Ghosh, Spencer Frei, Wooseok Ha, Bin Yu

    Abstract: In this work, we investigate the dynamics of stochastic gradient descent (SGD) when training a single-neuron autoencoder with linear or ReLU activation on orthogonal data. We show that for this non-convex problem, randomly initialized SGD with a constant step size successfully finds a global minimum for any batch size choice. However, the particular global minimum found depends upon the batch size…

    Submitted 6 August, 2023; originally announced August 2023.

  13. arXiv:2307.01932  [pdf, other]

    stat.ME cs.AI cs.LG stat.ML

    MDI+: A Flexible Random Forest-Based Feature Importance Framework

    Authors: Abhineet Agarwal, Ana M. Kenney, Yan Shuo Tan, Tiffany M. Tang, Bin Yu

    Abstract: Mean decrease in impurity (MDI) is a popular feature importance measure for random forests (RFs). We show that the MDI for a feature $X_k$ in each tree in an RF is equivalent to the unnormalized $R^2$ value in a linear regression of the response on the collection of decision stumps that split on $X_k$. We use this interpretation to propose a flexible feature importance framework called MDI+. Speci…

    Submitted 4 July, 2023; originally announced July 2023.

  14. arXiv:2307.00190  [pdf]

    stat.AP

    Estimands in Real-World Evidence Studies

    Authors: Jie Chen, Daniel Scharfstein, Hongwei Wang, Binbing Yu, Yang Song, Weili He, John Scott, Xiwu Lin, Hana Lee

    Abstract: A Real-World Evidence (RWE) Scientific Working Group (SWG) of the American Statistical Association Biopharmaceutical Section (ASA BIOP) has been reviewing statistical considerations for the generation of RWE to support regulatory decision-making. As part of the effort, the working group is addressing estimands in RWE studies. Constructing the right estimand -- the target of estimation -- which ref…

    Submitted 30 June, 2023; originally announced July 2023.

  15. arXiv:2210.09352  [pdf, other]

    stat.ML cs.AI cs.LG math.ST

    A Mixing Time Lower Bound for a Simplified Version of BART

    Authors: Omer Ronen, Theo Saarinen, Yan Shuo Tan, James Duncan, Bin Yu

    Abstract: Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression algorithm. The posterior is a distribution over sums of decision trees, and predictions are made by averaging approximate samples from the posterior. The combination of strong predictive performance and the ability to provide uncertainty measures has led BART to be commonly used in the social sciences, bios…

    Submitted 17 October, 2022; originally announced October 2022.

  16. arXiv:2207.14481  [pdf, other]

    econ.EM stat.ME

    Same Root Different Leaves: Time Series and Cross-Sectional Methods in Panel Data

    Authors: Dennis Shen, Peng Ding, Jasjeet Sekhon, Bin Yu

    Abstract: A central goal in social science is to evaluate the causal effect of a policy. One dominant approach is through panel data analysis in which the behaviors of multiple units are observed over time. The information across time and space motivates two general approaches: (i) horizontal regression (i.e., unconfoundedness), which exploits time series patterns, and (ii) vertical regression (e.g., synthe…

    Submitted 8 October, 2022; v1 submitted 29 July, 2022; originally announced July 2022.

  17. arXiv:2205.15135  [pdf, other]

    cs.LG cs.AI stat.AP stat.ME stat.ML

    Group Probability-Weighted Tree Sums for Interpretable Modeling of Heterogeneous Data

    Authors: Keyan Nasseri, Chandan Singh, James Duncan, Aaron Kornblith, Bin Yu

    Abstract: Machine learning in high-stakes domains, such as healthcare, faces two critical challenges: (1) generalizing to diverse data distributions given limited training data while (2) maintaining interpretability. To address these challenges, we propose an instance-weighted tree-sum method that effectively pools data across diverse groups to output a concise, rule-based model. Given distinct groups of in…

    Submitted 30 May, 2022; originally announced May 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2201.11931

  18. arXiv:2202.00858  [pdf, other]

    cs.LG cs.AI stat.AP stat.ME stat.ML

    Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods

    Authors: Abhineet Agarwal, Yan Shuo Tan, Omer Ronen, Chandan Singh, Bin Yu

    Abstract: Tree-based models such as decision trees and random forests (RF) are a cornerstone of modern machine-learning practice. To mitigate overfitting, trees are typically regularized by a variety of techniques that modify their structure (e.g. pruning). We introduce Hierarchical Shrinkage (HS), a post-hoc algorithm that does not modify the tree structure, and instead regularizes the tree by shrinking th…

    Submitted 1 February, 2022; originally announced February 2022.

  19. arXiv:2201.11931  [pdf, other]

    cs.LG cs.AI stat.AP stat.ME stat.ML

    Fast Interpretable Greedy-Tree Sums

    Authors: Yan Shuo Tan, Chandan Singh, Keyan Nasseri, Abhineet Agarwal, James Duncan, Omer Ronen, Matthew Epland, Aaron Kornblith, Bin Yu

    Abstract: Modern machine learning has achieved impressive prediction performance, but often sacrifices interpretability, a critical consideration in high-stakes domains such as medicine. In such settings, practitioners often use highly interpretable decision tree models, but these suffer from inductive bias against additive structure. To overcome this bias, we propose Fast Interpretable Greedy-Tree Sums (FI…

    Submitted 8 July, 2023; v1 submitted 27 January, 2022; originally announced January 2022.

  20. arXiv:2111.10734  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Deep Probability Estimation

    Authors: Sheng Liu, Aakash Kaku, Weicheng Zhu, Matan Leibovich, Sreyas Mohan, Boyang Yu, Haoxiang Huang, Laure Zanna, Narges Razavian, Jonathan Niles-Weed, Carlos Fernandez-Granda

    Abstract: Reliable probability estimation is of crucial importance in many real-world applications where there is inherent (aleatoric) uncertainty. Probability-estimation models are trained on observed outcomes (e.g. whether it has rained or not, or whether a patient has died or not), because the ground-truth probabilities of the events of interest are typically unknown. The problem is therefore analogous t…

    Submitted 11 October, 2022; v1 submitted 20 November, 2021; originally announced November 2021.

    Comments: SL, AK, WZ, ML, SM contributed equally to this work; 36 pages, 17 figures, 12 tables

    Journal ref: Proceedings of the 39th International Conference on Machine Learning, PMLR 162:13746-13781, 2022

  21. arXiv:2111.07167  [pdf, other]

    stat.ML cs.LG math.ST

    The Three Stages of Learning Dynamics in High-Dimensional Kernel Methods

    Authors: Nikhil Ghosh, Song Mei, Bin Yu

    Abstract: To understand how deep learning works, it is crucial to understand the training dynamics of neural networks. Several interesting hypotheses about these dynamics have been made based on empirically observed phenomena, but there exists a limited theoretical understanding of when and why such phenomena occur. In this paper, we consider the training dynamics of gradient flow on kernel least-squares…

    Submitted 13 November, 2021; originally announced November 2021.

  22. arXiv:2110.09626  [pdf, other]

    stat.ML cs.IT cs.LG

    A cautionary tale on fitting decision trees to data from additive models: generalization lower bounds

    Authors: Yan Shuo Tan, Abhineet Agarwal, Bin Yu

    Abstract: Decision trees are important both as interpretable models amenable to high-stakes decision-making, and as building blocks of ensemble methods such as random forests and gradient boosting. Their statistical properties, however, are not well understood. The most cited prior works have focused on deriving pointwise consistency guarantees for CART in a classical nonparametric regression setting. We ta…

    Submitted 18 October, 2021; originally announced October 2021.

  23. arXiv:2110.08634  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Towards Robust Waveform-Based Acoustic Models

    Authors: Dino Oglic, Zoran Cvetkovic, Peter Sollich, Steve Renals, Bin Yu

    Abstract: We study the problem of learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. First, we characterize data augmentation theoretically as an instance of vicinal risk minimization, wh…

    Submitted 29 June, 2022; v1 submitted 16 October, 2021; originally announced October 2021.

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022

  24. arXiv:2108.08445  [pdf, ps, other]

    stat.AP

    Seven Principles for Rapid-Response Data Science: Lessons Learned from Covid-19 Forecasting

    Authors: Bin Yu, Chandan Singh

    Abstract: In this article, we take a step back to distill seven principles out of our experience in the spring of 2020, when our 12-person rapid-response team used skills of data science and beyond to help distribute Covid PPE. This process included tapping into domain knowledge of epidemiology and medical logistics chains, curating a relevant data repository, developing models for short-term county-level d…

    Submitted 29 March, 2022; v1 submitted 18 August, 2021; originally announced August 2021.

    Comments: 4 pages, accepted in special issue of "Statistical Science" on COVID-19 Response

  25. arXiv:2108.06847  [pdf, other]

    stat.ML cs.LG

    Interpreting and improving deep-learning models with reality checks

    Authors: Chandan Singh, Wooseok Ha, Bin Yu

    Abstract: Recent deep-learning models have achieved impressive predictive performance by learning complex functions of many variables, often at the cost of interpretability. This chapter covers recent work aiming to interpret models by attributing importance to features and feature groups for a single prediction. Importantly, the proposed attributions assign importance to interactions between features, in a…

    Submitted 18 August, 2021; v1 submitted 15 August, 2021; originally announced August 2021.

  26. arXiv:2108.02422  [pdf]

    stat.AP

    Divergent Effects of Factors on Crashes under Autonomous and Conventional Driving Modes Using A Hierarchical Bayesian Approach

    Authors: Weixi Ren, Bo Yu, Yuren Chen, Kun Gao, Shan Bao

    Abstract: Influencing factors on crashes involved with autonomous vehicles (AVs) have been paid increasing attention. However, there is a lack of comparative analyses between influencing factors on crashes of AVs and human-driven vehicles. To fill this research gap, the study aims to explore the divergent effects of factors on crashes under autonomous and conventional driving modes. This study obtained 154…

    Submitted 7 April, 2022; v1 submitted 5 August, 2021; originally announced August 2021.

    Comments: 42 pages, 10 figures

    MSC Class: 62P30; ACM Class: G.3.1

  27. arXiv:2107.09145  [pdf, other]

    stat.ML cs.LG

    Adaptive wavelet distillation from neural networks through interpretations

    Authors: Wooseok Ha, Chandan Singh, Francois Lanusse, Srigokul Upadhyayula, Bin Yu

    Abstract: Recent deep-learning models have achieved impressive prediction performance, but often sacrifice interpretability and computational efficiency. Interpretability is crucial in many disciplines, such as science and medicine, where models must be carefully vetted or where interpretation is the goal itself. Moreover, interpretable models are concise and often yield computational efficiency. Here, we p…

    Submitted 26 August, 2021; v1 submitted 19 July, 2021; originally announced July 2021.

  28. arXiv:2106.02096  [pdf, ps, other]

    stat.ML cs.LG

    Shape-Preserving Dimensionality Reduction : An Algorithm and Measures of Topological Equivalence

    Authors: Byeongsu Yu, Kisung You

    Abstract: We introduce a linear dimensionality reduction technique preserving topological features via persistent homology. The method is designed to find linear projection $L$ which preserves the persistent diagram of a point cloud $\mathbb{X}$ via simulated annealing. The projection $L$ induces a set of canonical simplicial maps from the Rips (or Čech) filtration of $\mathbb{X}$ to that of $L\mathbb{X}$.…

    Submitted 13 June, 2021; v1 submitted 3 June, 2021; originally announced June 2021.

    Comments: 18 pages, 2 figures

  29. arXiv:2011.06593  [pdf, other]

    q-bio.QM stat.AP

    A stability-driven protocol for drug response interpretable prediction (staDRIP)

    Authors: Xiao Li, Tiffany M. Tang, Xuewei Wang, Jean-Pierre A. Kocher, Bin Yu

    Abstract: Modern cancer -omics and pharmacological data hold great promise in precision cancer medicine for developing individualized patient treatments. However, high heterogeneity and noise in such data pose challenges for predicting the response of cancer cell lines to therapeutic drugs accurately. As a result, arbitrary human judgment calls are rampant throughout the predictive modeling pipeline. In thi…

    Submitted 16 November, 2020; v1 submitted 12 November, 2020; originally announced November 2020.

    Comments: Machine Learning for Health (ML4H) at NeurIPS 2020 - Extended Abstract

  30. arXiv:2008.10109  [pdf, other]

    stat.ME cs.LG stat.AP

    Stable discovery of interpretable subgroups via calibration in causal studies

    Authors: Raaz Dwivedi, Yan Shuo Tan, Briton Park, Mian Wei, Kevin Horgan, David Madigan, Bin Yu

    Abstract: Building on Yu and Kumbier's PCS framework and for randomized experiments, we introduce a novel methodology for Stable Discovery of Interpretable Subgroups via Calibration (StaDISC), with large heterogeneous treatment effects. StaDISC was developed during our re-analysis of the 1999-2000 VIGOR study, an 8076 patient randomized controlled trial (RCT), that compared the risk of adverse events from a…

    Submitted 28 September, 2020; v1 submitted 23 August, 2020; originally announced August 2020.

    Comments: Raaz Dwivedi and Yan Shuo Tan are joint first authors and contributed equally to this work. 52 pages, 8 Figures, 9 Tables. To appear in International Statistical Review, 2020

  31. arXiv:2006.10189  [pdf, other]

    cs.LG cs.IT math.ST stat.ML

    Revisiting minimum description length complexity in overparameterized models

    Authors: Raaz Dwivedi, Chandan Singh, Bin Yu, Martin J. Wainwright

    Abstract: Complexity is a fundamental concept underlying statistical learning theory that aims to inform generalization performance. Parameter count, while successful in low-dimensional settings, is not well-justified for overparameterized settings when the number of parameters is more than the number of training samples. We revisit complexity measures based on Rissanen's principle of minimum description le…

    Submitted 12 October, 2023; v1 submitted 17 June, 2020; originally announced June 2020.

    Comments: First two authors contributed equally

  32. arXiv:2006.07841  [pdf, other]

    cs.LG stat.ML

    Classify and Generate Reciprocally: Simultaneous Positive-Unlabelled Learning and Conditional Generation with Extra Data

    Authors: Bing Yu, Ke Sun, He Wang, Zhouchen Lin, Zhanxing Zhu

    Abstract: The scarcity of class-labeled data is a ubiquitous bottleneck in many machine learning problems. While abundant unlabeled data typically exist and provide a potential solution, it is highly challenging to exploit them. In this paper, we address this problem by leveraging Positive-Unlabeled (PU) classification and the conditional generation with extra unlabeled data simultaneously. In partic…

    Submitted 8 February, 2024; v1 submitted 14 June, 2020; originally announced June 2020.

  33. Knowledge Distillation: A Survey

    Authors: Jianping Gou, Baosheng Yu, Stephen John Maybank, Dacheng Tao

    Abstract: In recent years, deep neural networks have been successful in both industry and academia, especially for computer vision tasks. The great success of deep learning is mainly due to its scalability to encode large-scale data and to maneuver billions of model parameters. However, it is a challenge to deploy these cumbersome deep models on devices with limited resources, e.g., mobile phones and embedd…

    Submitted 20 May, 2021; v1 submitted 9 June, 2020; originally announced June 2020.

    Comments: It has been accepted for publication in International Journal of Computer Vision (2021)

  34. arXiv:2005.12781  [pdf, other]

    cs.LG cs.IR stat.ML

    How to Grow a (Product) Tree: Personalized Category Suggestions for eCommerce Type-Ahead

    Authors: Jacopo Tagliabue, Bingqing Yu, Marie Beaulieu

    Abstract: In an attempt to balance precision and recall in the search page, leading digital shops have been effectively nudging users into select category facets as early as in the type-ahead suggestions. In this work, we present SessionPath, a novel neural network model that improves facet suggestions on two counts: first, the model is able to leverage session embeddings to provide scalable personalization…

    Submitted 26 May, 2020; originally announced May 2020.

  35. arXiv:2005.11411  [pdf, other]

    cs.LG math.ST stat.ML

    Instability, Computational Efficiency and Statistical Accuracy

    Authors: Nhat Ho, Koulik Khamaru, Raaz Dwivedi, Martin J. Wainwright, Michael I. Jordan, Bin Yu

    Abstract: Many statistical estimators are defined as the fixed point of a data-dependent operator, with estimators based on minimizing a cost function being an important special case. The limiting performance of such estimators depends on the properties of the population-level operator in the idealized limit of infinitely many samples. We develop a general framework that yields bounds on statistical accurac…

    Submitted 20 March, 2022; v1 submitted 22 May, 2020; originally announced May 2020.

    Comments: 68 pages, 6 Figures, 2 Tables. First three authors contributed equally

  36. Curating a COVID-19 data repository and forecasting county-level death counts in the United States

    Authors: Nick Altieri, Rebecca L. Barter, James Duncan, Raaz Dwivedi, Karl Kumbier, Xiao Li, Robert Netzorg, Briton Park, Chandan Singh, Yan Shuo Tan, Tiffany Tang, Yu Wang, Chao Zhang, Bin Yu

    Abstract: As the COVID-19 outbreak evolves, accurate forecasting continues to play an extremely important role in informing policy decisions. In this paper, we present our continuous curation of a large data repository containing COVID-19 information from a range of sources. We use this data to develop predictions and corresponding prediction intervals for the short-term trajectory of COVID-19 cumulative de…

    Submitted 9 August, 2020; v1 submitted 16 May, 2020; originally announced May 2020.

    Comments: Authors ordered alphabetically. All authors contributed significantly to this work. All collected data, modeling code, forecasts, and visualizations are updated daily and available at https://github.com/Yu-Group/covid19-severity-prediction

    Journal ref: Published in Harvard Data Science Review, 2020

  37. arXiv:2003.07160  [pdf, other]

    cs.IR cs.LG stat.ML

    "An Image is Worth a Thousand Features": Scalable Product Representations for In-Session Type-Ahead Personalization

    Authors: Bingqing Yu, Jacopo Tagliabue, Ciro Greco, Federico Bianchi

    Abstract: We address the problem of personalizing query completion in a digital commerce setting, in which the bounce rate is typically high and recurring users are rare. We focus on in-session personalization and improve a standard noisy channel model by injecting dense vectors computed from product images at query time. We argue that image-based personalization displays several advantages over alternative…

    Submitted 11 March, 2020; originally announced March 2020.

    ACM Class: I.2.6; I.2.7

  38. arXiv:2003.01926  [pdf, other]

    stat.ML astro-ph.IM cs.LG

    Transformation Importance with Applications to Cosmology

    Authors: Chandan Singh, Wooseok Ha, Francois Lanusse, Vanessa Boehm, Jia Liu, Bin Yu

    Abstract: Machine learning lies at the heart of new possibilities for scientific discovery, knowledge generation, and artificial intelligence. Its potential benefits to these fields requires going beyond predictive accuracy and focusing on interpretability. In particular, many scientific problems require interpretations in a domain-specific interpretable feature space (e.g. the frequency domain) whereas att…

    Submitted 14 June, 2021; v1 submitted 4 March, 2020; originally announced March 2020.

    Comments: Published in ICLR 2020 Workshop on Fundamental Science in the era of AI

  39. arXiv:1912.07254  [pdf, other]

    cs.LG stat.ML

    VLSI Mask Optimization: From Shallow To Deep Learning

    Authors: Haoyu Yang, Wei Zhong, Yuzhe Ma, Hao Geng, Ran Chen, Wanli Chen, Bei Yu

    Abstract: VLSI mask optimization is one of the most critical stages in manufacturability aware design, which is costly due to the complicated mask optimization and lithography simulation. Recent researches have shown prominent advantages of machine learning techniques dealing with complicated and big data problems, which bring potential of dedicated machine learning solution for DFM problems and facilitate…

    Submitted 16 December, 2019; originally announced December 2019.

    Comments: 6 pages; accepted by 25th Asia and South Pacific Design Automation Conference (ASP-DAC 2020)

  40. arXiv:1912.05796  [pdf, other]

    cs.LG cs.AI stat.ML

    Automatic Layout Generation with Applications in Machine Learning Engine Evaluation

    Authors: Haoyu Yang, Wen Chen, Piyush Pathak, Frank Gennari, Ya-Chieh Lai, Bei Yu

    Abstract: Machine learning-based lithography hotspot detection has been deeply studied recently, from varies feature extraction techniques to efficient learning models. It has been observed that such machine learning-based frameworks are providing satisfactory metal layer hotspot prediction results on known public metal layer benchmarks. In this work, we seek to evaluate how these machine learning-based hot…

    Submitted 12 December, 2019; originally announced December 2019.

    Comments: 6 pages, submitted to 1st ACM/IEEE Workshop on Machine Learning for CAD (MLCAD) for review

  41. arXiv:1911.09307  [pdf, other]

    cs.LG stat.ML

    Patch-level Neighborhood Interpolation: A General and Effective Graph-based Regularization Strategy

    Authors: Ke Sun, Bing Yu, Zhouchen Lin, Zhanxing Zhu

    Abstract: Regularization plays a crucial role in machine learning models, especially for deep neural networks. The existing regularization techniques mainly rely on the i.i.d. assumption and only consider the knowledge from the current sample, without the leverage of the neighboring relationship between samples. In this work, we propose a general regularizer called Patch-level Neighborhood Interpola…

    Submitted 22 October, 2023; v1 submitted 21 November, 2019; originally announced November 2019.

    Comments: Accepted in ACML 2023 conference track

  42. arXiv:1911.02549  [pdf, other]

    cs.LG cs.PF stat.ML

    MLPerf Inference Benchmark

    Authors: Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, Ramesh Chukka, Cody Coleman, Sam Davis, Pan Deng, Greg Diamos, Jared Duke, Dave Fick, J. Scott Gardner, Itay Hubara, Sachin Idgunji, Thomas B. Jablin, Jeff Jiao, Tom St. John, Pankaj Kanwar, David Lee, et al. (22 additional authors not shown)

    Abstract: Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that incorporate existing models span at least three orders of magnitude in power consumption and five orders of magnitude in performance; they range from embedded devic…

    Submitted 9 May, 2020; v1 submitted 6 November, 2019; originally announced November 2019.

    Comments: ISCA 2020

  43. arXiv:1909.13584  [pdf, other]

    cs.LG cs.CV stat.ML

    Interpretations are useful: penalizing explanations to align neural networks with prior knowledge

    Authors: Laura Rieger, Chandan Singh, W. James Murdoch, Bin Yu

    Abstract: For an explanation of a deep learning model to be effective, it must provide both insight into a model and suggest a corresponding action in order to achieve some objective. Too often, the litany of proposed explainable deep learning methods stop at the first step, providing practitioners with insight into a model, but no way to act on it. In this paper, we propose contextual decomposition explana…

    Submitted 8 October, 2020; v1 submitted 30 September, 2019; originally announced September 2019.

    Comments: 18 pages; published in ICML2020; Erratum: numbers in table 1 were too high (now corrected) with the trend remaining the same

  44. arXiv:1907.13258  [pdf, other]

    stat.ME

    Incremental causal effects

    Authors: Dominik Rothenhäusler, Bin Yu

    Abstract: Causal evidence is needed to act and it is often enough for the evidence to point towards a direction of the effect of an action. For example, policymakers might be interested in estimating the effect of slightly increasing taxes on private spending across the whole population. We study identifiability and estimation of causal effects, where a continuous treatment is slightly shifted across the wh…

    Submitted 7 August, 2020; v1 submitted 30 July, 2019; originally announced July 2019.

  45. arXiv:1906.10845  [pdf, other]

    stat.ML cs.LG

    A Debiased MDI Feature Importance Measure for Random Forests

    Authors: Xiao Li, Yu Wang, Sumanta Basu, Karl Kumbier, Bin Yu

    Abstract: Tree ensembles such as Random Forests have achieved impressive empirical success across a wide variety of applications. To understand how these models make predictions, people routinely turn to feature importance measures calculated from tree ensembles. It has long been known that Mean Decrease Impurity (MDI), one of the most widely used measures of feature importance, incorrectly assigns high imp…

    Submitted 26 October, 2019; v1 submitted 26 June, 2019; originally announced June 2019.

    Comments: NeurIPS'19. The first two authors contributed equally to this paper

  46. arXiv:1906.10773  [pdf, other]

    cs.LG cs.CR stat.ML

    Are Adversarial Perturbations a Showstopper for ML-Based CAD? A Case Study on CNN-Based Lithographic Hotspot Detection

    Authors: Kang Liu, Haoyu Yang, Yuzhe Ma, Benjamin Tan, Bei Yu, Evangeline F. Y. Young, Ramesh Karri, Siddharth Garg

    Abstract: There is substantial interest in the use of machine learning (ML) based techniques throughout the electronic computer-aided design (CAD) flow, particularly those based on deep learning. However, while deep learning methods have surpassed state-of-the-art performance in several applications, they have exhibited intrinsic susceptibility to adversarial perturbations --- small but deliberate alteratio…

    Submitted 25 June, 2019; originally announced June 2019.

    Journal ref: ACM Trans. Des. Autom. Electron. Syst. 25, 5, Article 48 (August 2020)

  47. arXiv:1905.12247  [pdf, other]

    stat.ML cs.LG stat.CO

    Fast mixing of Metropolized Hamiltonian Monte Carlo: Benefits of multi-step gradients

    Authors: Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, Bin Yu

    Abstract: Hamiltonian Monte Carlo (HMC) is a state-of-the-art Markov chain Monte Carlo sampling algorithm for drawing samples from smooth probability densities over continuous spaces. We study the variant most widely used in practice, Metropolized HMC with the Störmer-Verlet or leapfrog integrator, and make two primary contributions. First, we provide a non-asymptotic upper bound on the mixing time of the M…

    Submitted 11 January, 2021; v1 submitted 29 May, 2019; originally announced May 2019.

    Comments: 73 pages, 2 figures, fixed a mistake in the proof of Lemma 11, accepted in JMLR

  48. arXiv:1905.10157  [pdf, other]

    cs.LG stat.ML

    On the Learning Dynamics of Two-layer Nonlinear Convolutional Neural Networks

    Authors: Bing Yu, Junzhao Zhang, Zhanxing Zhu

    Abstract: Convolutional neural networks (CNNs) have achieved remarkable performance in various fields, particularly in the domain of computer vision. However, why this architecture works well remains to be a mystery. In this work we move a small step toward understanding the success of CNNs by investigating the learning dynamics of a two-layer nonlinear convolutional neural network over some specific data d…

    Submitted 24 May, 2019; originally announced May 2019.

  49. arXiv:1905.07631  [pdf, other]

    stat.ML cs.LG stat.ME

    Disentangled Attribution Curves for Interpreting Random Forests and Boosted Trees

    Authors: Summer Devlin, Chandan Singh, W. James Murdoch, Bin Yu

    Abstract: Tree ensembles, such as random forests and AdaBoost, are ubiquitous machine learning models known for achieving strong predictive performance across a wide variety of domains. However, this strong performance comes at the cost of interpretability (i.e. users are unable to understand the relationships a trained random forest has learned and why it is making its predictions). In particular, it is ch…

    Submitted 18 May, 2019; originally announced May 2019.

    Comments: Under review

  50. arXiv:1905.01078  [pdf, other]

    cs.LG cs.CR stat.ML

    CharBot: A Simple and Effective Method for Evading DGA Classifiers

    Authors: Jonathan Peck, Claire Nie, Raaghavi Sivaguru, Charles Grumer, Femi Olumofin, Bin Yu, Anderson Nascimento, Martine De Cock

    Abstract: Domain generation algorithms (DGAs) are commonly leveraged by malware to create lists of domain names which can be used for command and control (C&C) purposes. Approaches based on machine learning have recently been developed to automatically detect generated domain names in real-time. In this work, we present a novel DGA called CharBot which is capable of producing large numbers of unregistered d…

    Submitted 30 May, 2019; v1 submitted 3 May, 2019; originally announced May 2019.