Showing 1–32 of 32 results for author: Rostamizadeh, A

  1. arXiv:2401.13160  [pdf, other]

    cs.LG cs.CL

    SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection

    Authors: Ke Ye, Heinrich Jiang, Afshin Rostamizadeh, Ayan Chakrabarti, Giulia DeSalvo, Jean-François Kagy, Lazaros Karydas, Gui Citovsky, Sanjiv Kumar

    Abstract: Pre-training large language models is known to be extremely resource intensive and oftentimes inefficient, under-utilizing the information encapsulated in the training text sequences. In this paper, we present SpacTor, a new training procedure consisting of (1) a hybrid objective combining span corruption (SC) and token replacement detection (RTD), and (2) a two-stage curriculum that optimizes th…

    Submitted 23 January, 2024; originally announced January 2024.

    Comments: 9+13 pages, 5 figures

  2. arXiv:2310.08461  [pdf, other]

    cs.CL cs.AI cs.LG

    DistillSpec: Improving Speculative Decoding via Knowledge Distillation

    Authors: Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal

    Abstract: Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens, which are then verified in parallel by the larger target model, resulting in the text generated according to the target model distribution. However, identifying a compact draft model that is well-aligned with the target model is challenging. To tackle this issue, w…

    Submitted 30 March, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

  3. arXiv:2301.12052  [pdf, other]

    cs.LG cs.AI stat.ML

    Leveraging Importance Weights in Subset Selection

    Authors: Gui Citovsky, Giulia DeSalvo, Sanjiv Kumar, Srikumar Ramalingam, Afshin Rostamizadeh, Yunjuan Wang

    Abstract: We present a subset selection algorithm designed to work with arbitrary model families in a practical batch setting. In such a setting, an algorithm can sample examples one at a time but, in order to limit overhead costs, is only able to update its state (i.e. further train model weights) once a large enough batch of examples is selected. Our algorithm, IWeS, selects examples by importance samplin…

    Submitted 27 January, 2023; originally announced January 2023.

    Comments: ICLR 2023

  4. arXiv:2210.03822  [pdf, other]

    cs.LG cs.AI

    Is margin all you need? An extensive empirical study of active learning on tabular data

    Authors: Dara Bahri, Heinrich Jiang, Tal Schuster, Afshin Rostamizadeh

    Abstract: Given a labeled training set and a collection of unlabeled data, the goal of active learning (AL) is to identify the best unlabeled points to label. In this comprehensive study, we analyze the performance of a variety of AL algorithms on deep neural networks trained on 69 real-world tabular classification datasets from the OpenML-CC18 benchmark. We consider different data regimes and the effect of…

    Submitted 7 October, 2022; originally announced October 2022.

  5. arXiv:2107.14263  [pdf, other]

    cs.LG cs.AI

    Batch Active Learning at Scale

    Authors: Gui Citovsky, Giulia DeSalvo, Claudio Gentile, Lazaros Karydas, Anand Rajagopalan, Afshin Rostamizadeh, Sanjiv Kumar

    Abstract: The ability to train complex and highly effective models often requires an abundance of training data, which can easily become a bottleneck in cost, time, and computational resources. Batch active learning, which adaptively issues batched queries to a labeling oracle, is a common approach for addressing this problem. The practical benefits of batch sampling come with the downside of less adaptivit…

    Submitted 29 July, 2021; originally announced July 2021.

  6. arXiv:2106.02654  [pdf, other]

    cs.LG cs.AI stat.ML

    Churn Reduction via Distillation

    Authors: Heinrich Jiang, Harikrishna Narasimhan, Dara Bahri, Andrew Cotter, Afshin Rostamizadeh

    Abstract: In real-world systems, models are frequently updated as more data becomes available, and in addition to achieving high accuracy, the goal is to also maintain a low difference in predictions compared to the base model (i.e. predictive "churn"). If model retraining results in vastly different behavior, then it could cause negative effects in downstream systems, especially if this churn can be avoide…

    Submitted 14 March, 2022; v1 submitted 4 June, 2021; originally announced June 2021.

    Journal ref: ICLR 2022

  7. arXiv:2106.02552  [pdf, other]

    cs.LG cs.AI stat.ML

    Active Covering

    Authors: Heinrich Jiang, Afshin Rostamizadeh

    Abstract: We analyze the problem of active covering, where the learner is given an unlabeled dataset and can sequentially label query examples. The objective is to query the labels of all of the positive examples in the fewest number of total label queries. We show under standard non-parametric assumptions that a classical support estimator can be repurposed as an offline algorithm attaining an excess query cost of…

    Submitted 4 June, 2021; originally announced June 2021.

    Comments: ICML 2021

  8. arXiv:2010.05273  [pdf, other]

    cs.LG cs.AI stat.ML

    Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms

    Authors: Maruan Al-Shedivat, Jennifer Gillenwater, Eric Xing, Afshin Rostamizadeh

    Abstract: Federated learning is typically approached as an optimization problem, where the goal is to minimize a global loss function by distributing computation across client devices that possess local data and specify different parts of the global objective. We present an alternative perspective and formulate federated learning as a posterior inference problem, where the goal is to infer a global posterio…

    Submitted 29 January, 2021; v1 submitted 11 October, 2020; originally announced October 2020.

    Comments: ICLR 2021. Code: https://github.com/alshedivat/fedpa

  9. arXiv:2006.14616  [pdf, ps, other]

    cs.CV

    An Analysis of SVD for Deep Rotation Estimation

    Authors: Jake Levinson, Carlos Esteves, Kefan Chen, Noah Snavely, Angjoo Kanazawa, Afshin Rostamizadeh, Ameesh Makadia

    Abstract: Symmetric orthogonalization via SVD, and closely related procedures, are well-known techniques for projecting matrices onto $O(n)$ or $SO(n)$. These tools have long been used for applications in computer vision, for example optimal 3D alignment problems solved by orthogonal Procrustes, rotation averaging, or Essential matrix decomposition. Despite its utility in different settings, SVD orthogonali…

    Submitted 25 June, 2020; originally announced June 2020.

  10. arXiv:1912.00594  [pdf, other]

    cs.LG stat.ML

    Combining MixMatch and Active Learning for Better Accuracy with Fewer Labels

    Authors: Shuang Song, David Berthelot, Afshin Rostamizadeh

    Abstract: We propose using active learning based techniques to further improve the state-of-the-art semi-supervised learning MixMatch algorithm. We provide a thorough empirical evaluation of several active-learning and baseline methods, which successfully demonstrate a significant improvement on the benchmark CIFAR-10, CIFAR-100, and SVHN datasets (as much as 1.5% in absolute accuracy). We also provide an e…

    Submitted 2 December, 2019; v1 submitted 2 December, 2019; originally announced December 2019.

  11. arXiv:1907.00038  [pdf, ps, other]

    cs.LG stat.ML

    The Practical Challenges of Active Learning: Lessons Learned from Live Experimentation

    Authors: Jean-François Kagy, Tolga Kayadelen, Ji Ma, Afshin Rostamizadeh, Jana Strnadova

    Abstract: We tested in a live setting the use of active learning for selecting text sentences for human annotations used in training a Thai segmentation machine learning model. In our study, two concurrent annotated samples were constructed, one through random sampling of sentences from a text corpus, and the other through model-based scoring and ranking of sentences from the same corpus. In the course of t…

    Submitted 28 June, 2019; originally announced July 2019.

    Comments: Presented at 2019 ICML Workshop on Human in the Loop Learning (HILL 2019), Long Beach, USA

  12. arXiv:1904.13389  [pdf, other]

    cs.LG cs.AI cs.DS cs.IT stat.ML

    Categorical Feature Compression via Submodular Optimization

    Authors: MohammadHossein Bateni, Lin Chen, Hossein Esfandiari, Thomas Fu, Vahab S. Mirrokni, Afshin Rostamizadeh

    Abstract: In the era of big data, learning from categorical features with very large vocabularies (e.g., 28 million for the Criteo click prediction dataset) has become a practical challenge for machine learning researchers and practitioners. We design a highly-scalable vocabulary compression algorithm that seeks to maximize the mutual information between the compressed categorical feature and the target bin…

    Submitted 30 April, 2019; originally announced April 2019.

    Comments: Accepted to ICML 2019. Authors are listed in alphabetical order

  13. arXiv:1904.03257  [pdf, ps, other]

    cs.LG cs.DB cs.DC cs.SE stat.ML

    MLSys: The New Frontier of Machine Learning Systems

    Authors: Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, et al. (44 additional authors not shown)

    Abstract: Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a ne…

    Submitted 1 December, 2019; v1 submitted 29 March, 2019; originally announced April 2019.

  14. arXiv:1810.05934  [pdf, other]

    cs.LG stat.ML

    A System for Massively Parallel Hyperparameter Tuning

    Authors: Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Moritz Hardt, Benjamin Recht, Ameet Talwalkar

    Abstract: Modern learning models are characterized by large hyperparameter spaces and long training times. These properties, coupled with the rise of parallel computing and the growing demand to productionize machine learning workloads, motivate the need to develop mature hyperparameter optimization functionality in distributed computing settings. We address this challenge by first introducing a simple and…

    Submitted 15 March, 2020; v1 submitted 13 October, 2018; originally announced October 2018.

    Comments: v2: Corrected typo in Algorithm 1. v3: Added comparison to BOHB and parallel version of synchronous SHA; added PBT to experiment in Section 4.3.1. v4: Added acknowledgements and slight edit to related work.

    Journal ref: Conference on Machine Learning and Systems 2020

  15. arXiv:1806.10175  [pdf, other]

    stat.ML cs.IT cs.LG

    Learning a Compressed Sensing Measurement Matrix via Gradient Unrolling

    Authors: Shanshan Wu, Alexandros G. Dimakis, Sujay Sanghavi, Felix X. Yu, Daniel Holtmann-Rice, Dmitry Storcheus, Afshin Rostamizadeh, Sanjiv Kumar

    Abstract: Linear encoding of sparse vectors is widely popular, but is commonly data-independent -- missing any possible extra (but a priori unknown) structure beyond sparsity. In this paper we present a new method to learn linear encoders that adapt to data, while still performing well with the widely used $\ell_1$ decoder. The convex $\ell_1$ decoder prevents gradient propagation as needed in standard grad…

    Submitted 2 July, 2019; v1 submitted 26 June, 2018; originally announced June 2018.

    Comments: 17 pages, 7 tables, 8 figures, published in ICML 2019; part of this work was done while Shanshan was an intern at Google Research, New York

  16. arXiv:1605.08795  [pdf, other]

    cs.DS

    Greedy Column Subset Selection: New Bounds and Distributed Algorithms

    Authors: Jason Altschuler, Aditya Bhaskara, Gang Fu, Vahab Mirrokni, Afshin Rostamizadeh, Morteza Zadimoghaddam

    Abstract: The problem of column subset selection has recently attracted a large body of research, with feature selection serving as one obvious and important application. Among the techniques that have been applied to solve this problem, the greedy algorithm has been shown to be quite effective in practice. However, theoretical guarantees on its performance have not been explored thoroughly, especially in a…

    Submitted 11 June, 2016; v1 submitted 27 May, 2016; originally announced May 2016.

    Comments: to appear in International Conference on Machine Learning (ICML) 2016

    Journal ref: Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:2539-2548, 2016

  17. arXiv:1603.06560  [pdf, other]

    cs.LG stat.ML

    Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization

    Authors: Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, Ameet Talwalkar

    Abstract: Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While recent approaches use Bayesian optimization to adaptively select configurations, we focus on speeding up random search through adaptive resource allocation and early-stopping. We formulate hyperparameter optimization as a pure-exploration non-stochastic infinite-armed bandit problem wh…

    Submitted 18 June, 2018; v1 submitted 21 March, 2016; originally announced March 2016.

    Comments: Changes: Updated to JMLR version

    Journal ref: Journal of Machine Learning Research 18 (2018) 1-52

  18. arXiv:1509.08880  [pdf, ps, other]

    stat.ML cs.LG

    Foundations of Coupled Nonlinear Dimensionality Reduction

    Authors: Mehryar Mohri, Afshin Rostamizadeh, Dmitry Storcheus

    Abstract: In this paper we introduce and analyze the learning scenario of \emph{coupled nonlinear dimensionality reduction}, which combines two major steps of machine learning pipeline: projection onto a manifold and subsequent supervised learning. First, we present new generalization bounds for this scenario and, second, we introduce an algorithm that follows from these bounds. The generalization error bou…

    Submitted 25 November, 2015; v1 submitted 29 September, 2015; originally announced September 2015.

    Comments: 12 pages, 3 figures, authors in alphabetical order

  19. arXiv:1504.01117  [pdf, ps, other]

    cs.DB

    An $\tilde{O}(\frac{1}{\sqrt{T}})$-error online algorithm for retrieving heavily perturbated statistical databases in the low-dimensional querying mode

    Authors: Krzysztof Choromanski, Afshin Rostamizadeh, Umar Syed

    Abstract: We give the first $\tilde{O}(\frac{1}{\sqrt{T}})$-error online algorithm for reconstructing noisy statistical databases, where $T$ is the number of (online) sample queries received. The algorithm, which requires only $O(\log T)$ memory, aims to learn a hidden database-vector $w^{*} \in \mathbb{R}^{D}$ in order to accurately answer a stream of queries regarding the hidden database, which arrive in…

    Submitted 5 April, 2015; originally announced April 2015.

  20. arXiv:1408.2044  [pdf]

    cs.LG stat.ML

    Matrix Coherence and the Nystrom Method

    Authors: Ameet Talwalkar, Afshin Rostamizadeh

    Abstract: The Nystrom method is an efficient technique used to speed up large-scale learning applications by generating low-rank approximations. Crucial to the performance of this technique is the assumption that a matrix can be well approximated by working exclusively with a subset of its columns. In this work we relate this assumption to the concept of matrix coherence, connecting coherence to the perform…

    Submitted 9 August, 2014; originally announced August 2014.

    Comments: Appears in Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI2010)

    Report number: UAI-P-2010-PG-572-579

  21. arXiv:1311.6838  [pdf, ps, other]

    cs.LG cs.GT

    Learning Prices for Repeated Auctions with Strategic Buyers

    Authors: Kareem Amin, Afshin Rostamizadeh, Umar Syed

    Abstract: Inspired by real-time ad exchanges for online display advertising, we consider the problem of inferring a buyer's value distribution for a good when the buyer is repeatedly interacting with a seller through a posted-price mechanism. We model the buyer as a strategic agent, whose goal is to maximize her long-term surplus, and we are interested in mechanisms that maximize the seller's long-term reve…

    Submitted 26 November, 2013; originally announced November 2013.

    Comments: Neural Information Processing Systems (NIPS 2013)

  22. arXiv:1305.0208  [pdf, ps, other]

    cs.LG

    Perceptron Mistake Bounds

    Authors: Mehryar Mohri, Afshin Rostamizadeh

    Abstract: We present a brief survey of existing mistake bounds and introduce novel bounds for the Perceptron or the kernel Perceptron algorithm. Our novel bounds generalize beyond standard margin-loss type bounds, allow for any convex and Lipschitz loss function, and admit a very simple proof.

    Submitted 22 July, 2013; v1 submitted 1 May, 2013; originally announced May 2013.

  23. arXiv:1205.2653  [pdf]

    cs.LG stat.ML

    L2 Regularization for Learning Kernels

    Authors: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

    Abstract: The choice of the kernel is critical to the success of many learning algorithms but it is typically left to the user. Instead, the training data can be used to learn the kernel by selecting it out of a given family, such as that of non-negative linear combinations of p base kernels, constrained by a trace or L1 regularization. This paper studies the problem of learning kernels with the same family…

    Submitted 9 May, 2012; originally announced May 2012.

    Comments: Appears in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI2009)

    Report number: UAI-P-2009-PG-109-116

  24. arXiv:1205.2628  [pdf]

    cs.LG stat.ML

    Multiple Source Adaptation and the Renyi Divergence

    Authors: Yishay Mansour, Mehryar Mohri, Afshin Rostamizadeh

    Abstract: This paper presents a novel theoretical study of the general problem of multiple source adaptation using the notion of Renyi divergence. Our results build on our previous work [12], but significantly broaden the scope of that work in several directions. We extend previous multiple source loss guarantees based on distribution weighted combinations to arbitrary target distributions P, not necessaril…

    Submitted 9 May, 2012; originally announced May 2012.

    Comments: Appears in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI2009)

    Report number: UAI-P-2009-PG-367-374

  25. arXiv:1203.0550  [pdf, other]

    cs.LG cs.AI

    Algorithms for Learning Kernels Based on Centered Alignment

    Authors: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

    Abstract: This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for learning kernels based on convex combinations of base kernels in both classification and regression. O…

    Submitted 29 April, 2024; v1 submitted 2 March, 2012; originally announced March 2012.

    Journal ref: Journal of Machine Learning Research 13 (2012) 795-828

  26. arXiv:1202.3712  [pdf]

    cs.LG stat.ML

    Ensembles of Kernel Predictors

    Authors: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

    Abstract: This paper examines the problem of learning with a finite and possibly large set of p base kernels. It presents a theoretical and empirical analysis of an approach addressing this problem based on ensembles of kernel predictors. This includes novel theoretical guarantees based on the Rademacher complexity of the corresponding hypothesis sets, the introduction and analysis of a learning algorithm b…

    Submitted 14 February, 2012; originally announced February 2012.

    Report number: UAI-P-2011-PG-145-152

  27. arXiv:1104.0729  [pdf, ps, other]

    cs.LG stat.ML

    Online and Batch Learning Algorithms for Data with Missing Features

    Authors: Afshin Rostamizadeh, Alekh Agarwal, Peter Bartlett

    Abstract: We introduce new online and batch algorithms that are robust to data with missing features, a situation that arises in many practical applications. In the online setup, we allow for the comparison hypothesis to change as a function of the subset of features that is observed on any given round, extending the standard setting where the comparison hypothesis is fixed throughout. In the batch setup, w…

    Submitted 16 June, 2011; v1 submitted 5 April, 2011; originally announced April 2011.

    Journal ref: 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011)

  28. arXiv:1004.2008  [pdf, ps, other]

    cs.AI

    Matrix Coherence and the Nystrom Method

    Authors: Ameet Talwalkar, Afshin Rostamizadeh

    Abstract: The Nystrom method is an efficient technique to speed up large-scale learning applications by generating low-rank approximations. Crucial to the performance of this technique is the assumption that a matrix can be well approximated by working exclusively with a subset of its columns. In this work we relate this assumption to the concept of matrix coherence and connect matrix coherence to the per…

    Submitted 12 April, 2010; originally announced April 2010.

  29. arXiv:0912.3309  [pdf, ps, other]

    cs.AI

    New Generalization Bounds for Learning Kernels

    Authors: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

    Abstract: This paper presents several novel generalization bounds for the problem of learning kernels based on the analysis of the Rademacher complexity of the corresponding hypothesis sets. Our bound for learning kernels with a convex combination of p base kernels has only a log(p) dependency on the number of kernels, p, which is considerably more favorable than the previous best bound given for the same…

    Submitted 16 December, 2009; originally announced December 2009.

  30. arXiv:0902.3430  [pdf, ps, other]

    cs.LG cs.AI

    Domain Adaptation: Learning Bounds and Algorithms

    Authors: Yishay Mansour, Mehryar Mohri, Afshin Rostamizadeh

    Abstract: This paper addresses the general problem of domain adaptation which arises in a variety of applications where the distribution of the labeled sample available somewhat differs from that of the test data. Building on previous work by Ben-David et al. (2007), we introduce a novel distance between distributions, discrepancy distance, that is tailored to adaptation problems with arbitrary loss functio…

    Submitted 30 November, 2023; v1 submitted 19 February, 2009; originally announced February 2009.

    Comments: 12 pages, 4 figures

  31. arXiv:0811.1629  [pdf, ps, other]

    cs.LG

    Stability Bound for Stationary Phi-mixing and Beta-mixing Processes

    Authors: Mehryar Mohri, Afshin Rostamizadeh

    Abstract: Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stabi…

    Submitted 11 November, 2008; originally announced November 2008.

    Comments: 23 pages, 1 figure, submitted to JMLR

  32. arXiv:0805.2775  [pdf, ps, other]

    cs.LG

    Sample Selection Bias Correction Theory

    Authors: Corinna Cortes, Mehryar Mohri, Michael Riley, Afshin Rostamizadeh

    Abstract: This paper presents a theoretical analysis of sample selection bias correction. The sample bias correction technique commonly used in machine learning consists of reweighting the cost of an error on each training point of a biased sample to more closely reflect the unbiased distribution. This relies on weights derived by various estimation techniques based on finite samples. We analyze the effec…

    Submitted 18 May, 2008; originally announced May 2008.

    Comments: 16 pages