Search | arXiv e-print repository

SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection

Authors: Ke Ye, Heinrich Jiang, Afshin Rostamizadeh, Ayan Chakrabarti, Giulia DeSalvo, Jean-François Kagy, Lazaros Karydas, Gui Citovsky, Sanjiv Kumar

Abstract: Pre-training large language models is known to be extremely resource intensive and often times inefficient, under-utilizing the information encapsulated in the training text sequences. In this paper, we present SpacTor, a new training procedure consisting of (1) a hybrid objective combining span corruption (SC) and token replacement detection (RTD), and (2) a two-stage curriculum that optimizes th… ▽ More Pre-training large language models is known to be extremely resource intensive and often times inefficient, under-utilizing the information encapsulated in the training text sequences. In this paper, we present SpacTor, a new training procedure consisting of (1) a hybrid objective combining span corruption (SC) and token replacement detection (RTD), and (2) a two-stage curriculum that optimizes the hybrid objective over the initial $τ$ iterations, then transitions to standard SC loss. We show empirically that the effectiveness of the hybrid objective is tied to the two-stage pre-training schedule, and provide extensive analysis on why this is the case. In our experiments with encoder-decoder architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same downstream performance as standard SC pre-training, while enabling a 50% reduction in pre-training iterations and 40% reduction in total FLOPs. Alternatively, given the same amount of computing budget, we find that SpacTor results in significantly improved downstream benchmark performance. △ Less

Submitted 23 January, 2024; originally announced January 2024.

Comments: 9+13 pages, 5 figures

arXiv:2310.08461 [pdf, other]

DistillSpec: Improving Speculative Decoding via Knowledge Distillation

Authors: Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal

Abstract: Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens, which are then verified in parallel by the larger target model, resulting in the text generated according to the target model distribution. However, identifying a compact draft model that is well-aligned with the target model is challenging. To tackle this issue, w… ▽ More Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens, which are then verified in parallel by the larger target model, resulting in the text generated according to the target model distribution. However, identifying a compact draft model that is well-aligned with the target model is challenging. To tackle this issue, we propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD. DistillSpec makes two key design choices, which we demonstrate via systematic study to be crucial to improving the draft and target alignment: utilizing on-policy data generation from the draft model, and tailoring the divergence function to the task and decoding strategy. Notably, DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks, using both greedy and non-greedy sampling. Furthermore, we combine DistillSpec with lossy SD to achieve fine-grained control over the latency vs. task performance trade-off. Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6-10x with minimal performance drop, compared to standard decoding without distillation. △ Less

Submitted 30 March, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

arXiv:2301.12052 [pdf, other]

Leveraging Importance Weights in Subset Selection

Authors: Gui Citovsky, Giulia DeSalvo, Sanjiv Kumar, Srikumar Ramalingam, Afshin Rostamizadeh, Yunjuan Wang

Abstract: We present a subset selection algorithm designed to work with arbitrary model families in a practical batch setting. In such a setting, an algorithm can sample examples one at a time but, in order to limit overhead costs, is only able to update its state (i.e. further train model weights) once a large enough batch of examples is selected. Our algorithm, IWeS, selects examples by importance samplin… ▽ More We present a subset selection algorithm designed to work with arbitrary model families in a practical batch setting. In such a setting, an algorithm can sample examples one at a time but, in order to limit overhead costs, is only able to update its state (i.e. further train model weights) once a large enough batch of examples is selected. Our algorithm, IWeS, selects examples by importance sampling where the sampling probability assigned to each example is based on the entropy of models trained on previously selected batches. IWeS admits significant performance improvement compared to other subset selection algorithms for seven publicly available datasets. Additionally, it is competitive in an active learning setting, where the label information is not available at selection time. We also provide an initial theoretical analysis to support our importance weighting approach, proving generalization and sampling rate bounds. △ Less

Submitted 27 January, 2023; originally announced January 2023.

Comments: ICLR 2023

arXiv:2210.03822 [pdf, other]

Is margin all you need? An extensive empirical study of active learning on tabular data

Authors: Dara Bahri, Heinrich Jiang, Tal Schuster, Afshin Rostamizadeh

Abstract: Given a labeled training set and a collection of unlabeled data, the goal of active learning (AL) is to identify the best unlabeled points to label. In this comprehensive study, we analyze the performance of a variety of AL algorithms on deep neural networks trained on 69 real-world tabular classification datasets from the OpenML-CC18 benchmark. We consider different data regimes and the effect of… ▽ More Given a labeled training set and a collection of unlabeled data, the goal of active learning (AL) is to identify the best unlabeled points to label. In this comprehensive study, we analyze the performance of a variety of AL algorithms on deep neural networks trained on 69 real-world tabular classification datasets from the OpenML-CC18 benchmark. We consider different data regimes and the effect of self-supervised model pre-training. Surprisingly, we find that the classical margin sampling technique matches or outperforms all others, including current state-of-art, in a wide range of experimental settings. To researchers, we hope to encourage rigorous benchmarking against margin, and to practitioners facing tabular data labeling constraints that hyper-parameter-free margin may often be all they need. △ Less

Submitted 7 October, 2022; originally announced October 2022.

arXiv:2107.14263 [pdf, other]

Batch Active Learning at Scale

Authors: Gui Citovsky, Giulia DeSalvo, Claudio Gentile, Lazaros Karydas, Anand Rajagopalan, Afshin Rostamizadeh, Sanjiv Kumar

Abstract: The ability to train complex and highly effective models often requires an abundance of training data, which can easily become a bottleneck in cost, time, and computational resources. Batch active learning, which adaptively issues batched queries to a labeling oracle, is a common approach for addressing this problem. The practical benefits of batch sampling come with the downside of less adaptivit… ▽ More The ability to train complex and highly effective models often requires an abundance of training data, which can easily become a bottleneck in cost, time, and computational resources. Batch active learning, which adaptively issues batched queries to a labeling oracle, is a common approach for addressing this problem. The practical benefits of batch sampling come with the downside of less adaptivity and the risk of sampling redundant examples within a batch -- a risk that grows with the batch size. In this work, we analyze an efficient active learning algorithm, which focuses on the large batch setting. In particular, we show that our sampling method, which combines notions of uncertainty and diversity, easily scales to batch sizes (100K-1M) several orders of magnitude larger than used in previous studies and provides significant improvements in model training efficiency compared to recent baselines. Finally, we provide an initial theoretical analysis, proving label complexity guarantees for a related sampling method, which we show is approximately equivalent to our sampling method in specific settings. △ Less

Submitted 29 July, 2021; originally announced July 2021.

arXiv:2106.02654 [pdf, other]

Churn Reduction via Distillation

Authors: Heinrich Jiang, Harikrishna Narasimhan, Dara Bahri, Andrew Cotter, Afshin Rostamizadeh

Abstract: In real-world systems, models are frequently updated as more data becomes available, and in addition to achieving high accuracy, the goal is to also maintain a low difference in predictions compared to the base model (i.e. predictive "churn"). If model retraining results in vastly different behavior, then it could cause negative effects in downstream systems, especially if this churn can be avoide… ▽ More In real-world systems, models are frequently updated as more data becomes available, and in addition to achieving high accuracy, the goal is to also maintain a low difference in predictions compared to the base model (i.e. predictive "churn"). If model retraining results in vastly different behavior, then it could cause negative effects in downstream systems, especially if this churn can be avoided with limited impact on model accuracy. In this paper, we show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn. We then show that distillation performs strongly for low churn training against a number of recent baselines on a wide range of datasets and model architectures, including fully-connected networks, convolutional networks, and transformers. △ Less

Submitted 14 March, 2022; v1 submitted 4 June, 2021; originally announced June 2021.

Journal ref: ICLR 2022

arXiv:2106.02552 [pdf, other]

Active Covering

Authors: Heinrich Jiang, Afshin Rostamizadeh

Abstract: We analyze the problem of active covering, where the learner is given an unlabeled dataset and can sequentially label query examples. The objective is to label query all of the positive examples in the fewest number of total label queries. We show under standard non-parametric assumptions that a classical support estimator can be repurposed as an offline algorithm attaining an excess query cost of… ▽ More We analyze the problem of active covering, where the learner is given an unlabeled dataset and can sequentially label query examples. The objective is to label query all of the positive examples in the fewest number of total label queries. We show under standard non-parametric assumptions that a classical support estimator can be repurposed as an offline algorithm attaining an excess query cost of $\widetildeΘ(n^{D/(D+1)})$ compared to the optimal learner, where $n$ is the number of datapoints and $D$ is the dimension. We then provide a simple active learning method that attains an improved excess query cost of $\widetilde{O}(n^{(D-1)/D})$. Furthermore, the proposed algorithms only require access to the positive labeled examples, which in certain settings provides additional computational and privacy benefits. Finally, we show that the active learning method consistently outperforms offline methods as well as a variety of baselines on a wide range of benchmark image-based datasets. △ Less

Submitted 4 June, 2021; originally announced June 2021.

Comments: ICML 2021

arXiv:2010.05273 [pdf, other]

Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms

Authors: Maruan Al-Shedivat, Jennifer Gillenwater, Eric Xing, Afshin Rostamizadeh

Abstract: Federated learning is typically approached as an optimization problem, where the goal is to minimize a global loss function by distributing computation across client devices that possess local data and specify different parts of the global objective. We present an alternative perspective and formulate federated learning as a posterior inference problem, where the goal is to infer a global posterio… ▽ More Federated learning is typically approached as an optimization problem, where the goal is to minimize a global loss function by distributing computation across client devices that possess local data and specify different parts of the global objective. We present an alternative perspective and formulate federated learning as a posterior inference problem, where the goal is to infer a global posterior distribution by having client devices each infer the posterior of their local data. While exact inference is often intractable, this perspective provides a principled way to search for global optima in federated settings. Further, starting with the analysis of federated quadratic objectives, we develop a computation- and communication-efficient approximate posterior inference algorithm -- federated posterior averaging (FedPA). Our algorithm uses MCMC for approximate inference of local posteriors on the clients and efficiently communicates their statistics to the server, where the latter uses them to refine a global estimate of the posterior mode. Finally, we show that FedPA generalizes federated averaging (FedAvg), can similarly benefit from adaptive optimizers, and yields state-of-the-art results on four realistic and challenging benchmarks, converging faster, to better optima. △ Less

Submitted 29 January, 2021; v1 submitted 11 October, 2020; originally announced October 2020.

Comments: ICLR 2021. Code: https://github.com/alshedivat/fedpa

arXiv:2006.14616 [pdf, ps, other]

An Analysis of SVD for Deep Rotation Estimation

Authors: Jake Levinson, Carlos Esteves, Kefan Chen, Noah Snavely, Angjoo Kanazawa, Afshin Rostamizadeh, Ameesh Makadia

Abstract: Symmetric orthogonalization via SVD, and closely related procedures, are well-known techniques for projecting matrices onto $O(n)$ or $SO(n)$. These tools have long been used for applications in computer vision, for example optimal 3D alignment problems solved by orthogonal Procrustes, rotation averaging, or Essential matrix decomposition. Despite its utility in different settings, SVD orthogonali… ▽ More Symmetric orthogonalization via SVD, and closely related procedures, are well-known techniques for projecting matrices onto $O(n)$ or $SO(n)$. These tools have long been used for applications in computer vision, for example optimal 3D alignment problems solved by orthogonal Procrustes, rotation averaging, or Essential matrix decomposition. Despite its utility in different settings, SVD orthogonalization as a procedure for producing rotation matrices is typically overlooked in deep learning models, where the preferences tend toward classic representations like unit quaternions, Euler angles, and axis-angle, or more recently-introduced methods. Despite the importance of 3D rotations in computer vision and robotics, a single universally effective representation is still missing. Here, we explore the viability of SVD orthogonalization for 3D rotations in neural networks. We present a theoretical analysis that shows SVD is the natural choice for projecting onto the rotation group. Our extensive quantitative analysis shows simply replacing existing representations with the SVD orthogonalization procedure obtains state of the art performance in many deep learning applications covering both supervised and unsupervised training. △ Less

Submitted 25 June, 2020; originally announced June 2020.

arXiv:1912.00594 [pdf, other]

Combining MixMatch and Active Learning for Better Accuracy with Fewer Labels

Authors: Shuang Song, David Berthelot, Afshin Rostamizadeh

Abstract: We propose using active learning based techniques to further improve the state-of-the-art semi-supervised learning MixMatch algorithm. We provide a thorough empirical evaluation of several active-learning and baseline methods, which successfully demonstrate a significant improvement on the benchmark CIFAR-10, CIFAR-100, and SVHN datasets (as much as 1.5% in absolute accuracy). We also provide an e… ▽ More We propose using active learning based techniques to further improve the state-of-the-art semi-supervised learning MixMatch algorithm. We provide a thorough empirical evaluation of several active-learning and baseline methods, which successfully demonstrate a significant improvement on the benchmark CIFAR-10, CIFAR-100, and SVHN datasets (as much as 1.5% in absolute accuracy). We also provide an empirical analysis of the cost trade-off between incrementally gathering more labeled versus unlabeled data. This analysis can be used to measure the relative value of labeled/unlabeled data at different points of the learning curve, where we find that although the incremental value of labeled data can be as much as 20x that of unlabeled, it quickly diminishes to less than 3x once more than 2,000 labeled example are observed. Code can be found at https://github.com/google-research/mma. △ Less

Submitted 2 December, 2019; v1 submitted 2 December, 2019; originally announced December 2019.

arXiv:1907.00038 [pdf, ps, other]

The Practical Challenges of Active Learning: Lessons Learned from Live Experimentation

Authors: Jean-François Kagy, Tolga Kayadelen, Ji Ma, Afshin Rostamizadeh, Jana Strnadova

Abstract: We tested in a live setting the use of active learning for selecting text sentences for human annotations used in training a Thai segmentation machine learning model. In our study, two concurrent annotated samples were constructed, one through random sampling of sentences from a text corpus, and the other through model-based scoring and ranking of sentences from the same corpus. In the course of t… ▽ More We tested in a live setting the use of active learning for selecting text sentences for human annotations used in training a Thai segmentation machine learning model. In our study, two concurrent annotated samples were constructed, one through random sampling of sentences from a text corpus, and the other through model-based scoring and ranking of sentences from the same corpus. In the course of the experiment, we observed the effect of significant changes to the learning environment which are likely to occur in real-world learning tasks. We describe how our active learning strategy interacted with these events and discuss other practical challenges encountered in using active learning in the live setting. △ Less

Submitted 28 June, 2019; originally announced July 2019.

Comments: Presented at 2019 ICML Workshop on Human in the Loop Learning (HILL 2019), Long Beach, USA

arXiv:1904.13389 [pdf, other]

Categorical Feature Compression via Submodular Optimization

Authors: MohammadHossein Bateni, Lin Chen, Hossein Esfandiari, Thomas Fu, Vahab S. Mirrokni, Afshin Rostamizadeh

Abstract: In the era of big data, learning from categorical features with very large vocabularies (e.g., 28 million for the Criteo click prediction dataset) has become a practical challenge for machine learning researchers and practitioners. We design a highly-scalable vocabulary compression algorithm that seeks to maximize the mutual information between the compressed categorical feature and the target bin… ▽ More In the era of big data, learning from categorical features with very large vocabularies (e.g., 28 million for the Criteo click prediction dataset) has become a practical challenge for machine learning researchers and practitioners. We design a highly-scalable vocabulary compression algorithm that seeks to maximize the mutual information between the compressed categorical feature and the target binary labels and we furthermore show that its solution is guaranteed to be within a $1-1/e \approx 63\%$ factor of the global optimal solution. To achieve this, we introduce a novel re-parametrization of the mutual information objective, which we prove is submodular, and design a data structure to query the submodular function in amortized $O(\log n )$ time (where $n$ is the input vocabulary size). Our complete algorithm is shown to operate in $O(n \log n )$ time. Additionally, we design a distributed implementation in which the query data structure is decomposed across $O(k)$ machines such that each machine only requires $O(\frac n k)$ space, while still preserving the approximation guarantee and using only logarithmic rounds of computation. We also provide analysis of simple alternative heuristic compression methods to demonstrate they cannot achieve any approximation guarantee. Using the large-scale Criteo learning task, we demonstrate better performance in retaining mutual information and also verify competitive learning performance compared to other baseline methods. △ Less

Submitted 30 April, 2019; originally announced April 2019.

Comments: Accepted to ICML 2019. Authors are listed in alphabetical order

arXiv:1904.03257 [pdf, ps, other]

MLSys: The New Frontier of Machine Learning Systems

Authors: Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood , et al. (44 additional authors not shown)

Abstract: Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a ne… ▽ More Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, MLSys, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two. △ Less

Submitted 1 December, 2019; v1 submitted 29 March, 2019; originally announced April 2019.

arXiv:1810.05934 [pdf, other]

A System for Massively Parallel Hyperparameter Tuning

Authors: Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Moritz Hardt, Benjamin Recht, Ameet Talwalkar

Abstract: Modern learning models are characterized by large hyperparameter spaces and long training times. These properties, coupled with the rise of parallel computing and the growing demand to productionize machine learning workloads, motivate the need to develop mature hyperparameter optimization functionality in distributed computing settings. We address this challenge by first introducing a simple and… ▽ More Modern learning models are characterized by large hyperparameter spaces and long training times. These properties, coupled with the rise of parallel computing and the growing demand to productionize machine learning workloads, motivate the need to develop mature hyperparameter optimization functionality in distributed computing settings. We address this challenge by first introducing a simple and robust hyperparameter optimization algorithm called ASHA, which exploits parallelism and aggressive early-stop** to tackle large-scale hyperparameter optimization problems. Our extensive empirical results show that ASHA outperforms existing state-of-the-art hyperparameter optimization methods; scales linearly with the number of workers in distributed settings; and is suitable for massive parallelism, as demonstrated on a task with 500 workers. We then describe several design decisions we encountered, along with our associated solutions, when integrating ASHA in Determined AI's end-to-end production-quality machine learning system that offers hyperparameter tuning as a service. △ Less

Submitted 15 March, 2020; v1 submitted 13 October, 2018; originally announced October 2018.

Comments: v2: Corrected typo in Algorithm 1 v3: Added comparison to BOHB and parallel version of synchronous SHA. Add PBT to experiment in Section 4.3.1 v4: Added acknowledgements and slight edit to related work

Journal ref: Conference on Machine Learning and Systems 2020

arXiv:1806.10175 [pdf, other]

Learning a Compressed Sensing Measurement Matrix via Gradient Unrolling

Authors: Shanshan Wu, Alexandros G. Dimakis, Sujay Sanghavi, Felix X. Yu, Daniel Holtmann-Rice, Dmitry Storcheus, Afshin Rostamizadeh, Sanjiv Kumar

Abstract: Linear encoding of sparse vectors is widely popular, but is commonly data-independent -- missing any possible extra (but a priori unknown) structure beyond sparsity. In this paper we present a new method to learn linear encoders that adapt to data, while still performing well with the widely used $\ell_1$ decoder. The convex $\ell_1$ decoder prevents gradient propagation as needed in standard grad… ▽ More Linear encoding of sparse vectors is widely popular, but is commonly data-independent -- missing any possible extra (but a priori unknown) structure beyond sparsity. In this paper we present a new method to learn linear encoders that adapt to data, while still performing well with the widely used $\ell_1$ decoder. The convex $\ell_1$ decoder prevents gradient propagation as needed in standard gradient-based training. Our method is based on the insight that unrolling the convex decoder into $T$ projected subgradient steps can address this issue. Our method can be seen as a data-driven way to learn a compressed sensing measurement matrix. We compare the empirical performance of 10 algorithms over 6 sparse datasets (3 synthetic and 3 real). Our experiments show that there is indeed additional structure beyond sparsity in the real datasets; our method is able to discover it and exploit it to create excellent reconstructions with fewer measurements (by a factor of 1.1-3x) compared to the previous state-of-the-art methods. We illustrate an application of our method in learning label embeddings for extreme multi-label classification, and empirically show that our method is able to match or outperform the precision scores of SLEEC, which is one of the state-of-the-art embedding-based approaches. △ Less

Submitted 2 July, 2019; v1 submitted 26 June, 2018; originally announced June 2018.

Comments: 17 pages, 7 tables, 8 figures, published in ICML 2019; part of this work was done while Shanshan was an intern at Google Research, New York

arXiv:1605.08795 [pdf, other]

Greedy Column Subset Selection: New Bounds and Distributed Algorithms

Authors: Jason Altschuler, Aditya Bhaskara, Gang Fu, Vahab Mirrokni, Afshin Rostamizadeh, Morteza Zadimoghaddam

Abstract: The problem of column subset selection has recently attracted a large body of research, with feature selection serving as one obvious and important application. Among the techniques that have been applied to solve this problem, the greedy algorithm has been shown to be quite effective in practice. However, theoretical guarantees on its performance have not been explored thoroughly, especially in a… ▽ More The problem of column subset selection has recently attracted a large body of research, with feature selection serving as one obvious and important application. Among the techniques that have been applied to solve this problem, the greedy algorithm has been shown to be quite effective in practice. However, theoretical guarantees on its performance have not been explored thoroughly, especially in a distributed setting. In this paper, we study the greedy algorithm for the column subset selection problem from a theoretical and empirical perspective and show its effectiveness in a distributed setting. In particular, we provide an improved approximation guarantee for the greedy algorithm which we show is tight up to a constant factor, and present the first distributed implementation with provable approximation factors. We use the idea of randomized composable core-sets, developed recently in the context of submodular maximization. Finally, we validate the effectiveness of this distributed algorithm via an empirical study. △ Less

Submitted 11 June, 2016; v1 submitted 27 May, 2016; originally announced May 2016.

Comments: to appear in International Conference on Machine Learning (ICML) 2016

Journal ref: Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:2539-2548, 2016

arXiv:1603.06560 [pdf, other]

Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization

Authors: Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, Ameet Talwalkar

Abstract: Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While recent approaches use Bayesian optimization to adaptively select configurations, we focus on speeding up random search through adaptive resource allocation and early-stop**. We formulate hyperparameter optimization as a pure-exploration non-stochastic infinite-armed bandit problem wh… ▽ More Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While recent approaches use Bayesian optimization to adaptively select configurations, we focus on speeding up random search through adaptive resource allocation and early-stop**. We formulate hyperparameter optimization as a pure-exploration non-stochastic infinite-armed bandit problem where a predefined resource like iterations, data samples, or features is allocated to randomly sampled configurations. We introduce a novel algorithm, Hyperband, for this framework and analyze its theoretical properties, providing several desirable guarantees. Furthermore, we compare Hyperband with popular Bayesian optimization methods on a suite of hyperparameter optimization problems. We observe that Hyperband can provide over an order-of-magnitude speedup over our competitor set on a variety of deep-learning and kernel-based learning problems. △ Less

Submitted 18 June, 2018; v1 submitted 21 March, 2016; originally announced March 2016.

Comments: Changes: - Updated to JMLR version

Journal ref: Journal of Machine Learning Research 18 (2018) 1-52

arXiv:1509.08880 [pdf, ps, other]

Foundations of Coupled Nonlinear Dimensionality Reduction

Authors: Mehryar Mohri, Afshin Rostamizadeh, Dmitry Storcheus

Abstract: In this paper we introduce and analyze the learning scenario of \emph{coupled nonlinear dimensionality reduction}, which combines two major steps of machine learning pipeline: projection onto a manifold and subsequent supervised learning. First, we present new generalization bounds for this scenario and, second, we introduce an algorithm that follows from these bounds. The generalization error bou… ▽ More In this paper we introduce and analyze the learning scenario of \emph{coupled nonlinear dimensionality reduction}, which combines two major steps of machine learning pipeline: projection onto a manifold and subsequent supervised learning. First, we present new generalization bounds for this scenario and, second, we introduce an algorithm that follows from these bounds. The generalization error bound is based on a careful analysis of the empirical Rademacher complexity of the relevant hypothesis set. In particular, we show an upper bound on the Rademacher complexity that is in $\widetilde O(\sqrt{Λ_{(r)}/m})$, where $m$ is the sample size and $Λ_{(r)}$ the upper bound on the Ky-Fan $r$-norm of the associated kernel matrix. We give both upper and lower bound guarantees in terms of that Ky-Fan $r$-norm, which strongly justifies the definition of our hypothesis set. To the best of our knowledge, these are the first learning guarantees for the problem of coupled dimensionality reduction. Our analysis and learning guarantees further apply to several special cases, such as that of using a fixed kernel with supervised dimensionality reduction or that of unsupervised learning of a kernel for dimensionality reduction followed by a supervised learning algorithm. Based on theoretical analysis, we suggest a structural risk minimization algorithm consisting of the coupled fitting of a low dimensional manifold and a separation function on that manifold. △ Less

Submitted 25 November, 2015; v1 submitted 29 September, 2015; originally announced September 2015.

Comments: 12 pages, 3 figures, authors in alphabetical order

arXiv:1504.01117 [pdf, ps, other]

An $\tilde{O}(\frac{1}{\sqrt{T}})$-error online algorithm for retrieving heavily perturbated statistical databases in the low-dimensional querying mode

Authors: Krzysztof Choromanski, Afshin Rostamizadeh, Umar Syed

Abstract: We give the first $\tilde{O}(\frac{1}{\sqrt{T}})$-error online algorithm for reconstructing noisy statistical databases, where $T$ is the number of (online) sample queries received. The algorithm, which requires only $O(\log T)$ memory, aims to learn a hidden database-vector $w^{*} \in \mathbb{R}^{D}$ in order to accurately answer a stream of queries regarding the hidden database, which arrive in… ▽ More We give the first $\tilde{O}(\frac{1}{\sqrt{T}})$-error online algorithm for reconstructing noisy statistical databases, where $T$ is the number of (online) sample queries received. The algorithm, which requires only $O(\log T)$ memory, aims to learn a hidden database-vector $w^{*} \in \mathbb{R}^{D}$ in order to accurately answer a stream of queries regarding the hidden database, which arrive in an online fashion from some unknown distribution $\mathcal{D}$. We assume the distribution $\mathcal{D}$ is defined on the neighborhood of a low-dimensional manifold. The presented algorithm runs in $O(dD)$-time per query, where $d$ is the dimensionality of the query-space. Contrary to the classical setting, there is no separate training set that is used by the algorithm to learn the database --- the stream on which the algorithm will be evaluated must also be used to learn the database-vector. The algorithm only has access to a binary oracle $\mathcal{O}$ that answers whether a particular linear function of the database-vector plus random noise is larger than a threshold, which is specified by the algorithm. We note that we allow for a significant $O(D)$ amount of noise to be added while other works focused on the low noise $o(\sqrt{D})$-setting. For a stream of $T$ queries our algorithm achieves an average error $\tilde{O}(\frac{1}{\sqrt{T}})$ by filtering out random noise, adapting threshold values given to the oracle based on its previous answers and, as a consequence, recovering with high precision a projection of a database-vector $w^{*}$ onto the manifold defining the query-space. △ Less

Submitted 5 April, 2015; originally announced April 2015.

arXiv:1408.2044 [pdf]

Matrix Coherence and the Nystrom Method

Authors: Ameet Talwalkar, Afshin Rostamizadeh

Abstract: The Nystrom method is an efficient technique used to speed up large-scale learning applications by generating low-rank approximations. Crucial to the performance of this technique is the assumption that a matrix can be well approximated by working exclusively with a subset of its columns. In this work we relate this assumption to the concept of matrix coherence, connecting coherence to the perform… ▽ More The Nystrom method is an efficient technique used to speed up large-scale learning applications by generating low-rank approximations. Crucial to the performance of this technique is the assumption that a matrix can be well approximated by working exclusively with a subset of its columns. In this work we relate this assumption to the concept of matrix coherence, connecting coherence to the performance of the Nystrom method. Making use of related work in the compressed sensing and the matrix completion literature, we derive novel coherence-based bounds for the Nystrom method in the low-rank setting. We then present empirical results that corroborate these theoretical bounds. Finally, we present more general empirical results for the full-rank setting that convincingly demonstrate the ability of matrix coherence to measure the degree to which information can be extracted from a subset of columns. △ Less

Submitted 9 August, 2014; originally announced August 2014.

Comments: Appears in Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI2010)

Report number: UAI-P-2010-PG-572-579

arXiv:1311.6838 [pdf, ps, other]

Learning Prices for Repeated Auctions with Strategic Buyers

Authors: Kareem Amin, Afshin Rostamizadeh, Umar Syed

Abstract: Inspired by real-time ad exchanges for online display advertising, we consider the problem of inferring a buyer's value distribution for a good when the buyer is repeatedly interacting with a seller through a posted-price mechanism. We model the buyer as a strategic agent, whose goal is to maximize her long-term surplus, and we are interested in mechanisms that maximize the seller's long-term reve… ▽ More Inspired by real-time ad exchanges for online display advertising, we consider the problem of inferring a buyer's value distribution for a good when the buyer is repeatedly interacting with a seller through a posted-price mechanism. We model the buyer as a strategic agent, whose goal is to maximize her long-term surplus, and we are interested in mechanisms that maximize the seller's long-term revenue. We define the natural notion of strategic regret --- the lost revenue as measured against a truthful (non-strategic) buyer. We present seller algorithms that are no-(strategic)-regret when the buyer discounts her future surplus --- i.e. the buyer prefers showing advertisements to users sooner rather than later. We also give a lower bound on strategic regret that increases as the buyer's discounting weakens and shows, in particular, that any seller algorithm will suffer linear strategic regret if there is no discounting. △ Less

Submitted 26 November, 2013; originally announced November 2013.

Comments: Neural Information Processing Systems (NIPS 2013)

arXiv:1305.0208 [pdf, ps, other]

Perceptron Mistake Bounds

Authors: Mehryar Mohri, Afshin Rostamizadeh

Abstract: We present a brief survey of existing mistake bounds and introduce novel bounds for the Perceptron or the kernel Perceptron algorithm. Our novel bounds generalize beyond standard margin-loss type bounds, allow for any convex and Lipschitz loss function, and admit a very simple proof. We present a brief survey of existing mistake bounds and introduce novel bounds for the Perceptron or the kernel Perceptron algorithm. Our novel bounds generalize beyond standard margin-loss type bounds, allow for any convex and Lipschitz loss function, and admit a very simple proof. △ Less

Submitted 22 July, 2013; v1 submitted 1 May, 2013; originally announced May 2013.

arXiv:1205.2653 [pdf]

L2 Regularization for Learning Kernels

Authors: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

Abstract: The choice of the kernel is critical to the success of many learning algorithms but it is typically left to the user. Instead, the training data can be used to learn the kernel by selecting it out of a given family, such as that of non-negative linear combinations of p base kernels, constrained by a trace or L1 regularization. This paper studies the problem of learning kernels with the same family… ▽ More The choice of the kernel is critical to the success of many learning algorithms but it is typically left to the user. Instead, the training data can be used to learn the kernel by selecting it out of a given family, such as that of non-negative linear combinations of p base kernels, constrained by a trace or L1 regularization. This paper studies the problem of learning kernels with the same family of kernels but with an L2 regularization instead, and for regression problems. We analyze the problem of learning kernels with ridge regression. We derive the form of the solution of the optimization problem and give an efficient iterative algorithm for computing that solution. We present a novel theoretical analysis of the problem based on stability and give learning bounds for orthogonal kernels that contain only an additive term O(pp/m) when compared to the standard kernel ridge regression stability bound. We also report the results of experiments indicating that L1 regularization can lead to modest improvements for a small number of kernels, but to performance degradations in larger-scale cases. In contrast, L2 regularization never degrades performance and in fact achieves significant improvements with a large number of kernels. △ Less

Submitted 9 May, 2012; originally announced May 2012.

Comments: Appears in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI2009)

Report number: UAI-P-2009-PG-109-116

arXiv:1205.2628 [pdf]

Multiple Source Adaptation and the Renyi Divergence

Authors: Yishay Mansour, Mehryar Mohri, Afshin Rostamizadeh

Abstract: This paper presents a novel theoretical study of the general problem of multiple source adaptation using the notion of Renyi divergence. Our results build on our previous work [12], but significantly broaden the scope of that work in several directions. We extend previous multiple source loss guarantees based on distribution weighted combinations to arbitrary target distributions P, not necessaril… ▽ More This paper presents a novel theoretical study of the general problem of multiple source adaptation using the notion of Renyi divergence. Our results build on our previous work [12], but significantly broaden the scope of that work in several directions. We extend previous multiple source loss guarantees based on distribution weighted combinations to arbitrary target distributions P, not necessarily mixtures of the source distributions, analyze both known and unknown target distribution cases, and prove a lower bound. We further extend our bounds to deal with the case where the learner receives an approximate distribution for each source instead of the exact one, and show that similar loss guarantees can be achieved depending on the divergence between the approximate and true distributions. We also analyze the case where the labeling functions of the source domains are somewhat different. Finally, we report the results of experiments with both an artificial data set and a sentiment analysis task, showing the performance benefits of the distribution weighted combinations and the quality of our bounds based on the Renyi divergence. △ Less

Submitted 9 May, 2012; originally announced May 2012.

Comments: Appears in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI2009)

Report number: UAI-P-2009-PG-367-374

arXiv:1203.0550 [pdf, other]

Algorithms for Learning Kernels Based on Centered Alignment

Authors: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

Abstract: This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for learning kernels based on convex combinations of base kernels in both classification and regression. O… ▽ More This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for learning kernels based on convex combinations of base kernels in both classification and regression. Our algorithms are based on the notion of centered alignment which is used as a similarity measure between kernels or kernel matrices. We present a number of novel algorithmic, theoretical, and empirical results for learning kernels based on our notion of centered alignment. In particular, we describe efficient algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP and discuss a one-stage algorithm for learning both a kernel and a hypothesis based on that kernel using an alignment-based regularization. Our theoretical results include a novel concentration bound for centered alignment between kernel matrices, the proof of the existence of effective predictors for kernels with high alignment, both for classification and for regression, and the proof of stability-based generalization bounds for a broad family of algorithms for learning kernels based on centered alignment. We also report the results of experiments with our centered alignment-based algorithms in both classification and regression. △ Less

Submitted 29 April, 2024; v1 submitted 2 March, 2012; originally announced March 2012.

Journal ref: Journal of Machine Learning Research 13 (2012) 795-828

arXiv:1202.3712 [pdf]

Ensembles of Kernel Predictors

Authors: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

Abstract: This paper examines the problem of learning with a finite and possibly large set of p base kernels. It presents a theoretical and empirical analysis of an approach addressing this problem based on ensembles of kernel predictors. This includes novel theoretical guarantees based on the Rademacher complexity of the corresponding hypothesis sets, the introduction and analysis of a learning algorithm b… ▽ More This paper examines the problem of learning with a finite and possibly large set of p base kernels. It presents a theoretical and empirical analysis of an approach addressing this problem based on ensembles of kernel predictors. This includes novel theoretical guarantees based on the Rademacher complexity of the corresponding hypothesis sets, the introduction and analysis of a learning algorithm based on these hypothesis sets, and a series of experiments using ensembles of kernel predictors with several data sets. Both convex combinations of kernel-based hypotheses and more general Lq-regularized nonnegative combinations are analyzed. These theoretical, algorithmic, and empirical results are compared with those achieved by using learning kernel techniques, which can be viewed as another approach for solving the same problem. △ Less

Submitted 14 February, 2012; originally announced February 2012.

Report number: UAI-P-2011-PG-145-152

arXiv:1104.0729 [pdf, ps, other]

Online and Batch Learning Algorithms for Data with Missing Features

Authors: Afshin Rostamizadeh, Alekh Agarwal, Peter Bartlett

Abstract: We introduce new online and batch algorithms that are robust to data with missing features, a situation that arises in many practical applications. In the online setup, we allow for the comparison hypothesis to change as a function of the subset of features that is observed on any given round, extending the standard setting where the comparison hypothesis is fixed throughout. In the batch setup, w… ▽ More We introduce new online and batch algorithms that are robust to data with missing features, a situation that arises in many practical applications. In the online setup, we allow for the comparison hypothesis to change as a function of the subset of features that is observed on any given round, extending the standard setting where the comparison hypothesis is fixed throughout. In the batch setup, we present a convex relation of a non-convex problem to jointly estimate an imputation function, used to fill in the values of missing features, along with the classification hypothesis. We prove regret bounds in the online setting and Rademacher complexity bounds for the batch i.i.d. setting. The algorithms are tested on several UCI datasets, showing superior performance over baselines. △ Less

Submitted 16 June, 2011; v1 submitted 5 April, 2011; originally announced April 2011.

Journal ref: 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011)

arXiv:1004.2008 [pdf, ps, other]

Matrix Coherence and the Nystrom Method

Authors: Ameet Talwalkar, Afshin Rostamizadeh

Abstract: The Nystrom method is an efficient technique to speed up large-scale learning applications by generating low-rank approximations. Crucial to the performance of this technique is the assumption that a matrix can be well approximated by working exclusively with a subset of its columns. In this work we relate this assumption to the concept of matrix coherence and connect matrix coherence to the per… ▽ More The Nystrom method is an efficient technique to speed up large-scale learning applications by generating low-rank approximations. Crucial to the performance of this technique is the assumption that a matrix can be well approximated by working exclusively with a subset of its columns. In this work we relate this assumption to the concept of matrix coherence and connect matrix coherence to the performance of the Nystrom method. Making use of related work in the compressed sensing and the matrix completion literature, we derive novel coherence-based bounds for the Nystrom method in the low-rank setting. We then present empirical results that corroborate these theoretical bounds. Finally, we present more general empirical results for the full-rank setting that convincingly demonstrate the ability of matrix coherence to measure the degree to which information can be extracted from a subset of columns. △ Less

Submitted 12 April, 2010; originally announced April 2010.

arXiv:0912.3309 [pdf, ps, other]

New Generalization Bounds for Learning Kernels

Authors: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

Abstract: This paper presents several novel generalization bounds for the problem of learning kernels based on the analysis of the Rademacher complexity of the corresponding hypothesis sets. Our bound for learning kernels with a convex combination of p base kernels has only a log(p) dependency on the number of kernels, p, which is considerably more favorable than the previous best bound given for the same… ▽ More This paper presents several novel generalization bounds for the problem of learning kernels based on the analysis of the Rademacher complexity of the corresponding hypothesis sets. Our bound for learning kernels with a convex combination of p base kernels has only a log(p) dependency on the number of kernels, p, which is considerably more favorable than the previous best bound given for the same problem. We also give a novel bound for learning with a linear combination of p base kernels with an L_2 regularization whose dependency on p is only in p^{1/4}. △ Less

Submitted 16 December, 2009; originally announced December 2009.

arXiv:0902.3430 [pdf, ps, other]

Domain Adaptation: Learning Bounds and Algorithms

Authors: Yishay Mansour, Mehryar Mohri, Afshin Rostamizadeh

Abstract: This paper addresses the general problem of domain adaptation which arises in a variety of applications where the distribution of the labeled sample available somewhat differs from that of the test data. Building on previous work by Ben-David et al. (2007), we introduce a novel distance between distributions, discrepancy distance, that is tailored to adaptation problems with arbitrary loss functio… ▽ More This paper addresses the general problem of domain adaptation which arises in a variety of applications where the distribution of the labeled sample available somewhat differs from that of the test data. Building on previous work by Ben-David et al. (2007), we introduce a novel distance between distributions, discrepancy distance, that is tailored to adaptation problems with arbitrary loss functions. We give Rademacher complexity bounds for estimating the discrepancy distance from finite samples for different loss functions. Using this distance, we derive novel generalization bounds for domain adaptation for a wide family of loss functions. We also present a series of novel adaptation bounds for large classes of regularization-based algorithms, including support vector machines and kernel ridge regression based on the empirical discrepancy. This motivates our analysis of the problem of minimizing the empirical discrepancy for various loss functions for which we also give novel algorithms. We report the results of preliminary experiments that demonstrate the benefits of our discrepancy minimization algorithms for domain adaptation. △ Less

Submitted 30 November, 2023; v1 submitted 19 February, 2009; originally announced February 2009.

Comments: 12 pages, 4 figures

arXiv:0811.1629 [pdf, ps, other]

Stability Bound for Stationary Phi-mixing and Beta-mixing Processes

Authors: Mehryar Mohri, Afshin Rostamizadeh

Abstract: Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stabi… ▽ More Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary phi-mixing or beta-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary phi-mixing and beta-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability-bounds to non-i.i.d. scenarios. We also illustrate the application of our phi-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios. △ Less

Submitted 11 November, 2008; originally announced November 2008.

Comments: 23 pages, 1 figure, submitted to JMLR

arXiv:0805.2775 [pdf, ps, other]

Sample Selection Bias Correction Theory

Authors: Corinna Cortes, Mehryar Mohri, Michael Riley, Afshin Rostamizadeh

Abstract: This paper presents a theoretical analysis of sample selection bias correction. The sample bias correction technique commonly used in machine learning consists of reweighting the cost of an error on each training point of a biased sample to more closely reflect the unbiased distribution. This relies on weights derived by various estimation techniques based on finite samples. We analyze the effec… ▽ More This paper presents a theoretical analysis of sample selection bias correction. The sample bias correction technique commonly used in machine learning consists of reweighting the cost of an error on each training point of a biased sample to more closely reflect the unbiased distribution. This relies on weights derived by various estimation techniques based on finite samples. We analyze the effect of an error in that estimation on the accuracy of the hypothesis returned by the learning algorithm for two estimation techniques: a cluster-based estimation technique and kernel mean matching. We also report the results of sample bias correction experiments with several data sets using these techniques. Our analysis is based on the novel concept of distributional stability which generalizes the existing concept of point-based stability. Much of our work and proof techniques can be used to analyze other importance weighting techniques and their effect on accuracy when using a distributionally stable algorithm. △ Less

Submitted 18 May, 2008; originally announced May 2008.

Comments: 16 pages

Showing 1–32 of 32 results for author: Rostamizadeh, A