-
Amazon SageMaker Model Monitor: A System for Real-Time Insights into Deployed Machine Learning Models
Authors:
David Nigenda,
Zohar Karnin,
Muhammad Bilal Zafar,
Raghu Ramesha,
Alan Tan,
Michele Donini,
Krishnaram Kenthapadi
Abstract:
With the increasing adoption of machine learning (ML) models and systems in high-stakes settings across different industries, guaranteeing a model's performance after deployment has become crucial. Monitoring models in production is a critical aspect of ensuring their continued performance and reliability. We present Amazon SageMaker Model Monitor, a fully managed service that continuously monitor…
▽ More
With the increasing adoption of machine learning (ML) models and systems in high-stakes settings across different industries, guaranteeing a model's performance after deployment has become crucial. Monitoring models in production is a critical aspect of ensuring their continued performance and reliability. We present Amazon SageMaker Model Monitor, a fully managed service that continuously monitors the quality of machine learning models hosted on Amazon SageMaker. Our system automatically detects data, concept, bias, and feature attribution drift in models in real-time and provides alerts so that model owners can take corrective actions and thereby maintain high quality models. We describe the key requirements obtained from customers, system design and architecture, and methodology for detecting different types of drift. Further, we provide quantitative evaluations followed by use cases, insights, and lessons learned from more than two years of production deployment.
△ Less
Submitted 5 August, 2022; v1 submitted 26 November, 2021;
originally announced November 2021.
-
Practical and sample efficient zero-shot HPO
Authors:
Fela Winkelmolen,
Nikita Ivkin,
H. Furkan Bozkurt,
Zohar Karnin
Abstract:
Zero-shot hyperparameter optimization (HPO) is a simple yet effective use of transfer learning for constructing a small list of hyperparameter (HP) configurations that complement each other. That is to say, for any given dataset, at least one of them is expected to perform well. Current techniques for obtaining this list are computationally expensive as they rely on running training jobs on a dive…
▽ More
Zero-shot hyperparameter optimization (HPO) is a simple yet effective use of transfer learning for constructing a small list of hyperparameter (HP) configurations that complement each other. That is to say, for any given dataset, at least one of them is expected to perform well. Current techniques for obtaining this list are computationally expensive as they rely on running training jobs on a diverse collection of datasets and a large collection of randomly drawn HPs. This cost is especially problematic in environments where the space of HPs is regularly changing due to new algorithm versions, or changing architectures of deep networks. We provide an overview of available approaches and introduce two novel techniques to handle the problem. The first is based on a surrogate model and adaptively chooses pairs of dataset, configuration to query. The second, for settings where finding, tuning and testing a surrogate model is problematic, is a multi-fidelity technique combining HyperBand with submodular optimization. We benchmark our methods experimentally on five tasks (XGBoost, LightGBM, CatBoost, MLP and AutoML) and show significant improvement in accuracy compared to standard zero-shot HPO with the same training budget. In addition to contributing new algorithms, we provide an extensive study of the zero-shot HPO technique resulting in (1) default hyper-parameters for popular algorithms that would benefit the community using them, (2) massive lookup tables to further the research of hyper-parameter tuning.
△ Less
Submitted 27 July, 2020;
originally announced July 2020.
-
An Empirical Process Approach to the Union Bound: Practical Algorithms for Combinatorial and Linear Bandits
Authors:
Julian Katz-Samuels,
Lalit Jain,
Zohar Karnin,
Kevin Jamieson
Abstract:
This paper proposes near-optimal algorithms for the pure-exploration linear bandit problem in the fixed confidence and fixed budget settings. Leveraging ideas from the theory of suprema of empirical processes, we provide an algorithm whose sample complexity scales with the geometry of the instance and avoids an explicit union bound over the number of arms. Unlike previous approaches which sample b…
▽ More
This paper proposes near-optimal algorithms for the pure-exploration linear bandit problem in the fixed confidence and fixed budget settings. Leveraging ideas from the theory of suprema of empirical processes, we provide an algorithm whose sample complexity scales with the geometry of the instance and avoids an explicit union bound over the number of arms. Unlike previous approaches which sample based on minimizing a worst-case variance (e.g. G-optimal design), we define an experimental design objective based on the Gaussian-width of the underlying arm set. We provide a novel lower bound in terms of this objective that highlights its fundamental role in the sample complexity. The sample complexity of our fixed confidence algorithm matches this lower bound, and in addition is computationally efficient for combinatorial classes, e.g. shortest-path, matchings and matroids, where the arm sets can be exponentially large in the dimension. Finally, we propose the first algorithm for linear bandits in the the fixed budget setting. Its guarantee matches our lower bound up to logarithmic factors.
△ Less
Submitted 20 June, 2020;
originally announced June 2020.
-
PruneNet: Channel Pruning via Global Importance
Authors:
Ashish Khetan,
Zohar Karnin
Abstract:
Channel pruning is one of the predominant approaches for accelerating deep neural networks. Most existing pruning methods either train from scratch with a sparsity inducing term such as group lasso, or prune redundant channels in a pretrained network and then fine tune the network. Both strategies suffer from some limitations: the use of group lasso is computationally expensive, difficult to conve…
▽ More
Channel pruning is one of the predominant approaches for accelerating deep neural networks. Most existing pruning methods either train from scratch with a sparsity inducing term such as group lasso, or prune redundant channels in a pretrained network and then fine tune the network. Both strategies suffer from some limitations: the use of group lasso is computationally expensive, difficult to converge and often suffers from worse behavior due to the regularization bias. The methods that start with a pretrained network either prune channels uniformly across the layers or prune channels based on the basic statistics of the network parameters. These approaches either ignore the fact that some CNN layers are more redundant than others or fail to adequately identify the level of redundancy in different layers. In this work, we investigate a simple-yet-effective method for pruning channels based on a computationally light-weight yet effective data driven optimization step that discovers the necessary width per layer. Experiments conducted on ILSVRC-$12$ confirm effectiveness of our approach. With non-uniform pruning across the layers on ResNet-$50$, we are able to match the FLOP reduction of state-of-the-art channel pruning results while achieving a $0.98\%$ higher accuracy. Further, we show that our pruned ResNet-$50$ network outperforms ResNet-$34$ and ResNet-$18$ networks, and that our pruned ResNet-$101$ outperforms ResNet-$50$.
△ Less
Submitted 22 May, 2020;
originally announced May 2020.
-
Asymmetric Random Projections
Authors:
Nick Ryder,
Zohar Karnin,
Edo Liberty
Abstract:
Random projections (RP) are a popular tool for reducing dimensionality while preserving local geometry. In many applications the data set to be projected is given to us in advance, yet the current RP techniques do not make use of information about the data. In this paper, we provide a computationally light way to extract statistics from the data that allows designing a data dependent RP with super…
▽ More
Random projections (RP) are a popular tool for reducing dimensionality while preserving local geometry. In many applications the data set to be projected is given to us in advance, yet the current RP techniques do not make use of information about the data. In this paper, we provide a computationally light way to extract statistics from the data that allows designing a data dependent RP with superior performance compared to data-oblivious RP. We tackle scenarios such as matrix multiplication and linear regression/classification in which we wish to estimate inner products between pairs of vectors from two possibly different sources. Our technique takes advantage of the difference between the sources and is provably superior to oblivious RPs. Additionally, we provide extensive experiments comparing RPs with our approach showing significant performance lifts in fast matrix multiplication, regression and classification problems.
△ Less
Submitted 22 June, 2019;
originally announced June 2019.
-
Discrepancy, Coresets, and Sketches in Machine Learning
Authors:
Zohar Karnin,
Edo Liberty
Abstract:
This paper defines the notion of class discrepancy for families of functions. It shows that low discrepancy classes admit small offline and streaming coresets. We provide general techniques for bounding the class discrepancy of machine learning problems. As corollaries of the general technique we bound the discrepancy (and therefore coreset complexity) of logistic regression, sigmoid activation lo…
▽ More
This paper defines the notion of class discrepancy for families of functions. It shows that low discrepancy classes admit small offline and streaming coresets. We provide general techniques for bounding the class discrepancy of machine learning problems. As corollaries of the general technique we bound the discrepancy (and therefore coreset complexity) of logistic regression, sigmoid activation loss, matrix covariance, kernel density and any analytic function of the dot product or the squared distance. Our results prove the existence of epsilon-approximation O(sqrt{d}/epsilon) sized coresets for the above problems. This resolves the long-standing open problem regarding the coreset complexity of Gaussian kernel density estimation. We provide two more related but independent results. First, an exponential improvement of the widely used merge-and-reduce trick which gives improved streaming sketches for any low discrepancy problem. Second, an extremely simple deterministic algorithm for finding low discrepancy sequences (and therefore coresets) for any positive semi-definite kernel. This paper establishes some explicit connections between class discrepancy, coreset complexity, learnability, and streaming algorithms.
△ Less
Submitted 11 June, 2019;
originally announced June 2019.
-
DARC: Differentiable ARchitecture Compression
Authors:
Shashank Singh,
Ashish Khetan,
Zohar Karnin
Abstract:
In many learning situations, resources at inference time are significantly more constrained than resources at training time. This paper studies a general paradigm, called Differentiable ARchitecture Compression (DARC), that combines model compression and architecture search to learn models that are resource-efficient at inference time. Given a resource-intensive base architecture, DARC utilizes th…
▽ More
In many learning situations, resources at inference time are significantly more constrained than resources at training time. This paper studies a general paradigm, called Differentiable ARchitecture Compression (DARC), that combines model compression and architecture search to learn models that are resource-efficient at inference time. Given a resource-intensive base architecture, DARC utilizes the training data to learn which sub-components can be replaced by cheaper alternatives. The high-level technique can be applied to any neural architecture, and we report experiments on state-of-the-art convolutional neural networks for image classification. For a WideResNet with $97.2\%$ accuracy on CIFAR-10, we improve single-sample inference speed by $2.28\times$ and memory footprint by $5.64\times$, with no accuracy loss. For a ResNet with $79.15\%$ Top1 accuracy on ImageNet, we improve batch inference speed by $1.29\times$ and memory footprint by $3.57\times$ with $1\%$ accuracy loss. We also give theoretical Rademacher complexity bounds in simplified cases, showing how DARC avoids overfitting despite over-parameterization.
△ Less
Submitted 20 May, 2019;
originally announced May 2019.
-
One-Shot Session Recommendation Systems with Combinatorial Items
Authors:
Yahel David,
Dotan Di Castro,
Zohar Karnin
Abstract:
In recent years, content recommendation systems in large websites (or \emph{content providers}) capture an increased focus. While the type of content varies, e.g.\ movies, articles, music, advertisements, etc., the high level problem remains the same. Based on knowledge obtained so far on the user, recommend the most desired content. In this paper we present a method to handle the well known user-…
▽ More
In recent years, content recommendation systems in large websites (or \emph{content providers}) capture an increased focus. While the type of content varies, e.g.\ movies, articles, music, advertisements, etc., the high level problem remains the same. Based on knowledge obtained so far on the user, recommend the most desired content. In this paper we present a method to handle the well known user-cold-start problem in recommendation systems. In this scenario, a recommendation system encounters a new user and the objective is to present items as relevant as possible with the hope of kee** the user's session as long as possible. We formulate an optimization problem aimed to maximize the length of this initial session, as this is believed to be the key to have the user come back and perhaps register to the system. In particular, our model captures the fact that a single round with low quality recommendation is likely to terminate the session. In such a case, we do not proceed to the next round as the user leaves the system, possibly never to seen again. We denote this phenomenon a \emph{One-Shot Session}. Our optimization problem is formulated as an MDP where the action space is of a combinatorial nature as we recommend in each round, multiple items. This huge action space presents a computational challenge making the straightforward solution intractable. We analyze the structure of the MDP to prove monotone and submodular like properties that allow a computationally efficient solution via a method denoted by \emph{Greedy Value Iteration} (G-VI).
△ Less
Submitted 5 July, 2016;
originally announced July 2016.
-
Near-Optimal Entrywise Sampling for Data Matrices
Authors:
Dimitris Achlioptas,
Zohar Karnin,
Edo Liberty
Abstract:
We consider the problem of selecting non-zero entries of a matrix $A$ in order to produce a sparse sketch of it, $B$, that minimizes $\|A-B\|_2$. For large $m \times n$ matrices, such that $n \gg m$ (for example, representing $n$ observations over $m$ attributes) we give sampling distributions that exhibit four important properties. First, they have closed forms computable from minimal information…
▽ More
We consider the problem of selecting non-zero entries of a matrix $A$ in order to produce a sparse sketch of it, $B$, that minimizes $\|A-B\|_2$. For large $m \times n$ matrices, such that $n \gg m$ (for example, representing $n$ observations over $m$ attributes) we give sampling distributions that exhibit four important properties. First, they have closed forms computable from minimal information regarding $A$. Second, they allow sketching of matrices whose non-zeros are presented to the algorithm in arbitrary order as a stream, with $O(1)$ computation per non-zero. Third, the resulting sketch matrices are not only sparse, but their non-zero entries are highly compressible. Lastly, and most importantly, under mild assumptions, our distributions are provably competitive with the optimal offline distribution. Note that the probabilities in the optimal offline distribution may be complex functions of all the entries in the matrix. Therefore, regardless of computational complexity, the optimal distribution might be impossible to compute in the streaming model.
△ Less
Submitted 19 November, 2013;
originally announced November 2013.