-
Solving Large-Scale Granular Resource Allocation Problems Efficiently with POP
Authors:
Deepak Narayanan,
Fiodar Kazhamiaka,
Firas Abuzaid,
Peter Kraft,
Akshay Agrawal,
Srikanth Kandula,
Stephen Boyd,
Matei Zaharia
Abstract:
Resource allocation problems in many computer systems can be formulated as mathematical optimization problems. However, finding exact solutions to these problems using off-the-shelf solvers is often intractable for large problem sizes with tight SLAs, leading system designers to rely on cheap, heuristic algorithms. We observe, however, that many allocation problems are granular: they consist of a large number of clients and resources, each client requests a small fraction of the total number of resources, and clients can interchangeably use different resources. For these problems, we propose an alternative approach that reuses the original optimization problem formulation and leads to better allocations than domain-specific heuristics. Our technique, Partitioned Optimization Problems (POP), randomly splits the problem into smaller problems (with a subset of the clients and resources in the system) and coalesces the resulting sub-allocations into a global allocation for all clients. We provide theoretical and empirical evidence as to why random partitioning works well. In our experiments, POP achieves allocations within 1.5% of the optimal with orders-of-magnitude improvements in runtime compared to existing systems for cluster scheduling, traffic engineering, and load balancing.
Submitted 22 October, 2021;
originally announced October 2021.
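The partition-and-coalesce idea in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' implementation: `solve_subproblem` is a hypothetical greedy stand-in for an off-the-shelf solver, `pop_allocate` is a hypothetical name, and the toy problem (scalar demands and capacities) is far simpler than the cluster scheduling, traffic engineering, and load balancing problems the paper evaluates.

```python
import random


def solve_subproblem(demands, capacities):
    """Hypothetical stand-in for a real solver: greedy first-fit.

    demands: {client: amount requested}
    capacities: {resource: amount available}
    Returns {(client, resource): amount granted}.
    """
    remaining = dict(capacities)
    allocation = {}
    for client, demand in demands.items():
        for resource in remaining:
            if demand <= 0:
                break
            granted = min(demand, remaining[resource])
            if granted > 0:
                allocation[(client, resource)] = granted
                remaining[resource] -= granted
                demand -= granted
    return allocation


def pop_allocate(demands, capacities, k, seed=0):
    """Randomly split clients and resources into k sub-problems,
    solve each independently, and merge the sub-allocations."""
    rng = random.Random(seed)
    clients = list(demands)
    resources = list(capacities)
    rng.shuffle(clients)
    rng.shuffle(resources)

    allocation = {}
    for i in range(k):
        # Sub-problem i sees every k-th client and every k-th resource.
        sub_demands = {c: demands[c] for c in clients[i::k]}
        sub_capacities = {r: capacities[r] for r in resources[i::k]}
        # The sub-problems are disjoint, so coalescing is a plain union.
        allocation.update(solve_subproblem(sub_demands, sub_capacities))
    return allocation


demands = {f"c{i}": 1.0 for i in range(8)}
capacities = {f"r{i}": 2.0 for i in range(4)}
allocation = pop_allocate(demands, capacities, k=2)
```

In this toy setup each sub-problem receives half of the clients and half of the resources, which mirrors the granularity assumption in the abstract: many small, interchangeable clients and resources make a random split roughly balanced.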
-
Don't Give Up on Large Optimization Problems; POP Them!
Authors:
Deepak Narayanan,
Fiodar Kazhamiaka,
Firas Abuzaid,
Peter Kraft,
Matei Zaharia
Abstract:
Resource allocation problems in many computer systems can be formulated as mathematical optimization problems. However, finding exact solutions to these problems using off-the-shelf solvers in an online setting is often intractable for "hyper-scale" system sizes with tight SLAs, leading system designers to rely on cheap, heuristic algorithms. In this work, we explore an alternative approach that reuses the original optimization problem formulation. By splitting the original problem into smaller, more tractable problems for subsets of the system and then coalescing resulting sub-allocations into a global solution, we achieve empirically quasi-optimal (within 1.5%) performance for multiple domains with several orders-of-magnitude improvement in runtime. Deciding how to split a large problem into smaller sub-problems, and how to coalesce split allocations into a unified allocation, needs to be performed carefully in a domain-aware way. We show common principles for splitting problems effectively across a variety of tasks, including cluster scheduling, traffic engineering, and load balancing.
Submitted 13 April, 2021;
originally announced April 2021.
-
To Index or Not to Index: Optimizing Exact Maximum Inner Product Search
Authors:
Firas Abuzaid,
Geet Sethi,
Peter Bailis,
Matei Zaharia
Abstract:
Exact Maximum Inner Product Search (MIPS) is an important task that is widely pertinent to recommender systems and high-dimensional similarity search. The brute-force approach to solving exact MIPS is computationally expensive, thus spurring recent development of novel indexes and pruning techniques for this task. In this paper, we show that a hardware-efficient brute-force approach, blocked matrix multiply (BMM), can outperform the state-of-the-art MIPS solvers by over an order of magnitude, for some -- but not all -- inputs.
In this paper, we also present a novel MIPS solution, MAXIMUS, that takes advantage of hardware efficiency and pruning of the search space. Like BMM, MAXIMUS is faster than other solvers by up to an order of magnitude, but again only for some inputs. Since no single solution offers the best runtime performance for all inputs, we introduce a new data-dependent optimizer, OPTIMUS, that selects online with minimal overhead the best MIPS solver for a given input. Together, OPTIMUS and MAXIMUS outperform state-of-the-art MIPS solvers by 3.2$\times$ on average, and up to 10.9$\times$, on widely studied MIPS datasets.
Submitted 14 March, 2019; v1 submitted 5 June, 2017;
originally announced June 2017.
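As a rough illustration of the brute-force baseline described in the abstract above, the sketch below computes exact MIPS by scoring every user vector against the item vectors one block at a time. It shows the structure only and rests on assumptions: the function name `mips_blocked`, the choice to block over items, and the block size are all hypothetical, and a hardware-efficient version would hand each block to an optimized matrix-multiply routine rather than Python loops.

```python
def mips_blocked(users, items, block_size=2):
    """For each user vector, return the index of the item vector
    with the largest inner product (exact, brute force)."""
    best = [(float("-inf"), -1)] * len(users)
    for start in range(0, len(items), block_size):
        block = items[start:start + block_size]
        # In a real system this double loop would be one
        # (users x block) matrix multiplication.
        for u, user in enumerate(users):
            for j, item in enumerate(block):
                score = sum(a * b for a, b in zip(user, item))
                if score > best[u][0]:
                    best[u] = (score, start + j)
    return [index for _, index in best]


users = [[1, 0], [0, 1], [1, 1]]
items = [[2, 0], [0, 3], [1, 1], [-1, -1]]
top_items = mips_blocked(users, items)
```

Ties are broken in favor of the lowest item index, since a later item replaces the current best only when its score is strictly larger.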
-
NoScope: Optimizing Neural Network Queries over Video at Scale
Authors:
Daniel Kang,
John Emmons,
Firas Abuzaid,
Peter Bailis,
Matei Zaharia
Abstract:
Recent advances in computer vision, in the form of deep neural networks, have made it possible to query increasing volumes of video data with high accuracy. However, neural network inference is computationally expensive at scale: applying a state-of-the-art object detector in real time (i.e., 30+ frames per second) to a single video requires a $4000 GPU. In response, we present NoScope, a system for querying videos that can reduce the cost of neural network video analysis by up to three orders of magnitude via inference-optimized model search. Given a target video, object to detect, and reference neural network, NoScope automatically searches for and trains a sequence, or cascade, of models that preserves the accuracy of the reference network but is specialized to the target video and is therefore far less computationally expensive. NoScope cascades two types of models: specialized models that forgo the full generality of the reference model but faithfully mimic its behavior for the target video and object; and difference detectors that highlight temporal differences across frames. We show that the optimal cascade architecture differs across videos and objects, so NoScope uses an efficient cost-based optimizer to search across models and cascades. With this approach, NoScope achieves speed-ups of two to three orders of magnitude (265-15,500x real-time) on binary classification tasks over fixed-angle webcam and surveillance video while maintaining accuracy within 1-5% of state-of-the-art neural networks.
Submitted 8 August, 2017; v1 submitted 7 March, 2017;
originally announced March 2017.
-
Caffe con Troll: Shallow Ideas to Speed Up Deep Learning
Authors:
Stefan Hadjis,
Firas Abuzaid,
Ce Zhang,
Christopher Ré
Abstract:
We present Caffe con Troll (CcT), a fully compatible end-to-end version of the popular framework Caffe with rebuilt internals. We built CcT to examine the performance characteristics of training and deploying general-purpose convolutional neural networks across different hardware architectures. We find that, by employing standard batching optimizations for CPU training, we achieve a 4.5x throughput improvement over Caffe on popular networks like CaffeNet. Moreover, with these improvements, the end-to-end training time for CNNs is directly proportional to the FLOPS delivered by the CPU, which enables us to efficiently train hybrid CPU-GPU systems for CNNs.
Submitted 26 May, 2015; v1 submitted 16 April, 2015;
originally announced April 2015.