-
Execution Templates: Caching Control Plane Decisions for Strong Scaling of Data Analytics
Authors:
Omid Mashayekhi,
Hang Qu,
Chinmayee Shah,
Philip Levis
Abstract:
Control planes of cloud frameworks trade off between scheduling granularity and performance. Centralized systems schedule at task granularity, but only schedule a few thousand tasks per second. Distributed systems schedule hundreds of thousands of tasks per second but changing the schedule is costly.
We present execution templates, a control plane abstraction that can schedule hundreds of thousa…
▽ More
Control planes of cloud frameworks trade off between scheduling granularity and performance. Centralized systems schedule at task granularity, but only schedule a few thousand tasks per second. Distributed systems schedule hundreds of thousands of tasks per second but changing the schedule is costly.
We present execution templates, a control plane abstraction that can schedule hundreds of thousands of tasks per second while supporting fine-grained, per-task scheduling decisions. Execution templates leverage a program's repetitive control flow to cache blocks of frequently-executed tasks. Executing a task in a template requires sending a single message. Large-scale scheduling changes install new templates, while small changes apply edits to existing templates.
Evaluations of execution templates in Nimbus, a data analytics framework, find that they provide the fine-grained scheduling flexibility of centralized control planes while matching the strong scaling of distributed ones. Execution templates support complex, real-world applications, such as a fluid simulation with a triply nested loop and data dependent branches.
△ Less
Submitted 3 May, 2017;
originally announced May 2017.
-
Scalable, Fast Cloud Computing with Execution Templates
Authors:
Omid Mashayekhi,
Hang Qu,
Chinmayee Shah,
Philip Levis
Abstract:
Large scale cloud data analytics applications are often CPU bound. Most of these cycles are wasted: benchmarks written in C++ run 10-51 times faster than frameworks such as Naiad and Spark. However, calling faster implementations from those frameworks only sees moderate (3-5x) speedups because their control planes cannot schedule work fast enough.
This paper presents execution templates, a contr…
▽ More
Large scale cloud data analytics applications are often CPU bound. Most of these cycles are wasted: benchmarks written in C++ run 10-51 times faster than frameworks such as Naiad and Spark. However, calling faster implementations from those frameworks only sees moderate (3-5x) speedups because their control planes cannot schedule work fast enough.
This paper presents execution templates, a control plane abstraction for CPU-bound cloud applications, such as machine learning. Execution templates leverage highly repetitive control flow to cache scheduling decisions as {\it templates}. Rather than reschedule hundreds of thousands of tasks on every loop execution, nodes instantiate these templates. A controller's template specifies the execution across all worker nodes, which it partitions into per-worker templates. To ensure that templates execute correctly, controllers dynamically patch templates to match program control flow. We have implemented execution templates in Nimbus, a C++ cloud computing framework. Running in Nimbus, analytics benchmarks can run 16-43 times faster than in Naiad and Spark. Nimbus's control plane can scale out to run these faster benchmarks on up to 100 nodes (800 cores).
△ Less
Submitted 6 June, 2016;
originally announced June 2016.
-
Distributed Graphical Simulation in the Cloud
Authors:
Omid Mashayekhi,
Chinmayee Shah,
Hang Qu,
Andrew Lim,
Philip Levis
Abstract:
Graphical simulations are a cornerstone of modern media and films. But existing software packages are designed to run on HPC nodes, and perform poorly in the computing cloud. These simulations have complex data access patterns over complex data structures, and mutate data arbitrarily, and so are a poor fit for existing cloud computing systems. We describe a software architecture for running graphi…
▽ More
Graphical simulations are a cornerstone of modern media and films. But existing software packages are designed to run on HPC nodes, and perform poorly in the computing cloud. These simulations have complex data access patterns over complex data structures, and mutate data arbitrarily, and so are a poor fit for existing cloud computing systems. We describe a software architecture for running graphical simulations in the cloud that decouples control logic, computations and data exchanges. This allows a central controller to balance load by redistributing computations, and recover from failures. Evaluations show that the architecture can run existing, state-of-the-art simulations in the presence of stragglers and failures, thereby enabling this large class of applications to use the computing cloud for the first time.
△ Less
Submitted 6 June, 2016;
originally announced June 2016.
-
Canary: A Scheduling Architecture for High Performance Cloud Computing
Authors:
Hang Qu,
Omid Mashayekhi,
David Terei,
Philip Levis
Abstract:
We present Canary, a scheduling architecture that allows high performance analytics workloads to scale out to run on thousands of cores. Canary is motivated by the observation that a central scheduler is a bottleneck for high performance codes: a handful of multicore workers can execute tasks faster than a controller can schedule them.
The key insight in Canary is to reverse the responsibilities…
▽ More
We present Canary, a scheduling architecture that allows high performance analytics workloads to scale out to run on thousands of cores. Canary is motivated by the observation that a central scheduler is a bottleneck for high performance codes: a handful of multicore workers can execute tasks faster than a controller can schedule them.
The key insight in Canary is to reverse the responsibilities between controllers and workers. Rather than dispatch tasks to workers, which then fetch data as necessary, in Canary the controller assigns data partitions to workers, which then spawn and schedule tasks locally.
We evaluate three benchmark applications in Canary on up to 64 servers and 1,152 cores on Amazon EC2. Canary achieves up to 9-90X speedup over Spark and up to 4X speedup over GraphX, a highly optimized graph analytics engine. While current centralized schedulers can schedule 2,500 tasks/second, each Canary worker can schedule 136,000 tasks/second per core and experiments show this scales out linearly, with 64 workers scheduling over 120 million tasks per second, allowing Canary to support optimized jobs running on thousands of cores.
△ Less
Submitted 14 April, 2016; v1 submitted 3 February, 2016;
originally announced February 2016.
-
New Power Estimation Methods for Highly Overloaded Synchronous CDMA Systems
Authors:
Damoun Nashtaali,
Omid Mashayekhi,
Pedram Pad,
Seyed Reza Moghadasi,
Farokh Marvasti
Abstract:
In CDMA systems, the received user powers vary due to moving distance of users. Thus, the CDMA receivers consist of two stages. The first stage is the power estimator and the second one is a Multi-User Detector (MUD). Conventional methods for estimating the user powers are suitable for underor fully-loaded cases (when the number of users is less than or equal to the spreading gain). These methods…
▽ More
In CDMA systems, the received user powers vary due to moving distance of users. Thus, the CDMA receivers consist of two stages. The first stage is the power estimator and the second one is a Multi-User Detector (MUD). Conventional methods for estimating the user powers are suitable for underor fully-loaded cases (when the number of users is less than or equal to the spreading gain). These methods fail to work for overloaded CDMA systems because of high interference among the users. Since the bandwidth is becoming more and more valuable, it is worth considering overloaded CDMA systems. In this paper, an optimum user power estimation for over-loaded CDMA systems with Gaussian inputs is proposed. We also introduce a suboptimum method with lower complexity whose performance is very close to the optimum one. We shall show that the proposed methods work for highly over-loaded systems (up to m(m + 1) =2 users for a system with only m chips). The performance of the proposed methods is demonstrated by simulations. In addition, a class of signature sets is proposed that seems to be optimum from a power estimation point of view. Additionally, an iterative estimation for binary input CDMA systems is proposed which works more accurately than the optimal Gaussian input method.
△ Less
Submitted 24 April, 2011;
originally announced April 2011.