Search | arXiv e-print repository

Execution Templates: Caching Control Plane Decisions for Strong Scaling of Data Analytics

Authors: Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis

Abstract: Control planes of cloud frameworks trade off between scheduling granularity and performance. Centralized systems schedule at task granularity, but only schedule a few thousand tasks per second. Distributed systems schedule hundreds of thousands of tasks per second but changing the schedule is costly. We present execution templates, a control plane abstraction that can schedule hundreds of thousa… ▽ More Control planes of cloud frameworks trade off between scheduling granularity and performance. Centralized systems schedule at task granularity, but only schedule a few thousand tasks per second. Distributed systems schedule hundreds of thousands of tasks per second but changing the schedule is costly. We present execution templates, a control plane abstraction that can schedule hundreds of thousands of tasks per second while supporting fine-grained, per-task scheduling decisions. Execution templates leverage a program's repetitive control flow to cache blocks of frequently-executed tasks. Executing a task in a template requires sending a single message. Large-scale scheduling changes install new templates, while small changes apply edits to existing templates. Evaluations of execution templates in Nimbus, a data analytics framework, find that they provide the fine-grained scheduling flexibility of centralized control planes while matching the strong scaling of distributed ones. Execution templates support complex, real-world applications, such as a fluid simulation with a triply nested loop and data dependent branches. △ Less

Submitted 3 May, 2017; originally announced May 2017.

Comments: To appear at USENIX ATC 2017

arXiv:1606.01972 [pdf, other]

Scalable, Fast Cloud Computing with Execution Templates

Authors: Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis

Abstract: Large scale cloud data analytics applications are often CPU bound. Most of these cycles are wasted: benchmarks written in C++ run 10-51 times faster than frameworks such as Naiad and Spark. However, calling faster implementations from those frameworks only sees moderate (3-5x) speedups because their control planes cannot schedule work fast enough. This paper presents execution templates, a contr… ▽ More Large scale cloud data analytics applications are often CPU bound. Most of these cycles are wasted: benchmarks written in C++ run 10-51 times faster than frameworks such as Naiad and Spark. However, calling faster implementations from those frameworks only sees moderate (3-5x) speedups because their control planes cannot schedule work fast enough. This paper presents execution templates, a control plane abstraction for CPU-bound cloud applications, such as machine learning. Execution templates leverage highly repetitive control flow to cache scheduling decisions as {\it templates}. Rather than reschedule hundreds of thousands of tasks on every loop execution, nodes instantiate these templates. A controller's template specifies the execution across all worker nodes, which it partitions into per-worker templates. To ensure that templates execute correctly, controllers dynamically patch templates to match program control flow. We have implemented execution templates in Nimbus, a C++ cloud computing framework. Running in Nimbus, analytics benchmarks can run 16-43 times faster than in Naiad and Spark. Nimbus's control plane can scale out to run these faster benchmarks on up to 100 nodes (800 cores). △ Less

Submitted 6 June, 2016; originally announced June 2016.

arXiv:1606.01966 [pdf, other]

Distributed Graphical Simulation in the Cloud

Authors: Omid Mashayekhi, Chinmayee Shah, Hang Qu, Andrew Lim, Philip Levis

Abstract: Graphical simulations are a cornerstone of modern media and films. But existing software packages are designed to run on HPC nodes, and perform poorly in the computing cloud. These simulations have complex data access patterns over complex data structures, and mutate data arbitrarily, and so are a poor fit for existing cloud computing systems. We describe a software architecture for running graphi… ▽ More Graphical simulations are a cornerstone of modern media and films. But existing software packages are designed to run on HPC nodes, and perform poorly in the computing cloud. These simulations have complex data access patterns over complex data structures, and mutate data arbitrarily, and so are a poor fit for existing cloud computing systems. We describe a software architecture for running graphical simulations in the cloud that decouples control logic, computations and data exchanges. This allows a central controller to balance load by redistributing computations, and recover from failures. Evaluations show that the architecture can run existing, state-of-the-art simulations in the presence of stragglers and failures, thereby enabling this large class of applications to use the computing cloud for the first time. △ Less

Submitted 6 June, 2016; originally announced June 2016.

arXiv:1602.01412

Canary: A Scheduling Architecture for High Performance Cloud Computing

Authors: Hang Qu, Omid Mashayekhi, David Terei, Philip Levis

Abstract: We present Canary, a scheduling architecture that allows high performance analytics workloads to scale out to run on thousands of cores. Canary is motivated by the observation that a central scheduler is a bottleneck for high performance codes: a handful of multicore workers can execute tasks faster than a controller can schedule them. The key insight in Canary is to reverse the responsibilities… ▽ More We present Canary, a scheduling architecture that allows high performance analytics workloads to scale out to run on thousands of cores. Canary is motivated by the observation that a central scheduler is a bottleneck for high performance codes: a handful of multicore workers can execute tasks faster than a controller can schedule them. The key insight in Canary is to reverse the responsibilities between controllers and workers. Rather than dispatch tasks to workers, which then fetch data as necessary, in Canary the controller assigns data partitions to workers, which then spawn and schedule tasks locally. We evaluate three benchmark applications in Canary on up to 64 servers and 1,152 cores on Amazon EC2. Canary achieves up to 9-90X speedup over Spark and up to 4X speedup over GraphX, a highly optimized graph analytics engine. While current centralized schedulers can schedule 2,500 tasks/second, each Canary worker can schedule 136,000 tasks/second per core and experiments show this scales out linearly, with 64 workers scheduling over 120 million tasks per second, allowing Canary to support optimized jobs running on thousands of cores. △ Less

Submitted 14 April, 2016; v1 submitted 3 February, 2016; originally announced February 2016.

Comments: We have some presentation issues with the paper

arXiv:1104.4612 [pdf, ps, other]

New Power Estimation Methods for Highly Overloaded Synchronous CDMA Systems

Authors: Damoun Nashtaali, Omid Mashayekhi, Pedram Pad, Seyed Reza Moghadasi, Farokh Marvasti

Abstract: In CDMA systems, the received user powers vary due to moving distance of users. Thus, the CDMA receivers consist of two stages. The first stage is the power estimator and the second one is a Multi-User Detector (MUD). Conventional methods for estimating the user powers are suitable for underor fully-loaded cases (when the number of users is less than or equal to the spreading gain). These methods… ▽ More In CDMA systems, the received user powers vary due to moving distance of users. Thus, the CDMA receivers consist of two stages. The first stage is the power estimator and the second one is a Multi-User Detector (MUD). Conventional methods for estimating the user powers are suitable for underor fully-loaded cases (when the number of users is less than or equal to the spreading gain). These methods fail to work for overloaded CDMA systems because of high interference among the users. Since the bandwidth is becoming more and more valuable, it is worth considering overloaded CDMA systems. In this paper, an optimum user power estimation for over-loaded CDMA systems with Gaussian inputs is proposed. We also introduce a suboptimum method with lower complexity whose performance is very close to the optimum one. We shall show that the proposed methods work for highly over-loaded systems (up to m(m + 1) =2 users for a system with only m chips). The performance of the proposed methods is demonstrated by simulations. In addition, a class of signature sets is proposed that seems to be optimum from a power estimation point of view. Additionally, an iterative estimation for binary input CDMA systems is proposed which works more accurately than the optimal Gaussian input method. △ Less

Submitted 24 April, 2011; originally announced April 2011.

Showing 1–5 of 5 results for author: Mashayekhi, O