Search | arXiv e-print repository

SlipStream: Adapting Pipelines for Distributed Training of Large DNNs Amid Failures

Authors: Swapnil Gandhi, Mark Zhao, Athinagoras Skiadopoulos, Christos Kozyrakis

Abstract: Training large Deep Neural Network (DNN) models requires thousands of GPUs for days or weeks at a time. At these scales, failures are frequent and can have a big impact on training throughput. Restoring performance using spare GPU servers becomes increasingly expensive as models grow. SlipStream is a system for efficient DNN training in the presence of failures, without using spare servers. It exp… ▽ More Training large Deep Neural Network (DNN) models requires thousands of GPUs for days or weeks at a time. At these scales, failures are frequent and can have a big impact on training throughput. Restoring performance using spare GPU servers becomes increasingly expensive as models grow. SlipStream is a system for efficient DNN training in the presence of failures, without using spare servers. It exploits the functional redundancy inherent in distributed training systems -- servers hold the same model parameters across data-parallel groups -- as well as the bubbles in the pipeline schedule within each data-parallel group. SlipStream dynamically re-routes the work of a failed server to its data-parallel peers, ensuring continuous training despite multiple failures. However, re-routing work leads to imbalances across pipeline stages that degrades training throughput. SlipStream introduces two optimizations that allow re-routed work to execute within bubbles of the original pipeline schedule. First, it decouples the backward pass computation into two phases. Second, it staggers the execution of the optimizer step across pipeline stages. Combined, these optimizations enable schedules that minimize or even eliminate training throughput degradation during failures. We describe a prototype for SlipStream and show that it achieves high training throughput under multiple failures, outperforming recent proposals for fault-tolerant training such as Oobleck and Bamboo by up to 1.46x and 1.64x, respectively. △ Less

Submitted 22 May, 2024; originally announced May 2024.

arXiv:2401.08895 [pdf, other]

cedar: Composable and Optimized Machine Learning Input Data Pipelines

Authors: Mark Zhao, Emanuel Adamiak, Christos Kozyrakis

Abstract: The input data pipeline is an essential component of each machine learning (ML) training job. It is responsible for reading massive amounts of training data, processing batches of samples using complex transformations, and loading them onto training nodes at low latency and high throughput. Performant input data systems are becoming increasingly critical, driven by skyrocketing data volumes and tr… ▽ More The input data pipeline is an essential component of each machine learning (ML) training job. It is responsible for reading massive amounts of training data, processing batches of samples using complex transformations, and loading them onto training nodes at low latency and high throughput. Performant input data systems are becoming increasingly critical, driven by skyrocketing data volumes and training throughput demands. Unfortunately, current input data systems cannot fully leverage key performance optimizations, resulting in hugely inefficient infrastructures that require significant resources -- or worse -- underutilize expensive accelerators. To address these demands, we present cedar, a programming model and framework that allows users to easily build, optimize, and execute input data pipelines. cedar presents an easy-to-use programming interface, allowing users to define input data pipelines using composable operators that support arbitrary ML frameworks and libraries. Meanwhile, cedar transparently applies a complex and extensible set of optimization techniques (e.g., offloading, caching, prefetching, fusion, and reordering). It then orchestrates processing across a customizable set of local and distributed compute resources in order to maximize processing performance and efficiency, all without user input. On average across six diverse input data pipelines, cedar achieves a 2.49x, 1.87x, 2.18x, and 2.74x higher performance compared to tf.data, tf.data service, Ray Data, and PyTorch's DataLoader, respectively. △ Less

Submitted 25 January, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

arXiv:2312.07104 [pdf, other]

SGLang: Efficient Execution of Structured Language Model Programs

Authors: Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng

Abstract: Large language models (LLMs) are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems are lacking for programming and executing these applications. We introduce SGLang, a system for efficient execution of complex language model programs. SGLang consists of a frontend langua… ▽ More Large language models (LLMs) are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems are lacking for programming and executing these applications. We introduce SGLang, a system for efficient execution of complex language model programs. SGLang consists of a frontend language and a runtime. The frontend simplifies programming with primitives for generation and parallelism control. The runtime accelerates execution with novel optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. Experiments show that SGLang achieves up to 6.4x higher throughput compared to state-of-the-art inference systems on various large language and multi-modal models on tasks including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat. The code is publicly available at https://github.com/sgl-project/sglang △ Less

Submitted 5 June, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

arXiv:2305.03785 [pdf, other]

Zelda: Video Analytics using Vision-Language Models

Authors: Francisco Romero, Caleb Winston, Johann Hauswald, Matei Zaharia, Christos Kozyrakis

Abstract: Advances in ML have motivated the design of video analytics systems that allow for structured queries over video datasets. However, existing systems limit query expressivity, require users to specify an ML model per predicate, rely on complex optimizations that trade off accuracy for performance, and return large amounts of redundant and low-quality results. This paper focuses on the recently deve… ▽ More Advances in ML have motivated the design of video analytics systems that allow for structured queries over video datasets. However, existing systems limit query expressivity, require users to specify an ML model per predicate, rely on complex optimizations that trade off accuracy for performance, and return large amounts of redundant and low-quality results. This paper focuses on the recently developed Vision-Language Models (VLMs) that allow users to query images using natural language like "cars during daytime at traffic intersections." Through an in-depth analysis, we show VLMs address three limitations of current video analytics systems: general expressivity, a single general purpose model to query many predicates, and are both simple and fast. However, VLMs still return large numbers of redundant and low-quality results that can overwhelm and burden users. In addition, VLMs often require manual prompt engineering to improve result relevance. We present Zelda: a video analytics system that uses VLMs to return both relevant and semantically diverse results for top-K queries on large video datasets. Zelda prompts the VLM with the user's query in natural language. Zelda then automatically adds discriminator and synonym terms to boost accuracy, and terms to identify low-quality frames. To improve result diversity, Zelda uses semantic-rich VLM embeddings in an algorithm that prunes similar frames while considering their relevance to the query and the number of top-K results requested. We evaluate Zelda across five datasets and 19 queries and quantitatively show it achieves higher mean average precision (up to 1.15x) and improves average pairwise similarity (up to 1.16x) compared to using VLMs out-of-the-box. We also compare Zelda to a state-of-the-art video analytics engine and show that Zelda retrieves results 7.5x (up to 10.4x) faster for the same accuracy and frame diversity. △ Less

Submitted 7 November, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

arXiv:2301.02959 [pdf, other]

FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models

Authors: Geet Sethi, Pallab Bhattacharya, Dhruv Choudhary, Carole-Jean Wu, Christos Kozyrakis

Abstract: Sequence-based deep learning recommendation models (DLRMs) are an emerging class of DLRMs showing great improvements over their prior sum-pooling based counterparts at capturing users' long term interests. These improvements come at immense system cost however, with sequence-based DLRMs requiring substantial amounts of data to be dynamically materialized and communicated by each accelerator during… ▽ More Sequence-based deep learning recommendation models (DLRMs) are an emerging class of DLRMs showing great improvements over their prior sum-pooling based counterparts at capturing users' long term interests. These improvements come at immense system cost however, with sequence-based DLRMs requiring substantial amounts of data to be dynamically materialized and communicated by each accelerator during a single iteration. To address this rapidly growing bottleneck, we present FlexShard, a new tiered sequence embedding table sharding algorithm which operates at a per-row granularity by exploiting the insight that not every row is equal. Through precise replication of embedding rows based on their underlying probability distribution, along with the introduction of a new sharding strategy adapted to the heterogeneous, skewed performance of real-world cluster network topologies, FlexShard is able to significantly reduce communication demand while using no additional memory compared to the prior state-of-the-art. When evaluated on production-scale sequence DLRMs, FlexShard was able to reduce overall global all-to-all communication traffic by over 85%, resulting in end-to-end training communication latency improvements of almost 6x over the prior state-of-the-art approach. △ Less

Submitted 7 January, 2023; originally announced January 2023.

arXiv:2212.14161 [pdf, other]

Transactions Make Debugging Easy

Authors: Qian Li, Peter Kraft, Michael Cafarella, Çağatay Demiralp, Goetz Graefe, Christos Kozyrakis, Michael Stonebraker, Lalith Suresh, Matei Zaharia

Abstract: We propose TROD, a novel transaction-oriented framework for debugging modern distributed web applications and online services. Our critical insight is that if applications store all state in databases and only access state transactionally, TROD can use lightweight always-on tracing to track the history of application state changes and data provenance, and then leverage the captured traces and tran… ▽ More We propose TROD, a novel transaction-oriented framework for debugging modern distributed web applications and online services. Our critical insight is that if applications store all state in databases and only access state transactionally, TROD can use lightweight always-on tracing to track the history of application state changes and data provenance, and then leverage the captured traces and transaction logs to faithfully replay or even test modified code retroactively on any past event. We demonstrate how TROD can simplify programming and debugging in production applications, list several research challenges and directions, and encourage the database and systems communities to drastically rethink the synergy between the way people develop and debug applications. △ Less

Submitted 28 December, 2022; originally announced December 2022.

Comments: CIDR'23

arXiv:2211.05239 [pdf, other]

RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure

Authors: Mark Zhao, Dhruv Choudhary, Devashish Tyagi, Ajay Somani, Max Kaplan, Sung-Han Lin, Sarunya Pumma, Jongsoo Park, Aarti Basant, Niket Agarwal, Carole-Jean Wu, Christos Kozyrakis

Abstract: We present RecD (Recommendation Deduplication), a suite of end-to-end infrastructure optimizations across the Deep Learning Recommendation Model (DLRM) training pipeline. RecD addresses immense storage, preprocessing, and training overheads caused by feature duplication inherent in industry-scale DLRM training datasets. Feature duplication arises because DLRM datasets are generated from interactio… ▽ More We present RecD (Recommendation Deduplication), a suite of end-to-end infrastructure optimizations across the Deep Learning Recommendation Model (DLRM) training pipeline. RecD addresses immense storage, preprocessing, and training overheads caused by feature duplication inherent in industry-scale DLRM training datasets. Feature duplication arises because DLRM datasets are generated from interactions. While each user session can generate multiple training samples, many features' values do not change across these samples. We demonstrate how RecD exploits this property, end-to-end, across a deployed training pipeline. RecD optimizes data generation pipelines to decrease dataset storage and preprocessing resource demands and to maximize duplication within a training batch. RecD introduces a new tensor format, InverseKeyedJaggedTensors (IKJTs), to deduplicate feature values in each batch. We show how DLRM model architectures can leverage IKJTs to drastically increase training throughput. RecD improves the training and preprocessing throughput and storage efficiency by up to 2.48x, 1.79x, and 3.71x, respectively, in an industry-scale DLRM training system. △ Less

Submitted 1 May, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

Comments: Published in the Proceedings of the Sixth Conference on Machine Learning and Systems (MLSys 2023)

arXiv:2208.13068 [pdf, other]

Apiary: A DBMS-Integrated Transactional Function-as-a-Service Framework

Authors: Peter Kraft, Qian Li, Kostis Kaffes, Athinagoras Skiadopoulos, Deeptaanshu Kumar, Danny Cho, Jason Li, Robert Redmond, Nathan Weckwerth, Brian Xia, Peter Bailis, Michael Cafarella, Goetz Graefe, Jeremy Kepner, Christos Kozyrakis, Michael Stonebraker, Lalith Suresh, Xiangyao Yu, Matei Zaharia

Abstract: Developers increasingly use function-as-a-service (FaaS) platforms for data-centric applications that perform low-latency and transactional operations on data, such as for microservices or web serving. Unfortunately, existing FaaS platforms support these applications poorly because they physically and logically separate application logic, executed in cloud functions, from data management, done in… ▽ More Developers increasingly use function-as-a-service (FaaS) platforms for data-centric applications that perform low-latency and transactional operations on data, such as for microservices or web serving. Unfortunately, existing FaaS platforms support these applications poorly because they physically and logically separate application logic, executed in cloud functions, from data management, done in interactive transactions accessing remote storage. Physical separation harms performance while logical separation complicates efficiently providing transactional guarantees and fault tolerance. This paper introduces Apiary, a novel DBMS-integrated FaaS platform for deploying and composing fault-tolerant transactional functions. Apiary physically co-locates and logically integrates function execution and data management by wrap** a distributed DBMS engine and using it as a unified runtime for function execution, data management, and operational logging, thus providing similar or stronger transactional guarantees as comparable systems while greatly improving performance and observability. To allow developers to write complex stateful programs, we leverage this integration to enable efficient and fault-tolerant function composition, building a frontend for orchestrating workflows of functions with the guarantees that each workflow runs to completion and each function in a workflow executes exactly once. We evaluate Apiary against research and production FaaS platforms and show it outperforms them by 2--68x on microservice workloads by reducing communication overhead. △ Less

Submitted 30 June, 2023; v1 submitted 27 August, 2022; originally announced August 2022.

Comments: 14 pages, 13 figures, 3 tables. Preprint

arXiv:2201.10477 [pdf, other]

SOL: Safe On-Node Learning in Cloud Platforms

Authors: Yawen Wang, Daniel Crankshaw, Neeraja J. Yadwadkar, Daniel Berger, Christos Kozyrakis, Ricardo Bianchini

Abstract: Cloud platforms run many software agents on each server node. These agents manage all aspects of node operation, and in some cases frequently collect data and make decisions. Unfortunately, their behavior is typically based on pre-defined static heuristics or offline analysis; they do not leverage on-node machine learning (ML). In this paper, we first characterize the spectrum of node agents in Az… ▽ More Cloud platforms run many software agents on each server node. These agents manage all aspects of node operation, and in some cases frequently collect data and make decisions. Unfortunately, their behavior is typically based on pre-defined static heuristics or offline analysis; they do not leverage on-node machine learning (ML). In this paper, we first characterize the spectrum of node agents in Azure, and identify the classes of agents that are most likely to benefit from on-node ML. We then propose SOL, an extensible framework for designing ML-based agents that are safe and robust to the range of failure conditions that occur in production. SOL provides a simple API to agent developers and manages the scheduling and running of the agent-specific functions they write. We illustrate the use of SOL by implementing three ML-based agents that manage CPU cores, node power, and memory placement. Our experiments show that (1) ML substantially improves our agents, and (2) SOL ensures that agents operate safely under a variety of failure conditions. We conclude that ML-based agents show significant potential and that SOL can help build them. △ Less

Submitted 25 January, 2022; originally announced January 2022.

arXiv:2201.10095 [pdf, other]

RecShard: Statistical Feature-Based Memory Optimization for Industry-Scale Neural Recommendation

Authors: Geet Sethi, Bilge Acun, Niket Agarwal, Christos Kozyrakis, Caroline Trippel, Carole-Jean Wu

Abstract: We propose RecShard, a fine-grained embedding table (EMB) partitioning and placement technique for deep learning recommendation models (DLRMs). RecShard is designed based on two key observations. First, not all EMBs are equal, nor all rows within an EMB are equal in terms of access patterns. EMBs exhibit distinct memory characteristics, providing performance optimization opportunities for intellig… ▽ More We propose RecShard, a fine-grained embedding table (EMB) partitioning and placement technique for deep learning recommendation models (DLRMs). RecShard is designed based on two key observations. First, not all EMBs are equal, nor all rows within an EMB are equal in terms of access patterns. EMBs exhibit distinct memory characteristics, providing performance optimization opportunities for intelligent EMB partitioning and placement across a tiered memory hierarchy. Second, in modern DLRMs, EMBs function as hash tables. As a result, EMBs display interesting phenomena, such as the birthday paradox, leaving EMBs severely under-utilized. RecShard determines an optimal EMB sharding strategy for a set of EMBs based on training data distributions and model characteristics, along with the bandwidth characteristics of the underlying tiered memory hierarchy. In doing so, RecShard achieves over 6 times higher EMB training throughput on average for capacity constrained DLRMs. The throughput increase comes from improved EMB load balance by over 12 times and from the reduced access to the slower memory by over 87 times. △ Less

Submitted 24 January, 2022; originally announced January 2022.

arXiv:2111.07226 [pdf, other]

Practical Scheduling for Real-World Serverless Computing

Authors: Kostis Kaffes, Neeraja J. Yadwadkar, Christos Kozyrakis

Abstract: Serverless computing has seen rapid growth due to the ease-of-use and cost-efficiency it provides. However, function scheduling, a critical component of serverless systems, has been overlooked. In this paper, we take a first-principles approach toward designing a scheduler that caters to the unique characteristics of serverless functions as seen in real-world deployments. We first create a taxonom… ▽ More Serverless computing has seen rapid growth due to the ease-of-use and cost-efficiency it provides. However, function scheduling, a critical component of serverless systems, has been overlooked. In this paper, we take a first-principles approach toward designing a scheduler that caters to the unique characteristics of serverless functions as seen in real-world deployments. We first create a taxonomy of scheduling policies along three dimensions. Next, we use simulation to explore the scheduling policy space for the function characteristics in a 14-day trace of Azure functions and conclude that frequently used features such as late binding and random load balancing are sub-optimal for common execution time distributions and load ranges. We use these insights to design Hermes, a scheduler for serverless functions with three key characteristics. First, to avoid head-of-line blocking due to high function execution time variability, Hermes uses a combination of early binding and processor sharing for scheduling at individual worker machines. Second, Hermes uses a hybrid load balancing approach that improves consolidation at low load while employing least-loaded balancing at high load to retain high performance. Third, Hermes is both load and locality-aware, reducing the number of cold starts compared to pure load-based policies. We implement Hermes for Apache OpenWhisk and demonstrate that, for the case of the function patterns observed both in the Azure and in other real-world traces, it achieves up to 85% lower function slowdown and 60% higher throughput compared to existing policies. △ Less

Submitted 13 November, 2021; originally announced November 2021.

arXiv:2108.09373 [pdf, other]

doi 10.1145/3470496.3533044

Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training

Authors: Mark Zhao, Niket Agarwal, Aarti Basant, Bugra Gedik, Satadru Pan, Mustafa Ozdal, Rakesh Komuravelli, Jerry Pan, Tianshu Bao, Haowei Lu, Sundaram Narayanan, Jack Langman, Kevin Wilfong, Harsha Rastogi, Carole-Jean Wu, Christos Kozyrakis, Parik Pol

Abstract: Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators (DSA) are used to train increasingly-complex deep learning models. These clusters rely on a data storage and ingestion (DSI) pipeline, responsible for storing exabytes of training data and serving it at tens of terabytes per second. As DSAs continue to push training efficiency and throughput, the DSI pipe… ▽ More Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators (DSA) are used to train increasingly-complex deep learning models. These clusters rely on a data storage and ingestion (DSI) pipeline, responsible for storing exabytes of training data and serving it at tens of terabytes per second. As DSAs continue to push training efficiency and throughput, the DSI pipeline is becoming the dominating factor that constrains the overall training performance and capacity. Innovations that improve the efficiency and performance of DSI systems and hardware are urgent, demanding a deep understanding of DSI characteristics and infrastructure at scale. This paper presents Meta's end-to-end DSI pipeline, composed of a central data warehouse built on distributed storage and a Data PreProcessing Service that scales to eliminate data stalls. We characterize how hundreds of models are collaboratively trained across geo-distributed datacenters via diverse and continuous training jobs. These training jobs read and heavily filter massive and evolving datasets, resulting in popular features and samples used across training jobs. We measure the intense network, memory, and compute resources required by each training job to preprocess samples during training. Finally, we synthesize key takeaways based on our production infrastructure characterization. These include identifying hardware bottlenecks, discussing opportunities for heterogeneous DSI hardware, motivating research in datacenter scheduling and benchmark datasets, and assimilating lessons learned in optimizing DSI infrastructure. △ Less

Submitted 22 April, 2022; v1 submitted 20 August, 2021; originally announced August 2021.

Comments: In The 49th Annual International Symposium on Computer Architecture (ISCA 2022)

arXiv:2104.13869 [pdf, other]

Faa$T: A Transparent Auto-Scaling Cache for Serverless Applications

Authors: Francisco Romero, Gohar Irfan Chaudhry, Íñigo Goiri, Pragna Gopa, Paul Batum, Neeraja J. Yadwadkar, Rodrigo Fonseca, Christos Kozyrakis, Ricardo Bianchini

Abstract: Function-as-a-Service (FaaS) has become an increasingly popular way for users to deploy their applications without the burden of managing the underlying infrastructure. However, existing FaaS platforms rely on remote storage to maintain state, limiting the set of applications that can be run efficiently. Recent caching work for FaaS platforms has tried to address this problem, but has fallen short… ▽ More Function-as-a-Service (FaaS) has become an increasingly popular way for users to deploy their applications without the burden of managing the underlying infrastructure. However, existing FaaS platforms rely on remote storage to maintain state, limiting the set of applications that can be run efficiently. Recent caching work for FaaS platforms has tried to address this problem, but has fallen short: it disregards the widely different characteristics of FaaS applications, does not scale the cache based on data access patterns, or requires changes to applications. To address these limitations, we present Faa\$T, a transparent auto-scaling distributed cache for serverless applications. Each application gets its own Faa\$T cache. After a function executes and the application becomes inactive, the cache is unloaded from memory with the application. Upon reloading for the next invocation, Faa\$T pre-warms the cache with objects likely to be accessed. In addition to traditional compute-based scaling, Faa\$T scales based on working set and object sizes to manage cache space and I/O bandwidth. We motivate our design with a comprehensive study of data access patterns in a large-scale commercial FaaS provider. We implement Faa\$T for the provider's production FaaS platform. Our experiments show that Faa\$T can improve performance by up to 92% (57% on average) for challenging applications, and reduce cost for most users compared to state-of-the-art caching systems, i.e. the cost of having to stand up additional serverful resources. △ Less

Submitted 28 April, 2021; originally announced April 2021.

Comments: 18 pages, 15 figures

arXiv:2103.03500 [pdf, other]

doi 10.1145/3503222.3507733

ShEF: Shielded Enclaves for Cloud FPGAs

Authors: Mark Zhao, Mingyu Gao, Christos Kozyrakis

Abstract: FPGAs are now used in public clouds to accelerate a wide range of applications, including many that operate on sensitive data such as financial and medical records. We present ShEF, a trusted execution environment (TEE) for cloud-based reconfigurable accelerators. ShEF is independent from CPU-based TEEs and allows secure execution under a threat model where the adversary can control all software r… ▽ More FPGAs are now used in public clouds to accelerate a wide range of applications, including many that operate on sensitive data such as financial and medical records. We present ShEF, a trusted execution environment (TEE) for cloud-based reconfigurable accelerators. ShEF is independent from CPU-based TEEs and allows secure execution under a threat model where the adversary can control all software running on the CPU connected to the FPGA, has physical access to the FPGA, and can compromise the FPGA interface logic of the cloud provider. ShEF provides a secure boot and remote attestation process that relies solely on existing FPGA mechanisms for root of trust. It also includes a Shield component that provides secure access to data while the accelerator is in use. The Shield is highly customizable and extensible, allowing users to craft a bespoke security solution that fits their accelerator's memory access patterns, bandwidth, and security requirements at minimum performance and area overheads. We describe a prototype implementation of ShEF for existing cloud FPGAs, map ShEF to a performant and secure storage application, and measure the performance benefits of customizable security using five additional accelerators. △ Less

Submitted 27 January, 2022; v1 submitted 5 March, 2021; originally announced March 2021.

arXiv:2102.01887 [pdf, other]

Llama: A Heterogeneous & Serverless Framework for Auto-Tuning Video Analytics Pipelines

Authors: Francisco Romero, Mark Zhao, Neeraja J. Yadwadkar, Christos Kozyrakis

Abstract: The proliferation of camera-enabled devices and large video repositories has led to a diverse set of video analytics applications. These applications rely on video pipelines, represented as DAGs of operations, to transform videos, process extracted metadata, and answer questions like, "Is this intersection congested?" The latency and resource efficiency of pipelines can be optimized using configur… ▽ More The proliferation of camera-enabled devices and large video repositories has led to a diverse set of video analytics applications. These applications rely on video pipelines, represented as DAGs of operations, to transform videos, process extracted metadata, and answer questions like, "Is this intersection congested?" The latency and resource efficiency of pipelines can be optimized using configurable knobs for each operation (e.g., sampling rate, batch size, or type of hardware used). However, determining efficient configurations is challenging because (a) the configuration search space is exponentially large, and (b) the optimal configuration depends on users' desired latency and cost targets, (c) input video contents may exercise different paths in the DAG and produce a variable amount intermediate results. Existing video analytics and processing systems leave it to the users to manually configure operations and select hardware resources. We present Llama: a heterogeneous and serverless framework for auto-tuning video pipelines. Given an end-to-end latency target, Llama optimizes for cost efficiency by (a) calculating a latency target for each operation invocation, and (b) dynamically running a cost-based optimizer to assign configurations across heterogeneous hardware that best meet the calculated per-invocation latency target. This makes the problem of auto-tuning large video pipelines tractable and allows us to handle input-dependent behavior, conditional branches in the DAG, and execution variability. We describe the algorithms in Llama and evaluate it on a cloud platform using serverless CPU and GPU resources. We show that compared to state-of-the-art cluster and serverless video analytics and processing systems, Llama achieves 7.8x lower latency and 16x cost reduction on average. △ Less

Submitted 28 May, 2021; v1 submitted 3 February, 2021; originally announced February 2021.

arXiv:2010.05969 [pdf, other]

RackSched: A Microsecond-Scale Scheduler for Rack-Scale Computers (Technical Report)

Authors: Hang Zhu, Kostis Kaffes, Zixu Chen, Zhenming Liu, Christos Kozyrakis, Ion Stoica, Xin **

Abstract: Low-latency online services have strict Service Level Objectives (SLOs) that require datacenter systems to support high throughput at microsecond-scale tail latency. Dataplane operating systems have been designed to scale up multi-core servers with minimal overhead for such SLOs. However, as application demands continue to increase, scaling up is not enough, and serving larger demands requires the… ▽ More Low-latency online services have strict Service Level Objectives (SLOs) that require datacenter systems to support high throughput at microsecond-scale tail latency. Dataplane operating systems have been designed to scale up multi-core servers with minimal overhead for such SLOs. However, as application demands continue to increase, scaling up is not enough, and serving larger demands requires these systems to scale out to multiple servers in a rack. We present RackSched, the first rack-level microsecond-scale scheduler that provides the abstraction of a rack-scale computer (i.e., a huge server with hundreds to thousands of cores) to an external service with network-system co-design. The core of RackSched is a two-layer scheduling framework that integrates inter-server scheduling in the top-of-rack (ToR) switch with intra-server scheduling in each server. We use a combination of analytical results and simulations to show that it provides near-optimal performance as centralized scheduling policies, and is robust for both low-dispersion and high-dispersion workloads. We design a custom switch data plane for the inter-server scheduler, which realizes power-of-k-choices, ensures request affinity, and tracks server loads accurately and efficiently. We implement a RackSched prototype on a cluster of commodity servers connected by a Barefoot Tofino switch. End-to-end experiments on a twelve-server testbed show that RackSched improves the throughput by up to 1.44x, and scales out the throughput near linearly, while maintaining the same tail latency as one server until the system is saturated. △ Less

Submitted 15 October, 2020; v1 submitted 12 October, 2020; originally announced October 2020.

arXiv:2007.11112 [pdf, other]

DBOS: A Proposal for a Data-Centric Operating System

Authors: Michael Cafarella, David DeWitt, Vijay Gadepally, Jeremy Kepner, Christos Kozyrakis, Tim Kraska, Michael Stonebraker, Matei Zaharia

Abstract: Current operating systems are complex systems that were designed before today's computing environments. This makes it difficult for them to meet the scalability, heterogeneity, availability, and security challenges in current cloud and parallel computing environments. To address these problems, we propose a radically new OS design based on data-centric architecture: all operating system state shou… ▽ More Current operating systems are complex systems that were designed before today's computing environments. This makes it difficult for them to meet the scalability, heterogeneity, availability, and security challenges in current cloud and parallel computing environments. To address these problems, we propose a radically new OS design based on data-centric architecture: all operating system state should be represented uniformly as database tables, and operations on this state should be made via queries from otherwise stateless tasks. This design makes it easy to scale and evolve the OS without whole-system refactoring, inspect and debug system state, upgrade components without downtime, manage decisions using machine learning, and implement sophisticated security features. We discuss how a database OS (DBOS) can improve the programmability and performance of many of today's most important applications and propose a plan for the development of a DBOS proof of concept. △ Less

Submitted 21 July, 2020; originally announced July 2020.

arXiv:1905.13348 [pdf, other]

INFaaS: A Model-less and Managed Inference Serving System

Authors: Francisco Romero, Qian Li, Neeraja J. Yadwadkar, Christos Kozyrakis

Abstract: Despite existing work in machine learning inference serving, ease-of-use and cost efficiency remain challenges at large scales. Developers must manually search through thousands of model-variants -- versions of already-trained models that differ in hardware, resource footprints, latencies, costs, and accuracies -- to meet the diverse application requirements. Since requirements, query load, and ap… ▽ More Despite existing work in machine learning inference serving, ease-of-use and cost efficiency remain challenges at large scales. Developers must manually search through thousands of model-variants -- versions of already-trained models that differ in hardware, resource footprints, latencies, costs, and accuracies -- to meet the diverse application requirements. Since requirements, query load, and applications themselves evolve over time, these decisions need to be made dynamically for each inference query to avoid excessive costs through naive autoscaling. To avoid navigating through the large and complex trade-off space of model-variants, developers often fix a variant across queries, and replicate it when load increases. However, given the diversity across variants and hardware platforms in the cloud, a lack of understanding of the trade-off space can incur significant costs to developers. This paper introduces INFaaS, a managed and model-less system for distributed inference serving, where developers simply specify the performance and accuracy requirements for their applications without needing to specify a specific model-variant for each query. INFaaS generates model-variants, and efficiently navigates the large trade-off space of model-variants on behalf of developers to meet application-specific objectives: (a) for each query, it selects a model, hardware architecture, and model optimizations, (b) it combines VM-level horizontal autoscaling with model-level autoscaling, where multiple, different model-variants are used to serve queries within each machine. By leveraging diverse variants and sharing hardware resources across models, INFaaS achieves 1.3x higher throughput, violates latency objectives 1.6x less often, and saves up to 21.6x in cost (8.5x on average) compared to state-of-the-art inference serving systems on AWS EC2. △ Less

Submitted 15 December, 2020; v1 submitted 30 May, 2019; originally announced May 2019.

Report number: https://www.usenix.org/system/files/atc21-romero.pdf

arXiv:1903.07754 [pdf, other]

A New Frontier for Pull-Based Graph Processing

Authors: Samuel Grossman, Christos Kozyrakis

Abstract: The trade-off between pull-based and push-based graph processing engines is well-understood. On one hand, pull-based engines can achieve higher throughput because their workloads are read-dominant, rather than write-dominant, and can proceed without synchronization between threads. On the other hand, push-based engines are much better able to take advantage of the frontier optimization, which leve… ▽ More The trade-off between pull-based and push-based graph processing engines is well-understood. On one hand, pull-based engines can achieve higher throughput because their workloads are read-dominant, rather than write-dominant, and can proceed without synchronization between threads. On the other hand, push-based engines are much better able to take advantage of the frontier optimization, which leverages the fact that often only a small subset of the graph needs to be accessed to complete an iteration of a graph processing application. Hybrid engines attempt to overcome this trade-off by dynamically switching between push and pull, but there are two key disadvantages with this approach. First, applications must be implemented twice (once for push and once for pull), and second, processing throughput is reduced for iterations that run with push. We propose a radically different solution: rebuild the frontier optimization entirely such that it is well-suited for a pull-based engine. In doing so, we remove the only advantage that a push-based engine had over a pull-based engine, making it possible to eliminate the push-based engine entirely. We introduce Wedge, a pull-only graph processing framework that transforms the traditional source-oriented vertex-based frontier into a pull-friendly format called the Wedge Frontier. The transformation itself is expensive even when parallelized, so we introduce two key optimizations to make it practical. First, we perform the transformation only when the resulting Wedge Frontier is sufficiently sparse. Second, we coarsen the granularity of the representation of elements in the Wedge Frontier. These optimizations respectively improve Wedge's performance by up to 5x and 2x, enabling it to outperform Grazelle, Ligra, and GraphMat respectively by up to 2.8x, 4.9x, and 185.5x. △ Less

Submitted 18 March, 2019; originally announced March 2019.

arXiv:1812.09442 [pdf, other]

Trevor: Automatic configuration and scaling of stream processing pipelines

Authors: Manu Bansal, Eyal Cidon, Arjun Balasingam, Aditya Gudipati, Christos Kozyrakis, Sachin Katti

Abstract: Operating a distributed data stream processing workload efficiently at scale is hard. The operator of the workload must parallelize and lay out tasks of the workload with resources that match the requirement of target data rate. The challenge is that neither the operator nor the programmer is typically aware of the scaling behavior of the workload as a function of resources. An operator manually s… ▽ More Operating a distributed data stream processing workload efficiently at scale is hard. The operator of the workload must parallelize and lay out tasks of the workload with resources that match the requirement of target data rate. The challenge is that neither the operator nor the programmer is typically aware of the scaling behavior of the workload as a function of resources. An operator manually searches for a safe operating point that can handle predicted peak load and deploys with ample headroom for absorbing unpredictable spikes. Such empirical, static over-provisioning is wasteful of both compute and human resources. We show that precise performance models can be automatically learned for distributed stream processing systems that can predict the execution performance of a job even before deployment. Further, those models can be used to optimally schedule logically specified jobs onto available physical hardware. Finally, those models and the derived execution schedules can be refined online to dynamically adapt to unpredictable changes in the runtime environment or auto-scale with variations in job load. △ Less

Submitted 21 December, 2018; originally announced December 2018.

arXiv:1809.04070 [pdf, other]

doi 10.1145/3373376.3378514

Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators

Authors: Xuan Yang, Mingyu Gao, Qiaoyi Liu, Jeff Ou Setter, **g Pu, Ankita Nayak, Steven Emberton Bell, Kaidi Cao, Heonjae Ha, Priyanka Raina, Christos Kozyrakis, Mark Horowitz

Abstract: We show that DNN accelerator micro-architectures and their program map**s represent specific choices of loop order and hardware parallelism for computing the seven nested loops of DNNs, which enables us to create a formal taxonomy of all existing dense DNN accelerators. Surprisingly, the loop transformations needed to create these hardware variants can be precisely and concisely represented by H… ▽ More We show that DNN accelerator micro-architectures and their program map**s represent specific choices of loop order and hardware parallelism for computing the seven nested loops of DNNs, which enables us to create a formal taxonomy of all existing dense DNN accelerators. Surprisingly, the loop transformations needed to create these hardware variants can be precisely and concisely represented by Halide's scheduling language. By modifying the Halide compiler to generate hardware, we create a system that can fairly compare these prior accelerators. As long as proper loop blocking schemes are used, and the hardware can support map** replicated loops, many different hardware dataflows yield similar energy efficiency with good performance. This is because the loop blocking can ensure that most data references stay on-chip with good locality and the processing units have high resource utilization. How resources are allocated, especially in the memory system, has a large impact on energy and performance. By optimizing hardware resource allocation while kee** throughput constant, we achieve up to 4.2X energy improvement for Convolutional Neural Networks (CNNs), 1.6X and 1.8X improvement for Long Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively. △ Less

Submitted 26 April, 2020; v1 submitted 10 September, 2018; originally announced September 2018.

Comments: Published as a conference paper at ASPLOS 2020

ACM Class: C.1.4; C.3; C.4

Journal ref: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, March, 2020, Pages 369-383

arXiv:1803.02329 [pdf, other]

Learning Memory Access Patterns

Authors: Milad Hashemi, Kevin Swersky, Jamie A. Smith, Grant Ayers, Heiner Litz, Jichuan Chang, Christos Kozyrakis, Parthasarathy Ranganathan

Abstract: The explosion in workload complexity and the recent slow-down in Moore's law scaling call for new approaches towards efficient computing. Researchers are now beginning to use recent advances in machine learning in software optimizations, augmenting or replacing traditional heuristics and data structures. However, the space of machine learning for computer hardware architecture is only lightly expl… ▽ More The explosion in workload complexity and the recent slow-down in Moore's law scaling call for new approaches towards efficient computing. Researchers are now beginning to use recent advances in machine learning in software optimizations, augmenting or replacing traditional heuristics and data structures. However, the space of machine learning for computer hardware architecture is only lightly explored. In this paper, we demonstrate the potential of deep learning to address the von Neumann bottleneck of memory performance. We focus on the critical problem of learning memory access patterns, with the goal of constructing accurate and efficient memory prefetchers. We relate contemporary prefetching strategies to n-gram models in natural language processing, and show how recurrent neural networks can serve as a drop-in replacement. On a suite of challenging benchmark datasets, we find that neural networks consistently demonstrate superior performance in terms of precision and recall. This work represents the first step towards practical neural-network based prefetching, and opens a wide range of exciting directions for machine learning in computer architecture research. △ Less

Submitted 6 March, 2018; originally announced March 2018.

arXiv:1711.02294 [pdf, other]

AppSwitch: Resolving the Application Identity Crisis

Authors: Dinesh Subhraveti, Sri Goli, Serge Hallyn, Ravi Chamarthy, Christos Kozyrakis

Abstract: Networked applications traditionally derive their identity from the identity of the host on which they run. The default application identity acquired from the host results in subtle and substantial problems related to application deployment, discovery and access, especially for modern distributed applications. A number of mechanisms and workarounds, often quite elaborate, are used to address those… ▽ More Networked applications traditionally derive their identity from the identity of the host on which they run. The default application identity acquired from the host results in subtle and substantial problems related to application deployment, discovery and access, especially for modern distributed applications. A number of mechanisms and workarounds, often quite elaborate, are used to address those problems but they only address them indirectly and incompletely. This paper presents AppSwitch, a novel transport layer network element that decouples applications from underlying network at the system call layer and enables them to be identified independently of the network. Without requiring changes to existing applications or infrastructure, it removes the cost and complexity associated with operating distributed applications while offering a number of benefits including an efficient implementation of common network functions such as application firewall and load balancer. Experiments with our implementation show that AppSwitch model also effectively removes the performance penalty associated with unnecessary data path processing that is typical in those application environments. △ Less

Submitted 8 November, 2017; v1 submitted 7 November, 2017; originally announced November 2017.

arXiv:1511.06968 [pdf, other]

Generating Configurable Hardware from Parallel Patterns

Authors: Raghu Prabhakar, David Koeplinger, Kevin Brown, HyoukJoong Lee, Christopher De Sa, Christos Kozyrakis, Kunle Olukotun

Abstract: In recent years the computing landscape has seen an in- creasing shift towards specialized accelerators. Field pro- grammable gate arrays (FPGAs) are particularly promising as they offer significant performance and energy improvements compared to CPUs for a wide class of applications and are far more flexible than fixed-function ASICs. However, FPGAs are difficult to program. Traditional programmi… ▽ More In recent years the computing landscape has seen an in- creasing shift towards specialized accelerators. Field pro- grammable gate arrays (FPGAs) are particularly promising as they offer significant performance and energy improvements compared to CPUs for a wide class of applications and are far more flexible than fixed-function ASICs. However, FPGAs are difficult to program. Traditional programming models for reconfigurable logic use low-level hardware description languages like Verilog and VHDL, which have none of the pro- ductivity features of modern software development languages but produce very efficient designs, and low-level software lan- guages like C and OpenCL coupled with high-level synthesis (HLS) tools that typically produce designs that are far less efficient. Functional languages with parallel patterns are a better fit for hardware generation because they both provide high-level abstractions to programmers with little experience in hard- ware design and avoid many of the problems faced when gen- erating hardware from imperative languages. In this paper, we identify two optimizations that are important when using par- allel patterns to generate hardware: tiling and metapipelining. We present a general representation of tiled parallel patterns, and provide rules for automatically tiling patterns and gen- erating metapipelines. We demonstrate experimentally that these optimizations result in speedups up to 40x on a set of benchmarks from the data analytics domain. △ Less

Submitted 22 November, 2015; originally announced November 2015.

Showing 1–24 of 24 results for author: Kozyrakis, C