Showing 1–24 of 24 results for author: Kozyrakis, C

  1. arXiv:2405.14009  [pdf, other]

    cs.DC cs.LG

    SlipStream: Adapting Pipelines for Distributed Training of Large DNNs Amid Failures

    Authors: Swapnil Gandhi, Mark Zhao, Athinagoras Skiadopoulos, Christos Kozyrakis

    Abstract: Training large Deep Neural Network (DNN) models requires thousands of GPUs for days or weeks at a time. At these scales, failures are frequent and can have a big impact on training throughput. Restoring performance using spare GPU servers becomes increasingly expensive as models grow. SlipStream is a system for efficient DNN training in the presence of failures, without using spare servers. It exp…

    Submitted 22 May, 2024; originally announced May 2024.

  2. arXiv:2401.08895  [pdf, other]

    cs.LG cs.DC cs.PF

    cedar: Composable and Optimized Machine Learning Input Data Pipelines

    Authors: Mark Zhao, Emanuel Adamiak, Christos Kozyrakis

    Abstract: The input data pipeline is an essential component of each machine learning (ML) training job. It is responsible for reading massive amounts of training data, processing batches of samples using complex transformations, and loading them onto training nodes at low latency and high throughput. Performant input data systems are becoming increasingly critical, driven by skyrocketing data volumes and tr…

    Submitted 25 January, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

  3. arXiv:2312.07104  [pdf, other]

    cs.AI cs.PL

    SGLang: Efficient Execution of Structured Language Model Programs

    Authors: Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng

    Abstract: Large language models (LLMs) are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems are lacking for programming and executing these applications. We introduce SGLang, a system for efficient execution of complex language model programs. SGLang consists of a frontend langua…

    Submitted 5 June, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

  4. arXiv:2305.03785  [pdf, other]

    cs.DB

    Zelda: Video Analytics using Vision-Language Models

    Authors: Francisco Romero, Caleb Winston, Johann Hauswald, Matei Zaharia, Christos Kozyrakis

    Abstract: Advances in ML have motivated the design of video analytics systems that allow for structured queries over video datasets. However, existing systems limit query expressivity, require users to specify an ML model per predicate, rely on complex optimizations that trade off accuracy for performance, and return large amounts of redundant and low-quality results. This paper focuses on the recently deve…

    Submitted 7 November, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

  5. arXiv:2301.02959  [pdf, other]

    cs.LG cs.DC cs.IR cs.PF

    FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models

    Authors: Geet Sethi, Pallab Bhattacharya, Dhruv Choudhary, Carole-Jean Wu, Christos Kozyrakis

    Abstract: Sequence-based deep learning recommendation models (DLRMs) are an emerging class of DLRMs showing great improvements over their prior sum-pooling based counterparts at capturing users' long term interests. These improvements come at immense system cost however, with sequence-based DLRMs requiring substantial amounts of data to be dynamically materialized and communicated by each accelerator during…

    Submitted 7 January, 2023; originally announced January 2023.

  6. arXiv:2212.14161  [pdf, other]

    cs.DB cs.DC cs.SE

    Transactions Make Debugging Easy

    Authors: Qian Li, Peter Kraft, Michael Cafarella, Çağatay Demiralp, Goetz Graefe, Christos Kozyrakis, Michael Stonebraker, Lalith Suresh, Matei Zaharia

    Abstract: We propose TROD, a novel transaction-oriented framework for debugging modern distributed web applications and online services. Our critical insight is that if applications store all state in databases and only access state transactionally, TROD can use lightweight always-on tracing to track the history of application state changes and data provenance, and then leverage the captured traces and tran…

    Submitted 28 December, 2022; originally announced December 2022.

    Comments: CIDR'23

  7. arXiv:2211.05239  [pdf, other]

    cs.LG cs.DC cs.IR cs.PF

    RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure

    Authors: Mark Zhao, Dhruv Choudhary, Devashish Tyagi, Ajay Somani, Max Kaplan, Sung-Han Lin, Sarunya Pumma, Jongsoo Park, Aarti Basant, Niket Agarwal, Carole-Jean Wu, Christos Kozyrakis

    Abstract: We present RecD (Recommendation Deduplication), a suite of end-to-end infrastructure optimizations across the Deep Learning Recommendation Model (DLRM) training pipeline. RecD addresses immense storage, preprocessing, and training overheads caused by feature duplication inherent in industry-scale DLRM training datasets. Feature duplication arises because DLRM datasets are generated from interactio…

    Submitted 1 May, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

    Comments: Published in the Proceedings of the Sixth Conference on Machine Learning and Systems (MLSys 2023)

  8. arXiv:2208.13068  [pdf, other]

    cs.DB cs.DC

    Apiary: A DBMS-Integrated Transactional Function-as-a-Service Framework

    Authors: Peter Kraft, Qian Li, Kostis Kaffes, Athinagoras Skiadopoulos, Deeptaanshu Kumar, Danny Cho, Jason Li, Robert Redmond, Nathan Weckwerth, Brian Xia, Peter Bailis, Michael Cafarella, Goetz Graefe, Jeremy Kepner, Christos Kozyrakis, Michael Stonebraker, Lalith Suresh, Xiangyao Yu, Matei Zaharia

    Abstract: Developers increasingly use function-as-a-service (FaaS) platforms for data-centric applications that perform low-latency and transactional operations on data, such as for microservices or web serving. Unfortunately, existing FaaS platforms support these applications poorly because they physically and logically separate application logic, executed in cloud functions, from data management, done in…

    Submitted 30 June, 2023; v1 submitted 27 August, 2022; originally announced August 2022.

    Comments: 14 pages, 13 figures, 3 tables. Preprint

  9. arXiv:2201.10477  [pdf, other]

    cs.OS

    SOL: Safe On-Node Learning in Cloud Platforms

    Authors: Yawen Wang, Daniel Crankshaw, Neeraja J. Yadwadkar, Daniel Berger, Christos Kozyrakis, Ricardo Bianchini

    Abstract: Cloud platforms run many software agents on each server node. These agents manage all aspects of node operation, and in some cases frequently collect data and make decisions. Unfortunately, their behavior is typically based on pre-defined static heuristics or offline analysis; they do not leverage on-node machine learning (ML). In this paper, we first characterize the spectrum of node agents in Az…

    Submitted 25 January, 2022; originally announced January 2022.

  10. arXiv:2201.10095  [pdf, other]

    cs.LG cs.AR cs.DC cs.PF

    RecShard: Statistical Feature-Based Memory Optimization for Industry-Scale Neural Recommendation

    Authors: Geet Sethi, Bilge Acun, Niket Agarwal, Christos Kozyrakis, Caroline Trippel, Carole-Jean Wu

    Abstract: We propose RecShard, a fine-grained embedding table (EMB) partitioning and placement technique for deep learning recommendation models (DLRMs). RecShard is designed based on two key observations. First, not all EMBs are equal, nor all rows within an EMB are equal in terms of access patterns. EMBs exhibit distinct memory characteristics, providing performance optimization opportunities for intellig…

    Submitted 24 January, 2022; originally announced January 2022.

  11. arXiv:2111.07226  [pdf, other]

    cs.DC

    Practical Scheduling for Real-World Serverless Computing

    Authors: Kostis Kaffes, Neeraja J. Yadwadkar, Christos Kozyrakis

    Abstract: Serverless computing has seen rapid growth due to the ease-of-use and cost-efficiency it provides. However, function scheduling, a critical component of serverless systems, has been overlooked. In this paper, we take a first-principles approach toward designing a scheduler that caters to the unique characteristics of serverless functions as seen in real-world deployments. We first create a taxonom…

    Submitted 13 November, 2021; originally announced November 2021.

  12. Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training

    Authors: Mark Zhao, Niket Agarwal, Aarti Basant, Bugra Gedik, Satadru Pan, Mustafa Ozdal, Rakesh Komuravelli, Jerry Pan, Tianshu Bao, Haowei Lu, Sundaram Narayanan, Jack Langman, Kevin Wilfong, Harsha Rastogi, Carole-Jean Wu, Christos Kozyrakis, Parik Pol

    Abstract: Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators (DSA) are used to train increasingly-complex deep learning models. These clusters rely on a data storage and ingestion (DSI) pipeline, responsible for storing exabytes of training data and serving it at tens of terabytes per second. As DSAs continue to push training efficiency and throughput, the DSI pipe…

    Submitted 22 April, 2022; v1 submitted 20 August, 2021; originally announced August 2021.

    Comments: In The 49th Annual International Symposium on Computer Architecture (ISCA 2022)

  13. arXiv:2104.13869  [pdf, other]

    cs.DC

    Faa$T: A Transparent Auto-Scaling Cache for Serverless Applications

    Authors: Francisco Romero, Gohar Irfan Chaudhry, Íñigo Goiri, Pragna Gopa, Paul Batum, Neeraja J. Yadwadkar, Rodrigo Fonseca, Christos Kozyrakis, Ricardo Bianchini

    Abstract: Function-as-a-Service (FaaS) has become an increasingly popular way for users to deploy their applications without the burden of managing the underlying infrastructure. However, existing FaaS platforms rely on remote storage to maintain state, limiting the set of applications that can be run efficiently. Recent caching work for FaaS platforms has tried to address this problem, but has fallen short…

    Submitted 28 April, 2021; originally announced April 2021.

    Comments: 18 pages, 15 figures

  14. ShEF: Shielded Enclaves for Cloud FPGAs

    Authors: Mark Zhao, Mingyu Gao, Christos Kozyrakis

    Abstract: FPGAs are now used in public clouds to accelerate a wide range of applications, including many that operate on sensitive data such as financial and medical records. We present ShEF, a trusted execution environment (TEE) for cloud-based reconfigurable accelerators. ShEF is independent from CPU-based TEEs and allows secure execution under a threat model where the adversary can control all software r…

    Submitted 27 January, 2022; v1 submitted 5 March, 2021; originally announced March 2021.

  15. arXiv:2102.01887  [pdf, other]

    cs.DC

    Llama: A Heterogeneous & Serverless Framework for Auto-Tuning Video Analytics Pipelines

    Authors: Francisco Romero, Mark Zhao, Neeraja J. Yadwadkar, Christos Kozyrakis

    Abstract: The proliferation of camera-enabled devices and large video repositories has led to a diverse set of video analytics applications. These applications rely on video pipelines, represented as DAGs of operations, to transform videos, process extracted metadata, and answer questions like, "Is this intersection congested?" The latency and resource efficiency of pipelines can be optimized using configur…

    Submitted 28 May, 2021; v1 submitted 3 February, 2021; originally announced February 2021.

  16. arXiv:2010.05969  [pdf, other]

    cs.DC

    RackSched: A Microsecond-Scale Scheduler for Rack-Scale Computers (Technical Report)

    Authors: Hang Zhu, Kostis Kaffes, Zixu Chen, Zhenming Liu, Christos Kozyrakis, Ion Stoica, Xin Jin

    Abstract: Low-latency online services have strict Service Level Objectives (SLOs) that require datacenter systems to support high throughput at microsecond-scale tail latency. Dataplane operating systems have been designed to scale up multi-core servers with minimal overhead for such SLOs. However, as application demands continue to increase, scaling up is not enough, and serving larger demands requires the…

    Submitted 15 October, 2020; v1 submitted 12 October, 2020; originally announced October 2020.

  17. arXiv:2007.11112  [pdf, other]

    cs.OS cs.AR cs.DB cs.DC cs.NI

    DBOS: A Proposal for a Data-Centric Operating System

    Authors: Michael Cafarella, David DeWitt, Vijay Gadepally, Jeremy Kepner, Christos Kozyrakis, Tim Kraska, Michael Stonebraker, Matei Zaharia

    Abstract: Current operating systems are complex systems that were designed before today's computing environments. This makes it difficult for them to meet the scalability, heterogeneity, availability, and security challenges in current cloud and parallel computing environments. To address these problems, we propose a radically new OS design based on data-centric architecture: all operating system state shou…

    Submitted 21 July, 2020; originally announced July 2020.

  18. arXiv:1905.13348  [pdf, other]

    cs.DC cs.LG

    INFaaS: A Model-less and Managed Inference Serving System

    Authors: Francisco Romero, Qian Li, Neeraja J. Yadwadkar, Christos Kozyrakis

    Abstract: Despite existing work in machine learning inference serving, ease-of-use and cost efficiency remain challenges at large scales. Developers must manually search through thousands of model-variants -- versions of already-trained models that differ in hardware, resource footprints, latencies, costs, and accuracies -- to meet the diverse application requirements. Since requirements, query load, and ap…

    Submitted 15 December, 2020; v1 submitted 30 May, 2019; originally announced May 2019.

    Report number: https://www.usenix.org/system/files/atc21-romero.pdf

  19. arXiv:1903.07754  [pdf, other]

    cs.DC

    A New Frontier for Pull-Based Graph Processing

    Authors: Samuel Grossman, Christos Kozyrakis

    Abstract: The trade-off between pull-based and push-based graph processing engines is well-understood. On one hand, pull-based engines can achieve higher throughput because their workloads are read-dominant, rather than write-dominant, and can proceed without synchronization between threads. On the other hand, push-based engines are much better able to take advantage of the frontier optimization, which leve…

    Submitted 18 March, 2019; originally announced March 2019.

  20. arXiv:1812.09442  [pdf, other]

    cs.DC

    Trevor: Automatic configuration and scaling of stream processing pipelines

    Authors: Manu Bansal, Eyal Cidon, Arjun Balasingam, Aditya Gudipati, Christos Kozyrakis, Sachin Katti

    Abstract: Operating a distributed data stream processing workload efficiently at scale is hard. The operator of the workload must parallelize and lay out tasks of the workload with resources that match the requirement of target data rate. The challenge is that neither the operator nor the programmer is typically aware of the scaling behavior of the workload as a function of resources. An operator manually s…

    Submitted 21 December, 2018; originally announced December 2018.

  21. Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators

    Authors: Xuan Yang, Mingyu Gao, Qiaoyi Liu, Jeff Ou Setter, Jing Pu, Ankita Nayak, Steven Emberton Bell, Kaidi Cao, Heonjae Ha, Priyanka Raina, Christos Kozyrakis, Mark Horowitz

    Abstract: We show that DNN accelerator micro-architectures and their program mappings represent specific choices of loop order and hardware parallelism for computing the seven nested loops of DNNs, which enables us to create a formal taxonomy of all existing dense DNN accelerators. Surprisingly, the loop transformations needed to create these hardware variants can be precisely and concisely represented by H…

    Submitted 26 April, 2020; v1 submitted 10 September, 2018; originally announced September 2018.

    Comments: Published as a conference paper at ASPLOS 2020

    ACM Class: C.1.4; C.3; C.4

    Journal ref: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, March, 2020, Pages 369-383

  22. arXiv:1803.02329  [pdf, other]

    cs.LG stat.ML

    Learning Memory Access Patterns

    Authors: Milad Hashemi, Kevin Swersky, Jamie A. Smith, Grant Ayers, Heiner Litz, Jichuan Chang, Christos Kozyrakis, Parthasarathy Ranganathan

    Abstract: The explosion in workload complexity and the recent slow-down in Moore's law scaling call for new approaches towards efficient computing. Researchers are now beginning to use recent advances in machine learning in software optimizations, augmenting or replacing traditional heuristics and data structures. However, the space of machine learning for computer hardware architecture is only lightly expl…

    Submitted 6 March, 2018; originally announced March 2018.

  23. arXiv:1711.02294  [pdf, other]

    cs.NI cs.OS

    AppSwitch: Resolving the Application Identity Crisis

    Authors: Dinesh Subhraveti, Sri Goli, Serge Hallyn, Ravi Chamarthy, Christos Kozyrakis

    Abstract: Networked applications traditionally derive their identity from the identity of the host on which they run. The default application identity acquired from the host results in subtle and substantial problems related to application deployment, discovery and access, especially for modern distributed applications. A number of mechanisms and workarounds, often quite elaborate, are used to address those…

    Submitted 8 November, 2017; v1 submitted 7 November, 2017; originally announced November 2017.

  24. arXiv:1511.06968  [pdf, other]

    cs.DC cs.PL

    Generating Configurable Hardware from Parallel Patterns

    Authors: Raghu Prabhakar, David Koeplinger, Kevin Brown, HyoukJoong Lee, Christopher De Sa, Christos Kozyrakis, Kunle Olukotun

    Abstract: In recent years the computing landscape has seen an increasing shift towards specialized accelerators. Field programmable gate arrays (FPGAs) are particularly promising as they offer significant performance and energy improvements compared to CPUs for a wide class of applications and are far more flexible than fixed-function ASICs. However, FPGAs are difficult to program. Traditional programmi…

    Submitted 22 November, 2015; originally announced November 2015.