Showing 1–10 of 10 results for author: Kalbarczyk, Z T

  1. arXiv:2404.11169

    cs.DC

    Mutiny! How does Kubernetes fail, and what can we do about it?

    Authors: Marco Barletta, Marcello Cinque, Catello Di Martino, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

    Abstract: In this paper, we i) analyze and classify real-world failures of Kubernetes (the most popular container orchestration system), ii) develop a framework to perform a fault/error injection campaign targeting the data store preserving the cluster state, and iii) compare results of our fault/error injection experiments with real-world failures, showing that our fault/error injections can recreate many…

    Submitted 17 April, 2024; originally announced April 2024.

  2. arXiv:2404.08509

    cs.DC cs.CL cs.LG

    Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

    Authors: Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Başar, Ravishankar K. Iyer

    Abstract: Large language models (LLMs) have been driving a new wave of interactive AI applications across numerous domains. However, efficiently serving LLM inference requests is challenging due to their unpredictable execution times originating from the autoregressive nature of generative models. Existing LLM serving systems exploit first-come-first-serve (FCFS) scheduling, suffering from head-of-line bloc…

    Submitted 12 April, 2024; originally announced April 2024.

    Comments: Accepted at AIOps'24

  3. arXiv:2109.11666

    cs.OS cs.PF

    SLO beyond the Hardware Isolation Limits

    Authors: Haoran Qiu, Yongzhou Chen, Tianyin Xu, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

    Abstract: Performance isolation is a keystone for SLO guarantees with shared resources in cloud and datacenter environments. To meet SLO requirements, the state of the art relies on hardware QoS support (e.g., Intel RDT) to allocate shared resources such as last-level caches and memory bandwidth for co-located latency-critical applications. As a result, the number of latency-critical applications that can b…

    Submitted 23 September, 2021; originally announced September 2021.

  4. arXiv:2102.10837

    cs.DC cs.AI cs.AR cs.PF

    BayesPerf: Minimizing Performance Monitoring Errors Using Bayesian Statistics

    Authors: Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

    Abstract: Hardware performance counters (HPCs) that measure low-level architectural and microarchitectural events provide dynamic contextual information about the state of the system. However, HPC measurements are error-prone due to nondeterminism (e.g., undercounting due to event multiplexing, or OS interrupt-handling behaviors). In this paper, we present BayesPerf, a system for quantifying uncertainty in…

    Submitted 22 February, 2021; originally announced February 2021.

    Journal ref: Proceedings of the Twenty-Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 21), 2021

  5. arXiv:2008.08509

    cs.DC cs.PF

    FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices

    Authors: Haoran Qiu, Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

    Abstract: Modern user-facing latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing of compute resources across microservices is still challenging in production because contention for shared resources can cause latency spikes that violate the service-level objectives (SLOs) of user reque…

    Submitted 19 October, 2020; v1 submitted 19 August, 2020; originally announced August 2020.

    Comments: This paper was accepted in OSDI '20

  6. arXiv:1909.02119

    cs.DC cs.LG

    Inductive-bias-driven Reinforcement Learning For Efficient Schedules in Heterogeneous Clusters

    Authors: Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

    Abstract: The problem of scheduling workloads onto heterogeneous processors (e.g., CPUs, GPUs, FPGAs) is of fundamental importance in modern data centers. Current system schedulers rely on application/system-specific heuristics that have to be built on a case-by-case basis. Recent work has demonstrated ML techniques for automating the heuristic search by using black-box approaches which require significa…

    Submitted 30 June, 2020; v1 submitted 4 September, 2019; originally announced September 2019.

    Comments: Scheduling, Bayesian, POMDP, Sampling, Deep Reinforcement Learning, Accelerators, FPGA, GPU

    Journal ref: Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020

  7. arXiv:1907.10203

    cs.DC cs.LG

    Live Forensics for Distributed Storage Systems

    Authors: Saurabh Jha, Shengkun Cui, Tianyin Xu, Jeremy Enos, Mike Showerman, Mark Dalton, Zbigniew T. Kalbarczyk, William T. Kramer, Ravishankar K. Iyer

    Abstract: We present Kaleidoscope, an innovative system that supports live forensics for application performance problems caused by either individual component failures or resource contention issues in large-scale distributed storage systems. The design of Kaleidoscope is driven by our study of I/O failures observed in a peta-scale storage system anonymized as PetaStore. Kaleidoscope is built on three key fe…

    Submitted 23 July, 2019; originally announced July 2019.

  8. arXiv:1907.05312

    cs.DC cs.NI

    A Study of Network Congestion in Two Supercomputing High-Speed Interconnects

    Authors: Saurabh Jha, Archit Patke, Jim Brandt, Ann Gentile, Mike Showerman, Eric Roman, Zbigniew T. Kalbarczyk, William T. Kramer, Ravishankar K. Iyer

    Abstract: Network congestion in high-speed interconnects is a major source of application run time performance variation. Recent years have witnessed a surge of interest from both academia and industry in the development of novel approaches for congestion control at the network level and in application placement, mapping, and scheduling at the system-level. However, these studies are based on proxy applicat…

    Submitted 11 July, 2019; originally announced July 2019.

    Comments: Accepted for HOTI2019

  9. arXiv:1907.01051

    cs.LG cs.SE stat.ML

    ML-based Fault Injection for Autonomous Vehicles: A Case for Bayesian Fault Injection

    Authors: Saurabh Jha, Subho S. Banerjee, Timothy Tsai, Siva K. S. Hari, Michael B. Sullivan, Zbigniew T. Kalbarczyk, Stephen W. Keckler, Ravishankar K. Iyer

    Abstract: The safety and resilience of fully autonomous vehicles (AVs) are of significant concern, as exemplified by several headline-making accidents. While AV development today involves verification, validation, and testing, end-to-end assessment of AV systems under accidental faults in realistic driving scenarios has been largely unexplored. This paper presents DriveFI, a machine learning-based fault inj…

    Submitted 1 July, 2019; originally announced July 2019.

    Comments: Accepted at 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks

  10. ASAP: Accelerated Short-Read Alignment on Programmable Hardware

    Authors: Subho S. Banerjee, Mohamed El-Hadedy, Jong Bin Lim, Zbigniew T. Kalbarczyk, Deming Chen, Steve Lumetta, Ravishankar K. Iyer

    Abstract: The proliferation of high-throughput sequencing machines ensures rapid generation of up to billions of short nucleotide fragments in a short period of time. This massive amount of sequence data can quickly overwhelm today's storage and compute infrastructure. This paper explores the use of hardware acceleration to significantly improve the runtime of short-read alignment, a crucial step in preproc…

    Submitted 23 May, 2018; v1 submitted 6 March, 2018; originally announced March 2018.