Search | arXiv e-print repository

arXiv:2006.12587 [pdf, other]

PipeSim: Trace-driven Simulation of Large-Scale AI Operations Platforms

Authors: Thomas Rausch, Waldemar Hummer, Vinod Muthusamy

Abstract: Operationalizing AI has become a major endeavor in both research and industry. Automated, operationalized pipelines that manage the AI application lifecycle will form a significant part of tomorrow's infrastructure workloads. To optimize operations of production-grade AI workflow platforms we can leverage existing scheduling approaches, yet it is challenging to fine-tune operational strategies tha… ▽ More Operationalizing AI has become a major endeavor in both research and industry. Automated, operationalized pipelines that manage the AI application lifecycle will form a significant part of tomorrow's infrastructure workloads. To optimize operations of production-grade AI workflow platforms we can leverage existing scheduling approaches, yet it is challenging to fine-tune operational strategies that achieve application-specific cost-benefit tradeoffs while catering to the specific domain characteristics of machine learning (ML) models, such as accuracy, robustness, or fairness. We present a trace-driven simulation-based experimentation and analytics environment that allows researchers and engineers to devise and evaluate such operational strategies for large-scale AI workflow systems. Analytics data from a production-grade AI platform developed at IBM are used to build a comprehensive simulation model. Our simulation model describes the interaction between pipelines and system infrastructure, and how pipeline tasks affect different ML model metrics. We implement the model in a standalone, stochastic, discrete event simulator, and provide a toolkit for running experiments. Synthetic traces are made available for ad-hoc exploration as well as statistical analysis of experiments to test and examine pipeline scheduling, cluster resource allocation, and similar operational mechanisms. △ Less

Submitted 22 June, 2020; originally announced June 2020.

Comments: 11 pages, 13 figures, extended version of OpML'20 paper

ACM Class: I.6; H.4; I.2.m

arXiv:1805.06801 [pdf, other]

Dependability in a Multi-tenant Multi-framework Deep Learning as-a-Service Platform

Authors: Scott Boag, Parijat Dube, Kaoutar El Maghraoui, Benjamin Herta, Waldemar Hummer, K. R. Jayaram, Rania Khalaf, Vinod Muthusamy, Michael Kalantar, Archit Verma

Abstract: Deep learning (DL), a form of machine learning, is becoming increasingly popular in several application domains. As a result, cloud-based Deep Learning as a Service (DLaaS) platforms have become an essential infrastructure in many organizations. These systems accept, schedule, manage and execute DL training jobs at scale. This paper explores dependability in the context of a DLaaS platform used… ▽ More Deep learning (DL), a form of machine learning, is becoming increasingly popular in several application domains. As a result, cloud-based Deep Learning as a Service (DLaaS) platforms have become an essential infrastructure in many organizations. These systems accept, schedule, manage and execute DL training jobs at scale. This paper explores dependability in the context of a DLaaS platform used in IBM. We begin by explaining how DL training workloads are different, and what features ensure dependability in this context. We then describe the architecture, design and implementation of a cloud-based orchestration system for DL training. We show how this system has been architected with dependability in mind while also being horizontally scalable, elastic, flexible and efficient. We also present an initial empirical evaluation of the overheads introduced by our platform, and discuss tradeoffs between efficiency and dependability. △ Less

Submitted 17 May, 2018; originally announced May 2018.

arXiv:1411.2392 [pdf, other]

JCloudScale: Closing the Gap Between IaaS and PaaS

Authors: Rostyslav Zabolotnyi, Philipp Leitner, Waldemar Hummer, Schahram Dustdar

Abstract: The Infrastructure-as-a-Service (IaaS) model of cloud computing is a promising approach towards building elastically scaling systems. Unfortunately, building such applications today is a complex, repetitive and error-prone endeavor, as IaaS does not provide any abstraction on top of naked virtual machines. Hence, all functionality related to elasticity needs to be implemented anew for each applica… ▽ More The Infrastructure-as-a-Service (IaaS) model of cloud computing is a promising approach towards building elastically scaling systems. Unfortunately, building such applications today is a complex, repetitive and error-prone endeavor, as IaaS does not provide any abstraction on top of naked virtual machines. Hence, all functionality related to elasticity needs to be implemented anew for each application. In this paper, we present JCloudScale, a Java-based middleware that supports building elastic applications on top of a public or private IaaS cloud. JCloudScale allows to easily bring applications to the cloud, with minimal changes to the application code. We discuss the general architecture of the middleware as well as its technical features, and evaluate our system with regard to both, user acceptance (based on a user study) and performance overhead. Our results indicate that JCloudScale indeed allowed many participants to build IaaS applications more efficiently, comparable to the convenience features provided by industrial Platform-as-a-Service (PaaS) solutions. However, unlike PaaS, using JCloudScale does not lead to a loss of control and vendor lock-in for the developer. △ Less

Submitted 10 November, 2014; originally announced November 2014.

Showing 1–3 of 3 results for author: Hummer, W