-
Serverless Supercomputing: High Performance Function as a Service for Science
Authors:
Ryan Chard,
Tyler J. Skluzacek,
Zhuozhao Li,
Yadu Babuji,
Anna Woodard,
Ben Blaiszik,
Steven Tuecke,
Ian Foster,
Kyle Chard
Abstract:
Growing data volumes and velocities are driving exciting new methods across the sciences in which data analytics and machine learning are increasingly intertwined with research. These new methods require new approaches for scientific computing in which computation is mobile, so that, for example, it can occur near data, be triggered by events (e.g., arrival of new data), or be offloaded to special…
▽ More
Growing data volumes and velocities are driving exciting new methods across the sciences in which data analytics and machine learning are increasingly intertwined with research. These new methods require new approaches for scientific computing in which computation is mobile, so that, for example, it can occur near data, be triggered by events (e.g., arrival of new data), or be offloaded to specialized accelerators. They also require new design approaches in which monolithic applications can be decomposed into smaller components, that may in turn be executed separately and on the most efficient resources. To address these needs we propose funcX---a high-performance function-as-a-service (FaaS) platform that enables intuitive, flexible, efficient, scalable, and performant remote function execution on existing infrastructure including clouds, clusters, and supercomputers. It allows users to register and then execute Python functions without regard for the physical resource location, scheduler architecture, or virtualization technology on which the function is executed---an approach we refer to as "serverless supercomputing." We motivate the need for funcX in science, describe our prototype implementation, and demonstrate, via experiments on two supercomputers, that funcX can process millions of functions across more than 65000 concurrent workers. We also outline five scientific scenarios in which funcX has been deployed and highlight the benefits of funcX in these scenarios.
△ Less
Submitted 13 August, 2019;
originally announced August 2019.
-
DLHub: Model and Data Serving for Science
Authors:
Ryan Chard,
Zhuozhao Li,
Kyle Chard,
Logan Ward,
Yadu Babuji,
Anna Woodard,
Steve Tuecke,
Ben Blaiszik,
Michael J. Franklin,
Ian Foster
Abstract:
While the Machine Learning (ML) landscape is evolving rapidly, there has been a relative lag in the development of the "learning systems" needed to enable broad adoption. Furthermore, few such systems are designed to support the specialized requirements of scientific ML. Here we present the Data and Learning Hub for science (DLHub), a multi-tenant system that provides both model repository and ser…
▽ More
While the Machine Learning (ML) landscape is evolving rapidly, there has been a relative lag in the development of the "learning systems" needed to enable broad adoption. Furthermore, few such systems are designed to support the specialized requirements of scientific ML. Here we present the Data and Learning Hub for science (DLHub), a multi-tenant system that provides both model repository and serving capabilities with a focus on science applications. DLHub addresses two significant shortcomings in current systems. First, its selfservice model repository allows users to share, publish, verify, reproduce, and reuse models, and addresses concerns related to model reproducibility by packaging and distributing models and all constituent components. Second, it implements scalable and low-latency serving capabilities that can leverage parallel and distributed computing resources to democratize access to published models through a simple web interface. Unlike other model serving frameworks, DLHub can store and serve any Python 3-compatible model or processing function, plus multiple-function pipelines. We show that relative to other model serving systems including TensorFlow Serving, SageMaker, and Clipper, DLHub provides greater capabilities, comparable performance without memoization and batching, and significantly better performance when the latter two techniques can be employed. We also describe early uses of DLHub for scientific applications.
△ Less
Submitted 27 November, 2018;
originally announced November 2018.
-
Security for Grid Services
Authors:
Von Welch,
Frank Siebenlist,
Ian Foster,
John Bresnahan,
Karl Czajkowski,
Jarek Gawor,
Carl Kesselman,
Sam Meder,
Laura Pearlman,
Steven Tuecke
Abstract:
Grid computing is concerned with the sharing and coordinated use of diverse resources in distributed "virtual organizations." The dynamic and multi-institutional nature of these environments introduces challenging security issues that demand new technical approaches. In particular, one must deal with diverse local mechanisms, support dynamic creation of services, and enable dynamic creation of t…
▽ More
Grid computing is concerned with the sharing and coordinated use of diverse resources in distributed "virtual organizations." The dynamic and multi-institutional nature of these environments introduces challenging security issues that demand new technical approaches. In particular, one must deal with diverse local mechanisms, support dynamic creation of services, and enable dynamic creation of trust domains. We describe how these issues are addressed in two generations of the Globus Toolkit. First, we review the Globus Toolkit version 2 (GT2) approach; then, we describe new approaches developed to support the Globus Toolkit version 3 (GT3) implementation of the Open Grid Services Architecture, an initiative that is recasting Grid concepts within a service oriented framework based on Web services. GT3's security implementation uses Web services security mechanisms for credential exchange and other purposes, and introduces a tight least-privilege model that avoids the need for any privileged network service.
△ Less
Submitted 24 June, 2003;
originally announced June 2003.
-
The Community Authorization Service: Status and Future
Authors:
L. Pearlman,
V. Welch,
I. Foster,
C. Kesselman,
S. Tuecke
Abstract:
Virtual organizations (VOs) are communities of resource providers and users distributed over multiple policy domains. These VOs often wish to define and enforce consistent policies in addition to the policies of their underlying domains. This is challenging, not only because of the problems in distributing the policy to the domains, but also because of the fact that those domains may each have d…
▽ More
Virtual organizations (VOs) are communities of resource providers and users distributed over multiple policy domains. These VOs often wish to define and enforce consistent policies in addition to the policies of their underlying domains. This is challenging, not only because of the problems in distributing the policy to the domains, but also because of the fact that those domains may each have different capabilities for enforcing the policy. The Community Authorization Service (CAS) solves this problem by allowing resource providers to delegate some policy authority to the VO while maintaining ultimate control over their resources. In this paper we describe CAS and our past and current implementations of CAS, and we discuss our plans for CAS-related research.
△ Less
Submitted 13 June, 2003;
originally announced June 2003.
-
A Community Authorization Service for Group Collaboration
Authors:
Laura Pearlman,
Von Welch,
Ian Foster,
Carl Kesselman,
Steven Tuecke
Abstract:
In "Grids" and "collaboratories," we find distributed communities of resource providers and resource consumers, within which often complex and dynamic policies govern who can use which resources for which purpose. We propose a new approach to the representation, maintenance, and enforcement of such policies that provides a scalable mechanism for specifying and enforcing these policies. Our appro…
▽ More
In "Grids" and "collaboratories," we find distributed communities of resource providers and resource consumers, within which often complex and dynamic policies govern who can use which resources for which purpose. We propose a new approach to the representation, maintenance, and enforcement of such policies that provides a scalable mechanism for specifying and enforcing these policies. Our approach allows resource providers to delegate some of the authority for maintaining fine-grained access control policies to communities, while still maintaining ultimate control over their resources. We also describe a prototype implementation of this approach and an application in a data management context.
△ Less
Submitted 12 June, 2003;
originally announced June 2003.
-
Replica Selection in the Globus Data Grid
Authors:
Sudharshan Vazhkudai,
Steven Tuecke,
Ian Foster
Abstract:
The Globus Data Grid architecture provides a scalable infrastructure for the management of storage resources and data that are distributed across Grid environments. These services are designed to support a variety of scientific applications, ranging from high-energy physics to computational genomics, that require access to large amounts of data (terabytes or even petabytes) with varied quality o…
▽ More
The Globus Data Grid architecture provides a scalable infrastructure for the management of storage resources and data that are distributed across Grid environments. These services are designed to support a variety of scientific applications, ranging from high-energy physics to computational genomics, that require access to large amounts of data (terabytes or even petabytes) with varied quality of service requirements. By layering on a set of core services, such as data transport, security, and replica cataloging, one can construct various higher-level services. In this paper, we discuss the design and implementation of a high-level replica selection service that uses information regarding replica location and user preferences to guide selection from among storage replica alternatives. We first present a basic replica selection service design, then show how dynamic information collected using Globus information service capabilities concerning storage system properties can help improve and optimize the selection process. We demonstrate the use of Condor's ClassAds resource description and matchmaking mechanism as an efficient tool for representing and matching storage resource capabilities and policies against application requirements.
△ Less
Submitted 2 April, 2001;
originally announced April 2001.
-
The Anatomy of the Grid - Enabling Scalable Virtual Organizations
Authors:
Ian Foster,
Carl Kesselman,
Steven Tuecke
Abstract:
"Grid" computing has emerged as an important new field, distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high-performance orientation. In this article, we define this new field. First, we review the "Grid problem," which we define as flexible, secure, coordinated resource sharing among dynamic collect…
▽ More
"Grid" computing has emerged as an important new field, distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high-performance orientation. In this article, we define this new field. First, we review the "Grid problem," which we define as flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources-what we refer to as virtual organizations. In such settings, we encounter unique authentication, authorization, resource access, resource discovery, and other challenges. It is this class of problem that is addressed by Grid technologies. Next, we present an extensible and open Grid architecture, in which protocols, services, application programming interfaces, and software development kits are categorized according to their roles in enabling resource sharing. We describe requirements that we believe any such mechanisms must satisfy, and we discuss the central role played by the intergrid protocols that enable interoperability among different Grid systems. Finally, we discuss how Grid technologies relate to other contemporary technologies, including enterprise integration, application service provider, storage service provider, and peer-to-peer computing. We maintain that Grid concepts and technologies complement and have much to contribute to these other approaches.
△ Less
Submitted 29 March, 2001;
originally announced March 2001.
-
Secure, Efficient Data Transport and Replica Management for High-Performance Data-Intensive Computing
Authors:
Bill Allcock,
Joe Bester,
John Bresnahan,
Ann L. Chervenak,
Ian Foster,
Carl Kesselman,
Sam Meder,
Veronika Nefedova,
Darcy Quesnel,
Steven Tuecke
Abstract:
An emerging class of data-intensive applications involve the geographically dispersed extraction of complex scientific information from very large collections of measured or computed data. Such applications arise, for example, in experimental physics, where the data in question is generated by accelerators, and in simulation science, where the data is generated by supercomputers. So-called Data…
▽ More
An emerging class of data-intensive applications involve the geographically dispersed extraction of complex scientific information from very large collections of measured or computed data. Such applications arise, for example, in experimental physics, where the data in question is generated by accelerators, and in simulation science, where the data is generated by supercomputers. So-called Data Grids provide essential infrastructure for such applications, much as the Internet provides essential services for applications such as e-mail and the Web. We describe here two services that we believe are fundamental to any Data Grid: reliable, high-speed transporet and replica management. Our high-speed transport service, GridFTP, extends the popular FTP protocol with new features required for Data Grid applciations, such as stri** and partial file access. Our replica management service integrates a replica catalog with GridFTP transfers to provide for the creation, registration, location, and management of dataset replicas. We present the design of both services and also preliminary performance results. Our implementations exploit security and other services provided by the Globus Toolkit.
△ Less
Submitted 28 March, 2001;
originally announced March 2001.