Data Sharing Options for Scientific Workflows on Amazon EC2
Authors:
Gideon Juve,
Ewa Deelman,
Karan Vahi,
Gaurang Mehta,
Bruce Berriman,
Benjamin P. Berman,
Phil Maechling
Abstract:
Efficient data management is a key component in achieving good performance for scientific workflows in distributed environments. Workflow applications typically communicate data between tasks using files. When tasks are distributed, these files are either transferred from one computational node to another, or accessed through a shared storage system. In grids and clusters, workflow data is often stored on network and parallel file systems. In this paper we investigate some of the ways in which data can be managed for workflows in the cloud. We ran experiments using three typical workflow applications on Amazon's EC2. We discuss the various storage and file systems we used, describe the issues and problems we encountered deploying them on EC2, and analyze the resulting performance and cost of the workflows.
Submitted 22 October, 2010;
originally announced October 2010.
Scientific Workflow Applications on Amazon EC2
Authors:
Gideon Juve,
Ewa Deelman,
Karan Vahi,
Gaurang Mehta,
Bruce Berriman,
Benjamin P. Berman,
Phil Maechling
Abstract:
The proliferation of commercial cloud computing providers has generated significant interest in the scientific computing community. Much recent research has attempted to determine the benefits and drawbacks of cloud computing for scientific applications. Although clouds have many attractive features, such as virtualization, on-demand provisioning, and "pay as you go" usage-based pricing, it is not clear whether they are able to deliver the performance required for scientific applications at a reasonable price. In this paper we examine the performance and cost of clouds from the perspective of scientific workflow applications. We use three characteristic workflows to compare the performance of a commercial cloud with that of a typical HPC system, and we analyze the various costs associated with running those workflows in the cloud. We find that the performance of clouds is not unreasonable given the hardware resources provided, and that performance comparable to HPC systems can be achieved given similar resources. We also find that the cost of running workflows on a commercial cloud can be reduced by storing data in the cloud rather than transferring it from outside.
Submitted 15 May, 2010;
originally announced May 2010.