-
The DEEP-ER project: I/O and resiliency extensions for the Cluster-Booster architecture
Authors:
Anke Kreuzer,
Norbert Eicker,
Jorge Amaya,
Raphael Leger,
Estela Suarez
Abstract:
The recently completed research project DEEP-ER has developed a variety of hardware and software technologies to improve the I/O capabilities of next-generation high-performance computers and to enable applications to recover from the higher hardware failure rates expected on these machines.
The heterogeneous Cluster-Booster architecture, first introduced in the predecessor DEEP project, has been extended by a multi-level memory hierarchy employing non-volatile and network-attached memory devices. Based on this hardware infrastructure, an I/O and resiliency software stack has been implemented, combining and extending well-established libraries and software tools while adhering to standard user interfaces. Real-world scientific codes have tested the project's developments and demonstrated the improvements achieved without compromising the portability of the applications.
Submitted 15 April, 2019;
originally announced April 2019.
-
Application performance on a Cluster-Booster system
Authors:
Anke Kreuzer,
Jorge Amaya,
Norbert Eicker,
Estela Suarez
Abstract:
The DEEP projects have developed a variety of hardware and software technologies aimed at improving the efficiency and usability of next-generation high-performance computers. They revolve around an innovative concept for heterogeneous systems: the Cluster-Booster architecture. In it, a general-purpose cluster is tightly coupled to a many-core system (the Booster). This modular way of integrating heterogeneous components enables applications to freely choose the kind of computing resources on which they run most efficiently. Codes might even be partitioned to map the specific requirements of code parts onto the best-suited hardware. This paper presents, for the first time, measurements from a real-world scientific application that demonstrate the performance gain achieved with this kind of code-partitioning approach.
Submitted 10 April, 2019;
originally announced April 2019.
-
QPACE -- a QCD parallel computer based on Cell processors
Authors:
H. Baier,
H. Boettiger,
M. Drochner,
N. Eicker,
U. Fischer,
Z. Fodor,
A. Frommer,
C. Gomez,
G. Goldrian,
S. Heybrock,
D. Hierl,
M. Hüsken,
T. Huth,
B. Krill,
J. Lauritsen,
T. Lippert,
T. Maurer,
B. Mendl,
N. Meyer,
A. Nobile,
I. Ouda,
M. Pivanti,
D. Pleiter,
M. Ries,
A. Schäfer, et al. (10 additional authors not shown)
Abstract:
QPACE is a novel parallel computer developed primarily for lattice QCD simulations. The compute power is provided by the IBM PowerXCell 8i processor, an enhanced version of the Cell processor that is used in the PlayStation 3. The QPACE nodes are interconnected by a custom, application-optimized 3-dimensional torus network implemented on an FPGA. To achieve the very high packaging density of 26 TFlops per rack, a new water-cooling concept has been developed and successfully realized. In this paper we give an overview of the architecture and highlight some important technical details of the system. Furthermore, we provide initial performance results and report on the installation of 8 QPACE racks providing an aggregate peak performance of 200 TFlops.
Submitted 23 December, 2009; v1 submitted 11 November, 2009;
originally announced November 2009.
-
On the scaling of computational particle physics codes on cluster computers
Authors:
Z. Sroczynski,
N. Eicker,
Th. Lippert,
B. Orth,
K. Schilling
Abstract:
Many applications in computational science are sufficiently compute-intensive that they depend on the power of parallel computing for viability. For all but the "embarrassingly parallel" problems, the performance depends upon the level of granularity that can be achieved on the computer platform.
Our computational particle physics applications require machines that can support a wide range of granularities, but in general, compute-intensive state-of-the-art projects will require finely grained distributions. Of the different types of machines available for the task, we consider cluster computers.
The use of clusters of commodity computers in high-performance computing has many advantages, including the raw price/performance ratio and the flexibility of machine configuration and upgrade. Here we focus on what is usually considered the weak point of cluster technology: the scaling behaviour when faced with a numerically intensive parallel computation. To this end we examine the scaling of our own applications from numerical quantum field theory on a cluster and infer conclusions about the more general case.
Submitted 10 July, 2003; v1 submitted 9 July, 2003;
originally announced July 2003.
-
Fast Parallel I/O on Cluster Computers
Authors:
Thomas Duessel,
Norbert Eicker,
Florin Isaila,
Thomas Lippert,
Thomas Moschny,
Hartmut Neff,
Klaus Schilling,
Walter Tichy
Abstract:
Today's cluster computers suffer from slow I/O, which slows down I/O-intensive applications. We show that fast disk I/O can be achieved by operating a parallel file system over fast networks such as Myrinet or Gigabit Ethernet.
In this paper, we demonstrate how the ParaStation3 communication system helps speed up parallel I/O on clusters, using the open-source parallel virtual file system (PVFS) as testbed and production system. We describe the set-up of PVFS on the Alpha-Linux-Cluster-Engine (ALiCE) located at Wuppertal University, Germany. Benchmarks on ALiCE achieve write performance of up to 1 GB/s from a 32-processor compute partition to a 32-processor PVFS I/O partition, outperforming known benchmark results for PVFS on the same network by more than a factor of 2. Read performance from buffer cache reaches up to 2.2 GB/s. Our benchmarks are giant, I/O-intensive eigenmode problems from lattice quantum chromodynamics, demonstrating the stability and performance of PVFS over ParaStation in large-scale production runs.
Submitted 19 March, 2003;
originally announced March 2003.