-
Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics
Authors:
George Michelogiannakis,
Yehia Arafa,
Brandon Cook,
Liang Yuan Dai,
Abdel Hameed Badawy,
Madeleine Glick,
Yuyang Wang,
Keren Bergman,
John Shalf
Abstract:
The diversity of workload requirements and increasing hardware heterogeneity in emerging high performance computing (HPC) systems motivate resource disaggregation. Resource disaggregation allows compute and memory resources to be allocated individually as required to each workload. However, it is unclear how to efficiently realize this capability and cost-effectively meet the stringent bandwidth a…
▽ More
The diversity of workload requirements and increasing hardware heterogeneity in emerging high performance computing (HPC) systems motivate resource disaggregation. Resource disaggregation allows compute and memory resources to be allocated individually as required to each workload. However, it is unclear how to efficiently realize this capability and cost-effectively meet the stringent bandwidth and latency requirements of HPC applications. To that end, we describe how modern photonics can be co-designed with modern HPC racks to implement flexible intra-rack resource disaggregation and fully meet the bit error rate (BER) and high escape bandwidth of all chip types in modern HPC racks. Our photonic-based disaggregated rack provides an average application speedup of 11% (46% maximum) for 25 CPU and 61% for 24 GPU benchmarks compared to a similar system that instead uses modern electronic switches for disaggregation. Using observed resource usage from a production system, we estimate that an iso-performance intra-rack disaggregated HPC system using photonics would require 4x fewer memory modules and 2x fewer NICs than a non-disaggregated baseline.
△ Less
Submitted 17 July, 2023; v1 submitted 9 January, 2023;
originally announced January 2023.
-
Preparing for the Future -- Rethinking Proxy Apps
Authors:
Satoshi Matsuoka,
Jens Domke,
Mohamed Wahib,
Aleksandr Drozd,
Ray Bair,
Andrew A. Chien,
Jeffrey S. Vetter,
John Shalf
Abstract:
A considerable amount of research and engineering went into designing proxy applications, which represent common high-performance computing workloads, to co-design and evaluate the current generation of supercomputers, e.g., RIKEN's Supercomputer Fugaku, ANL's Aurora, or ORNL's Frontier. This process was necessary to standardize the procurement while avoiding duplicated effort at each HPC center t…
▽ More
A considerable amount of research and engineering went into designing proxy applications, which represent common high-performance computing workloads, to co-design and evaluate the current generation of supercomputers, e.g., RIKEN's Supercomputer Fugaku, ANL's Aurora, or ORNL's Frontier. This process was necessary to standardize the procurement while avoiding duplicated effort at each HPC center to develop their own benchmarks. Unfortunately, proxy applications force HPC centers and providers (vendors) into a an undesirable state of rigidity, in contrast to the fast-moving trends of current technology and future heterogeneity. To accommodate an extremely-heterogeneous future, we have to reconsider how to co-design supercomputers during the next decade, and avoid repeating the past mistakes. This position paper outlines the current state-of-the-art in system co-design, challenges encountered over the past years, and a proposed plan to move forward.
△ Less
Submitted 15 April, 2022;
originally announced April 2022.
-
TIGER: Topology-aware Assignment using Ising machines Application to Classical Algorithm Tasks and Quantum Circuit Gates
Authors:
Anastasiia Butko,
Ilyas Turimbetov,
George Michelogiannakis,
David Donofrio,
Didem Unat,
John Shalf
Abstract:
Optimally map** a parallel application to compute and communication resources is increasingly important as both system size and heterogeneity increase. A similar map** problem exists in gate-based quantum computing where the objective is to map tasks to gates in a topology-aware fashion. This is an NP-complete graph isomorphism problem, and existing task assignment approaches are either heuris…
▽ More
Optimally map** a parallel application to compute and communication resources is increasingly important as both system size and heterogeneity increase. A similar map** problem exists in gate-based quantum computing where the objective is to map tasks to gates in a topology-aware fashion. This is an NP-complete graph isomorphism problem, and existing task assignment approaches are either heuristic or based on physical optimization algorithms, providing different speed and solution quality trade-offs. Ising machines such as quantum and digital annealers have recently become available and offer an alternative hardware solution to solve this type of optimization problems. In this paper, we propose an algorithm that allows solving the topology-aware assignment problem using Ising machines. We demonstrate the algorithm on two use cases, i.e. classical task scheduling and quantum circuit gate scheduling. TIGER---topology-aware task/gate assignment mapper tool---implements our proposed algorithms and automatically integrates them into the quantum software environment. To address the limitations of physical solver, we propose and implement a domain-specific partition strategy that allows solving larger-scale problems and a weight optimization algorithm that allows tuning Ising model parameters to achieve better restuls. We use D-Wave's quantum annealer to demonstrate our algorithm and evaluate the proposed tool flow in terms of performance, partition efficiency, and solution quality. Results show significant speed-up compared to classical solutions, better scalability, and higher solution quality when using TIGER together with the proposed partition method. It reduces the data movement cost by 68\% in average for quantum circuit assignment compared to the IBM QX optimizer.
△ Less
Submitted 21 September, 2020;
originally announced September 2020.
-
Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization
Authors:
Anastasiia Butko,
George Michelogiannakis,
Samuel Williams,
Costin Iancu,
David Donofrio,
John Shalf,
Jonathan Carter,
Irfan Siddiqi
Abstract:
Continuing the scaling of quantum computers hinges on building classical control hardware pipelines that are scalable, extensible, and provide real time response. The instruction set architecture (ISA) of the control processor provides functional abstractions that map high-level semantics of quantum programming languages to low-level pulse generation by hardware. In this paper, we provide a method…
▽ More
Continuing the scaling of quantum computers hinges on building classical control hardware pipelines that are scalable, extensible, and provide real time response. The instruction set architecture (ISA) of the control processor provides functional abstractions that map high-level semantics of quantum programming languages to low-level pulse generation by hardware. In this paper, we provide a methodology to quantitatively assess the effectiveness of the ISA to encode quantum circuits for intermediate-scale quantum devices with O($10^2$) qubits. The characterization model that we define reflects performance, the ability to meet timing constraint implications, scalability for future quantum chips, and other important considerations making them useful guides for future designs. Using our methodology, we propose scalar (QUASAR) and vector (qV) quantum ISAs as extensions and compare them with other ISAs in metrics such as circuit encoding efficiency, the ability to meet real-time gate cycle requirements of quantum chips, and the ability to scale to more qubits.
△ Less
Submitted 3 December, 2020; v1 submitted 25 September, 2019;
originally announced September 2019.
-
TraceTracker: Hardware/Software Co-Evaluation for Large-Scale I/O Workload Reconstruction
Authors:
Miryeong Kwon,
Jie Zhang,
Gyuyoung Park,
Wonil Choi,
David Donofrio,
John Shalf,
Mahmut Kandemir,
Myoungsoo Jung
Abstract:
Block traces are widely used for system studies, model verifications, and design analyses in both industry and academia. While such traces include detailed block access patterns, existing trace-driven research unfortunately often fails to find true-north due to a lack of runtime contexts such as user idle periods and system delays, which are fundamentally linked to the characteristics of target st…
▽ More
Block traces are widely used for system studies, model verifications, and design analyses in both industry and academia. While such traces include detailed block access patterns, existing trace-driven research unfortunately often fails to find true-north due to a lack of runtime contexts such as user idle periods and system delays, which are fundamentally linked to the characteristics of target storage hardware. In this work, we propose TraceTracker, a novel hardware/software co-evaluation method that allows users to reuse a broad range of the existing block traces by kee** most their execution contexts and user scenarios while adjusting them with new system information. Specifically, our TraceTracker's software evaluation model can infer CPU burst times and user idle periods from old storage traces, whereas its hardware evaluation method remasters the storage traces by interoperating the inferred time information, and updates all inter-arrival times by making them aware of the target storage system. We apply the proposed co-evaluation model to 577 traces, which were collected by servers from different institutions and locations a decade ago, and revive the traces on a high-performance flash-based storage array. The evaluation results reveal that the accuracy of the execution contexts reconstructed by TraceTracker is on average 99% and 96% with regard to the frequency of idle operations and the total idle periods, respectively.
△ Less
Submitted 14 September, 2017;
originally announced September 2017.
-
SimpleSSD: Modeling Solid State Drives for Holistic System Simulation
Authors:
Myoungsoo Jung,
Jie Zhang,
Ahmed Abulila,
Miryeong Kwon,
Narges Shahidi,
John Shalf,
Nam Sung Kim,
Mahmut Kandemir
Abstract:
Existing solid state drive (SSD) simulators unfortunately lack hardware and/or software architecture models. Consequently, they are far from capturing the critical features of contemporary SSD devices. More importantly, while the performance of modern systems that adopt SSDs can vary based on their numerous internal design parameters and storage-level configurations, a full system simulation with…
▽ More
Existing solid state drive (SSD) simulators unfortunately lack hardware and/or software architecture models. Consequently, they are far from capturing the critical features of contemporary SSD devices. More importantly, while the performance of modern systems that adopt SSDs can vary based on their numerous internal design parameters and storage-level configurations, a full system simulation with traditional SSD models often requires unreasonably long runtimes and excessive computational resources. In this work, we propose SimpleSSD, a highfidelity simulator that models all detailed characteristics of hardware and software, while simplifying the nondescript features of storage internals. In contrast to existing SSD simulators, SimpleSSD can easily be integrated into publicly-available full system simulators. In addition, it can accommodate a complete storage stack and evaluate the performance of SSDs along with diverse memory technologies and microarchitectures. Thus, it facilitates simulations that explore the full design space at different levels of system abstraction.
△ Less
Submitted 14 September, 2017; v1 submitted 18 May, 2017;
originally announced May 2017.
-
BoxLib with Tiling: An AMR Software Framework
Authors:
Weiqun Zhang,
Ann Almgren,
Marcus Day,
Tan Nguyen,
John Shalf,
Didem Unat
Abstract:
In this paper we introduce a block-structured adaptive mesh refinement (AMR) software framework that incorporates tiling, a well-known loop transformation. Because the multiscale, multiphysics codes built in BoxLib are designed to solve complex systems at high resolution, performance on current and next generation architectures is essential. With the expectation of many more cores per node on next…
▽ More
In this paper we introduce a block-structured adaptive mesh refinement (AMR) software framework that incorporates tiling, a well-known loop transformation. Because the multiscale, multiphysics codes built in BoxLib are designed to solve complex systems at high resolution, performance on current and next generation architectures is essential. With the expectation of many more cores per node on next generation architectures, the ability to effectively utilize threads within a node is essential, and the current model for parallelization will not be sufficient. We describe a new version of BoxLib in which the tiling constructs are embedded so that BoxLib-based applications can easily realize expected performance gains without extra effort on the part of the application developer. We also discuss a path forward to enable future versions of BoxLib to take advantage of NUMA-aware optimizations using the TiDA portable library.
△ Less
Submitted 12 April, 2016;
originally announced April 2016.
-
Cactus Framework: Black Holes to Gamma Ray Bursts
Authors:
Erik Schnetter,
Christian D. Ott,
Gabrielle Allen,
Peter Diener,
Tom Goodale,
Thomas Radke,
Edward Seidel,
John Shalf
Abstract:
Gamma Ray Bursts (GRBs) are intense narrowly-beamed flashes of gamma-rays of cosmological origin. They are among the most scientifically interesting astrophysical systems, and the riddle concerning their central engines and emission mechanisms is one of the most complex and challenging problems of astrophysics today. In this article we outline our petascale approach to the GRB problem and discus…
▽ More
Gamma Ray Bursts (GRBs) are intense narrowly-beamed flashes of gamma-rays of cosmological origin. They are among the most scientifically interesting astrophysical systems, and the riddle concerning their central engines and emission mechanisms is one of the most complex and challenging problems of astrophysics today. In this article we outline our petascale approach to the GRB problem and discuss the computational toolkits and numerical codes that are currently in use and that will be scaled up to run on emerging petaflop scale computing platforms in the near future.
Petascale computing will require additional ingredients over conventional parallelism. We consider some of the challenges which will be caused by future petascale architectures, and discuss our plans for the future development of the Cactus framework and its applications to meet these challenges in order to profit from these new architectures.
△ Less
Submitted 11 July, 2007;
originally announced July 2007.
-
The Cactus Worm: Experiments with Dynamic Resource Discovery and Allocation in a Grid Environment
Authors:
Gabrielle Allen,
David Angulo,
Ian Foster,
Gerd Lanfermann,
Chuang Liu,
Thomas Radke,
Ed Seidel,
John Shalf
Abstract:
The ability to harness heterogeneous, dynamically available "Grid" resources is attractive to typically resource-starved computational scientists and engineers, as in principle it can increase, by significant factors, the number of cycles that can be delivered to applications. However, new adaptive application structures and dynamic runtime system mechanisms are required if we are to operate eff…
▽ More
The ability to harness heterogeneous, dynamically available "Grid" resources is attractive to typically resource-starved computational scientists and engineers, as in principle it can increase, by significant factors, the number of cycles that can be delivered to applications. However, new adaptive application structures and dynamic runtime system mechanisms are required if we are to operate effectively in Grid environments. In order to explore some of these issues in a practical setting, we are develo** an experimental framework, called Cactus, that incorporates both adaptive application structures for dealing with changing resource characteristics and adaptive resource selection mechanisms that allow applications to change their resource allocations (e.g., via migration) when performance falls outside specified limits. We describe here the adaptive resource selection mechanisms and describe how they are used to achieve automatic application migration to "better" resources following performance degradation. Our results provide insights into the architectural structures required to support adaptive resource selection. In addition, we suggest that this "Cactus Worm" is an interesting challenge problem for Grid computing.
△ Less
Submitted 1 August, 2001;
originally announced August 2001.