Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics

Authors: George Michelogiannakis, Yehia Arafa, Brandon Cook, Liang Yuan Dai, Abdel Hameed Badawy, Madeleine Glick, Yuyang Wang, Keren Bergman, John Shalf

Abstract: The diversity of workload requirements and increasing hardware heterogeneity in emerging high performance computing (HPC) systems motivate resource disaggregation. Resource disaggregation allows compute and memory resources to be allocated individually as required to each workload. However, it is unclear how to efficiently realize this capability and cost-effectively meet the stringent bandwidth and latency requirements of HPC applications. To that end, we describe how modern photonics can be co-designed with modern HPC racks to implement flexible intra-rack resource disaggregation and fully meet the bit error rate (BER) and high escape bandwidth of all chip types in modern HPC racks. Our photonic-based disaggregated rack provides an average application speedup of 11% (46% maximum) for 25 CPU and 61% for 24 GPU benchmarks compared to a similar system that instead uses modern electronic switches for disaggregation. Using observed resource usage from a production system, we estimate that an iso-performance intra-rack disaggregated HPC system using photonics would require 4x fewer memory modules and 2x fewer NICs than a non-disaggregated baseline.

Submitted 17 July, 2023; v1 submitted 9 January, 2023; originally announced January 2023.

Comments: 15 pages, 12 figures, 4 tables. Published in IEEE Cluster 2023

ACM Class: C.2.1

arXiv:2204.07336 [pdf, ps, other]

Preparing for the Future -- Rethinking Proxy Apps

Authors: Satoshi Matsuoka, Jens Domke, Mohamed Wahib, Aleksandr Drozd, Ray Bair, Andrew A. Chien, Jeffrey S. Vetter, John Shalf

Abstract: A considerable amount of research and engineering went into designing proxy applications, which represent common high-performance computing workloads, to co-design and evaluate the current generation of supercomputers, e.g., RIKEN's Supercomputer Fugaku, ANL's Aurora, or ORNL's Frontier. This process was necessary to standardize the procurement while avoiding duplicated effort at each HPC center to develop their own benchmarks. Unfortunately, proxy applications force HPC centers and providers (vendors) into an undesirable state of rigidity, in contrast to the fast-moving trends of current technology and future heterogeneity. To accommodate an extremely heterogeneous future, we have to reconsider how to co-design supercomputers during the next decade, and avoid repeating past mistakes. This position paper outlines the current state-of-the-art in system co-design, challenges encountered over the past years, and a proposed plan to move forward.

Submitted 15 April, 2022; originally announced April 2022.

arXiv:2009.10151 [pdf, other]

TIGER: Topology-aware Assignment using Ising machines Application to Classical Algorithm Tasks and Quantum Circuit Gates

Authors: Anastasiia Butko, Ilyas Turimbetov, George Michelogiannakis, David Donofrio, Didem Unat, John Shalf

Abstract: Optimally mapping a parallel application to compute and communication resources is increasingly important as both system size and heterogeneity increase. A similar mapping problem exists in gate-based quantum computing where the objective is to map tasks to gates in a topology-aware fashion. This is an NP-complete graph isomorphism problem, and existing task assignment approaches are either heuristic or based on physical optimization algorithms, providing different speed and solution quality trade-offs. Ising machines such as quantum and digital annealers have recently become available and offer an alternative hardware solution to solve this type of optimization problem. In this paper, we propose an algorithm that allows solving the topology-aware assignment problem using Ising machines. We demonstrate the algorithm on two use cases, i.e., classical task scheduling and quantum circuit gate scheduling. TIGER, a topology-aware task/gate assignment mapper tool, implements our proposed algorithms and automatically integrates them into the quantum software environment. To address the limitations of the physical solver, we propose and implement a domain-specific partition strategy that allows solving larger-scale problems and a weight optimization algorithm that allows tuning Ising model parameters to achieve better results. We use D-Wave's quantum annealer to demonstrate our algorithm and evaluate the proposed tool flow in terms of performance, partition efficiency, and solution quality. Results show significant speed-up compared to classical solutions, better scalability, and higher solution quality when using TIGER together with the proposed partition method. It reduces the data movement cost by 68% on average for quantum circuit assignment compared to the IBM QX optimizer.

Submitted 21 September, 2020; originally announced September 2020.

Comments: 15 pages, 10 figures

ACM Class: C.3; D.0; H.0

arXiv:1909.11719 [pdf, other]

doi 10.1109/ICRC2020.2020.00011

Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization

Authors: Anastasiia Butko, George Michelogiannakis, Samuel Williams, Costin Iancu, David Donofrio, John Shalf, Jonathan Carter, Irfan Siddiqi

Abstract: Continuing the scaling of quantum computers hinges on building classical control hardware pipelines that are scalable, extensible, and provide real-time response. The instruction set architecture (ISA) of the control processor provides functional abstractions that map high-level semantics of quantum programming languages to low-level pulse generation by hardware. In this paper, we provide a methodology to quantitatively assess the effectiveness of the ISA to encode quantum circuits for intermediate-scale quantum devices with O($10^2$) qubits. The characterization model that we define reflects performance, the ability to meet timing constraint implications, scalability for future quantum chips, and other important considerations making them useful guides for future designs. Using our methodology, we propose scalar (QUASAR) and vector (qV) quantum ISAs as extensions and compare them with other ISAs in metrics such as circuit encoding efficiency, the ability to meet real-time gate cycle requirements of quantum chips, and the ability to scale to more qubits.

Submitted 3 December, 2020; v1 submitted 25 September, 2019; originally announced September 2019.

Comments: 10 pages, 8 figures

Journal ref: IEEE 2020 International Conference on Rebooting Computing (ICRC)

arXiv:1709.04806 [pdf, ps, other]

TraceTracker: Hardware/Software Co-Evaluation for Large-Scale I/O Workload Reconstruction

Authors: Miryeong Kwon, Jie Zhang, Gyuyoung Park, Wonil Choi, David Donofrio, John Shalf, Mahmut Kandemir, Myoungsoo Jung

Abstract: Block traces are widely used for system studies, model verifications, and design analyses in both industry and academia. While such traces include detailed block access patterns, existing trace-driven research unfortunately often fails to find true-north due to a lack of runtime contexts such as user idle periods and system delays, which are fundamentally linked to the characteristics of target storage hardware. In this work, we propose TraceTracker, a novel hardware/software co-evaluation method that allows users to reuse a broad range of the existing block traces by keeping most of their execution contexts and user scenarios while adjusting them with new system information. Specifically, our TraceTracker's software evaluation model can infer CPU burst times and user idle periods from old storage traces, whereas its hardware evaluation method remasters the storage traces by interoperating the inferred time information, and updates all inter-arrival times by making them aware of the target storage system. We apply the proposed co-evaluation model to 577 traces, which were collected by servers from different institutions and locations a decade ago, and revive the traces on a high-performance flash-based storage array. The evaluation results reveal that the accuracy of the execution contexts reconstructed by TraceTracker is on average 99% and 96% with regard to the frequency of idle operations and the total idle periods, respectively.

Submitted 14 September, 2017; originally announced September 2017.

Comments: This paper is accepted by and will be published at 2017 IEEE International Symposium on Workload Characterization

arXiv:1705.06419 [pdf, ps, other]

doi 10.1109/LCA.2017.2750658

SimpleSSD: Modeling Solid State Drives for Holistic System Simulation

Authors: Myoungsoo Jung, Jie Zhang, Ahmed Abulila, Miryeong Kwon, Narges Shahidi, John Shalf, Nam Sung Kim, Mahmut Kandemir

Abstract: Existing solid state drive (SSD) simulators unfortunately lack hardware and/or software architecture models. Consequently, they are far from capturing the critical features of contemporary SSD devices. More importantly, while the performance of modern systems that adopt SSDs can vary based on their numerous internal design parameters and storage-level configurations, a full system simulation with traditional SSD models often requires unreasonably long runtimes and excessive computational resources. In this work, we propose SimpleSSD, a high-fidelity simulator that models all detailed characteristics of hardware and software, while simplifying the nondescript features of storage internals. In contrast to existing SSD simulators, SimpleSSD can easily be integrated into publicly-available full system simulators. In addition, it can accommodate a complete storage stack and evaluate the performance of SSDs along with diverse memory technologies and microarchitectures. Thus, it facilitates simulations that explore the full design space at different levels of system abstraction.

Submitted 14 September, 2017; v1 submitted 18 May, 2017; originally announced May 2017.

Comments: This paper has been accepted at IEEE Computer Architecture Letters (CAL)

arXiv:1604.03570 [pdf, ps, other]

BoxLib with Tiling: An AMR Software Framework

Authors: Weiqun Zhang, Ann Almgren, Marcus Day, Tan Nguyen, John Shalf, Didem Unat

Abstract: In this paper we introduce a block-structured adaptive mesh refinement (AMR) software framework that incorporates tiling, a well-known loop transformation. Because the multiscale, multiphysics codes built in BoxLib are designed to solve complex systems at high resolution, performance on current and next generation architectures is essential. With the expectation of many more cores per node on next generation architectures, the ability to effectively utilize threads within a node is essential, and the current model for parallelization will not be sufficient. We describe a new version of BoxLib in which the tiling constructs are embedded so that BoxLib-based applications can easily realize expected performance gains without extra effort on the part of the application developer. We also discuss a path forward to enable future versions of BoxLib to take advantage of NUMA-aware optimizations using the TiDA portable library.

Submitted 12 April, 2016; originally announced April 2016.

Comments: Accepted for publication in SIAM J. on Scientific Computing

MSC Class: 97N80

arXiv:0707.1607 [pdf, ps, other]

Cactus Framework: Black Holes to Gamma Ray Bursts

Authors: Erik Schnetter, Christian D. Ott, Gabrielle Allen, Peter Diener, Tom Goodale, Thomas Radke, Edward Seidel, John Shalf

Abstract: Gamma Ray Bursts (GRBs) are intense narrowly-beamed flashes of gamma-rays of cosmological origin. They are among the most scientifically interesting astrophysical systems, and the riddle concerning their central engines and emission mechanisms is one of the most complex and challenging problems of astrophysics today. In this article we outline our petascale approach to the GRB problem and discuss the computational toolkits and numerical codes that are currently in use and that will be scaled up to run on emerging petaflop scale computing platforms in the near future. Petascale computing will require additional ingredients over conventional parallelism. We consider some of the challenges which will be caused by future petascale architectures, and discuss our plans for the future development of the Cactus framework and its applications to meet these challenges in order to profit from these new architectures.

Submitted 11 July, 2007; originally announced July 2007.

Comments: 16 pages, 4 figures. To appear in Petascale Computing: Algorithms and Applications, Ed. D. Bader, CRC Press LLC (2007)

arXiv:cs/0108001 [pdf]

The Cactus Worm: Experiments with Dynamic Resource Discovery and Allocation in a Grid Environment

Authors: Gabrielle Allen, David Angulo, Ian Foster, Gerd Lanfermann, Chuang Liu, Thomas Radke, Ed Seidel, John Shalf

Abstract: The ability to harness heterogeneous, dynamically available "Grid" resources is attractive to typically resource-starved computational scientists and engineers, as in principle it can increase, by significant factors, the number of cycles that can be delivered to applications. However, new adaptive application structures and dynamic runtime system mechanisms are required if we are to operate effectively in Grid environments. In order to explore some of these issues in a practical setting, we are developing an experimental framework, called Cactus, that incorporates both adaptive application structures for dealing with changing resource characteristics and adaptive resource selection mechanisms that allow applications to change their resource allocations (e.g., via migration) when performance falls outside specified limits. We describe here the adaptive resource selection mechanisms and describe how they are used to achieve automatic application migration to "better" resources following performance degradation. Our results provide insights into the architectural structures required to support adaptive resource selection. In addition, we suggest that this "Cactus Worm" is an interesting challenge problem for Grid computing.

Submitted 1 August, 2001; originally announced August 2001.

Comments: 14 pages, 5 figures, to be published in International Journal of Supercomputing Applications

Report number: TR-2001-28

ACM Class: D.1.3

Showing 1–9 of 9 results for author: Shalf, J