Accelerating Variational Quantum Algorithms Using Circuit Concurrency

Authors: Salonik Resch, Anthony Gutierrez, Joon Suk Huh, Srikant Bharadwaj, Yasuko Eckert, Gabriel Loh, Mark Oskin, Swamit Tannu

Abstract: Variational quantum algorithms (VQAs) provide a promising approach to achieve quantum advantage in the noisy intermediate-scale quantum era. In this era, quantum computers experience high error rates and quantum error detection and correction is not feasible. VQAs can utilize noisy qubits in tandem with classical optimization algorithms to solve hard problems. However, VQAs are still slow relative to their classical counterparts. Hence, improving the performance of VQAs will be necessary to make them competitive. While VQAs are expected to perform better as the problem sizes increase, increasing their performance will make them a viable option sooner. In this work we show that circuit-level concurrency provides a means to increase the performance of variational quantum algorithms on noisy quantum computers. This involves mapping multiple instances of the same circuit (program) onto the quantum computer at the same time, which allows multiple samples in a variational quantum algorithm to be gathered in parallel for each training iteration. We demonstrate that this technique provides a linear increase in training speed when increasing the number of concurrently running quantum circuits. Furthermore, even with pessimistic error rates, concurrent quantum circuit sampling can speed up the quantum approximate optimization algorithm by up to 20x with low mapping and run time overhead.

Submitted 3 September, 2021; originally announced September 2021.

arXiv:1808.09651 [pdf, other]

Implications of Integrated CPU-GPU Processors on Thermal and Power Management Techniques

Authors: Kapil Dev, Indrani Paul, Wei Huang, Yasuko Eckert, Wayne Burleson, Sherief Reda

Abstract: Heterogeneous processors with architecturally different cores (CPU and GPU) integrated on the same die lead to new challenges and opportunities for thermal and power management techniques because of shared thermal/power budgets between these cores. In this paper, we show that new parallel programming paradigms (e.g., OpenCL) for CPU-GPU processors create a tighter coupling between the workload, the thermal/power management unit and the operating system. Using detailed thermal and power maps of the die from infrared imaging, we demonstrate that in contrast to traditional multi-core CPUs, heterogeneous processors exhibit higher coupled behavior for dynamic voltage and frequency scaling and workload scheduling, in terms of their effect on performance, power, and temperature. Further, we show that by taking the differences in core architectures and relative proximity of different computing cores on the die into consideration, better scheduling schemes could be implemented to reduce both the power density and peak temperature of the die. The findings presented in the paper can be used to improve thermal and power efficiency of heterogeneous CPU-GPU processors.

Submitted 29 August, 2018; originally announced August 2018.

Comments: 9 pages, 8 figures, 2 tables

arXiv:1710.09517 [pdf, other]

doi 10.1145/3232521

CODA: Enabling Co-location of Computation and Data for Near-Data Processing

Authors: Hyojong Kim, Ramyad Hadidi, Lifeng Nai, Hyesoon Kim, Nuwan Jayasena, Yasuko Eckert, Onur Kayiran, Gabriel H. Loh

Abstract: Recent studies have demonstrated that near-data processing (NDP) is an effective technique for improving performance and energy efficiency of data-intensive workloads. However, leveraging NDP in realistic systems with multiple memory modules introduces a new challenge. In today's systems, where no computation occurs in memory modules, the physical address space is interleaved at a fine granularity among all memory modules to help improve the utilization of processor-memory interfaces by distributing the memory traffic. However, this is at odds with efficient use of NDP, which requires careful placement of data in memory modules such that near-data computations and their exclusively used data can be localized in individual memory modules, while distributing shared data among memory modules to reduce hotspots. In order to address this new challenge, we propose a set of techniques that (1) enable collections of OS pages to either be fine-grain interleaved among memory modules (as is done today) or to be placed contiguously on individual memory modules (as is desirable for NDP private data), and (2) decide whether to localize or distribute each memory object based on its anticipated access pattern and steer computations to the memory where the data they access is located. Our evaluations across a wide range of workloads show that the proposed mechanism improves performance by 31% and reduces remote data accesses by 38% over a baseline system that cannot exploit compute-data affinity characteristics.

Submitted 25 October, 2017; originally announced October 2017.

Comments: 14 pages, 16 figures

Journal ref: ACM Transactions on Architecture and Code Optimization (TACO) Volume 15 Issue 3, October 2018 Article No. 32

arXiv:1602.00722 [pdf, other]

Enabling Efficient Dynamic Resizing of Large DRAM Caches via A Hardware Consistent Hashing Mechanism

Authors: Kevin K. Chang, Gabriel H. Loh, Mithuna Thottethodi, Yasuko Eckert, Mike O'Connor, Srilatha Manne, Lisa Hsu, Lavanya Subramanian, Onur Mutlu

Abstract: Die-stacked DRAM has been proposed for use as a large, high-bandwidth, last-level cache with hundreds or thousands of megabytes of capacity. Not all workloads (or phases) can productively utilize this much cache space, however. Unfortunately, the unused (or under-used) cache continues to consume power due to leakage in the peripheral circuitry and periodic DRAM refresh. Dynamically adjusting the available DRAM cache capacity could largely eliminate this energy overhead. However, the current proposed DRAM cache organization introduces new challenges for dynamic cache resizing. The organization differs from a conventional SRAM cache organization because it places entire cache sets and their tags within a single bank to reduce on-chip area and power overhead. Hence, resizing a DRAM cache requires remapping sets from the powered-down banks to active banks. In this paper, we propose CRUNCH (Cache Resizing Using Native Consistent Hashing), a hardware data remapping scheme inspired by consistent hashing, an algorithm originally proposed to uniformly and dynamically distribute Internet traffic across a changing population of web servers. CRUNCH provides a load-balanced remapping of data from the powered-down banks alone to the active banks, without requiring sets from all banks to be remapped, unlike naive schemes to achieve load balancing. CRUNCH remaps only sets from the powered-down banks, so it achieves this load balancing with low bank power-up/down transition latencies. CRUNCH's combination of good load balancing and low transition latencies provides a substrate to enable efficient DRAM cache resizing.

Submitted 1 February, 2016; originally announced February 2016.

Comments: 13 pages

Showing 1–4 of 4 results for author: Eckert, Y