-
Accelerating Deep Learning by Focusing on the Biggest Losers
Authors:
Angela H. Jiang,
Daniel L. -K. Wong,
Giulio Zhou,
David G. Andersen,
Jeffrey Dean,
Gregory R. Ganger,
Gauri Joshi,
Michael Kaminsky,
Michael Kozuch,
Zachary C. Lipton,
Padmanabhan Pillai
Abstract:
This paper introduces Selective-Backprop, a technique that accelerates the training of deep neural networks (DNNs) by prioritizing examples with high loss at each iteration. Selective-Backprop uses the output of a training example's forward pass to decide whether to use that example to compute gradients and update parameters, or to skip immediately to the next example. By reducing the number of computationally expensive backpropagation steps performed, Selective-Backprop accelerates training. Evaluation on CIFAR10, CIFAR100, and SVHN, across a variety of modern image models, shows that Selective-Backprop converges to target error rates up to 3.5x faster than standard SGD and between 1.02x and 1.8x faster than a state-of-the-art importance sampling approach. Further acceleration of 26% can be achieved by using stale forward pass results for selection, thus also skipping forward passes of low-priority examples.
Submitted 1 October, 2019;
originally announced October 2019.
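The selection step described in the Selective-Backprop abstract can be illustrated with a minimal sketch. The abstract only says that the forward-pass output decides whether an example is used for backpropagation; the specific rule below (selection probability equal to the loss's percentile among recent losses, raised to a power `beta`) is an assumption made for illustration, as are the names `selection_probability`, `selective_backprop_step`, and `ToyModel`.

```python
import random
from collections import deque


def selection_probability(loss, recent_losses, beta=2.0):
    """Assumed rule: (fraction of recent losses <= this loss) ** beta."""
    if not recent_losses:
        return 1.0  # no history yet, so keep the example
    rank = sum(1 for r in recent_losses if r <= loss)
    return (rank / len(recent_losses)) ** beta


def selective_backprop_step(examples, forward, backward, recent_losses, rng, beta=2.0):
    """Forward every example; run the backward pass only on selected ones."""
    selected = []
    for ex in examples:
        loss = forward(ex)          # cheap step, always performed
        recent_losses.append(loss)  # bounded history of recent losses
        if rng.random() < selection_probability(loss, recent_losses, beta):
            backward(ex)            # expensive step, skipped for low-loss examples
            selected.append(ex)
    return selected


class ToyModel:
    """Hypothetical 1-D linear model y = w * x with squared-error loss."""

    def __init__(self, lr=0.05):
        self.w = 0.0
        self.lr = lr
        self.backward_calls = 0

    def forward(self, ex):
        x, y = ex
        return (self.w * x - y) ** 2

    def backward(self, ex):
        x, y = ex
        grad = 2 * (self.w * x - y) * x  # d(loss)/dw
        self.w -= self.lr * grad
        self.backward_calls += 1


model = ToyModel()
history = deque(maxlen=100)
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
chosen = selective_backprop_step(batch, model.forward, model.backward,
                                 history, random.Random(0))
```

With a larger `beta`, low-loss examples are skipped more aggressively, which trades fewer backward passes against a more biased view of the training set.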
-
RowClone: Accelerating Data Movement and Initialization Using DRAM
Authors:
Vivek Seshadri,
Yoongu Kim,
Chris Fallin,
Donghyuk Lee,
Rachata Ausavarungnirun,
Gennady Pekhimenko,
Yixin Luo,
Onur Mutlu,
Phillip B. Gibbons,
Michael A. Kozuch,
Todd C. Mowry
Abstract:
In existing systems, to perform any bulk data movement operation (copy or initialization), the data must first be read into the on-chip processor, all the way into the L1 cache, and the result of the operation must be written back to main memory. This is despite the fact that these operations do not involve any actual computation. RowClone exploits the organization and operation of commodity DRAM to perform these operations completely inside DRAM using two mechanisms. The first mechanism, Fast Parallel Mode, copies data between two rows inside the same DRAM subarray by issuing back-to-back activate commands to the source and the destination row. The second mechanism, Pipelined Serial Mode, transfers cache lines between two banks using the shared internal bus. RowClone significantly reduces the raw latency and energy consumption of bulk data copy and initialization. This reduction directly translates to improved performance and energy efficiency for systems running copy- or initialization-intensive workloads.
Submitted 7 May, 2018;
originally announced May 2018.
-
Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM
Authors:
Vivek Seshadri,
Donghyuk Lee,
Thomas Mullins,
Hasan Hassan,
Amirali Boroumand,
Jeremie Kim,
Michael A. Kozuch,
Onur Mutlu,
Phillip B. Gibbons,
Todd C. Mowry
Abstract:
Bitwise operations are an important component of modern-day programming. Many widely used data structures (e.g., bitmap indices in databases) rely on fast bitwise operations on large bit vectors to achieve high performance. Unfortunately, in existing systems, regardless of the underlying architecture (e.g., CPU, GPU, FPGA), the throughput of such bulk bitwise operations is limited by the available memory bandwidth.
We propose Buddy, a new mechanism that exploits the analog operation of DRAM to perform bulk bitwise operations completely inside the DRAM chip. Buddy consists of two components. First, simultaneous activation of three DRAM rows that are connected to the same set of sense amplifiers enables us to perform bitwise AND and OR operations. Second, the inverters present in each sense amplifier enable us to perform bitwise NOT operations, with modest changes to the DRAM array. These two components make Buddy functionally complete. Our implementation of Buddy largely exploits the existing DRAM structure and interface, and incurs low overhead (1% of DRAM chip area).
Our evaluations based on SPICE simulations show that, across seven commonly used bitwise operations, Buddy provides a 10.9X to 25.6X improvement in raw throughput and a 25.1X to 59.5X reduction in energy consumption. We evaluate three real-world data-intensive applications that exploit bitwise operations: 1) bitmap indices, 2) BitWeaving, and 3) bitvector-based implementation of sets. Our evaluations show that Buddy significantly outperforms the state-of-the-art.
Submitted 29 November, 2016;
originally announced November 2016.
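The two components named in the Buddy abstract can be modeled in software with a short sketch. It assumes that activating three rows together leaves each bit line at the majority value of the three cells, with a third "control" row of all zeros or all ones choosing between AND and OR; the abstract does not spell this out, so treat it as an assumption. Rows are modeled as Python integers, and all function names are hypothetical.

```python
def triple_row_activate(a, b, c):
    """Bitwise majority of three equal-width rows (assumed analog behavior)."""
    return (a & b) | (b & c) | (a & c)


def buddy_and(a, b):
    """AND: control row of all zeros forces majority(a, b, 0) = a & b."""
    return triple_row_activate(a, b, 0)


def buddy_or(a, b, width):
    """OR: control row of all ones forces majority(a, b, 1) = a | b."""
    all_ones = (1 << width) - 1
    return triple_row_activate(a, b, all_ones)


def buddy_not(a, width):
    """NOT: models the sense-amplifier inverter, limited to the row width."""
    return ~a & ((1 << width) - 1)


def buddy_nand(a, b, width):
    """NAND built from AND and NOT, which is enough for functional completeness."""
    return buddy_not(buddy_and(a, b), width)


row_a, row_b = 0b1100, 0b1010
result_and = buddy_and(row_a, row_b)
result_or = buddy_or(row_a, row_b, width=4)
result_not = buddy_not(row_a, width=4)
```

Because NAND alone is universal, having AND, OR, and NOT available in-array is what the abstract's "functionally complete" claim rests on.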